Message boards : ATLAS application : ATLAS vbox v2.02
Joined: 7 May 17 Posts: 6 Credit: 695,132 RAC: 0
These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494
We need an ATLAS stop for this: a CVMFS connect problem! Are there other users with so many CVMFS connect problems? CP, thank you. You have no proxy. Since yesterday the two Threadrippers have also had no proxy. Testing again once a CentOS9-Stream VM with Squid 5.5 is possible. All is well!
Joined: 14 Jan 10 Posts: 1426 Credit: 9,492,631 RAC: 860
entity wrote:
These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
Is this one of your error tasks?
https://lhcathome.cern.ch/lhcathome/result.php?resultid=362788360
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
entity wrote:
These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
Is this one of your error tasks?

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10802615

On computers like that a race condition may happen if many vbox tasks start concurrently. This is caused by a double workaround required to solve a vbox issue and (very likely) a vbox bug on top of that issue. See:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=578&postid=7708

The vbox developers have refused to correct the issue for years:
"... we would therefore possibly need to bump the global config version. We don't want to do that though because that might make downgrading to pre-4.0 impossible."

What to do?

Option 1: The computer in question is running Linux. Hence, ATLAS native may be used instead of ATLAS vbox.

Option 2: If ATLAS vbox is a must, ensure that at least the first ATLAS task of a fresh series starts a few seconds before all others. This task will prepare the disk entry in vbox for all other tasks. BOINC does not support such a staggered startup sequence out of the box, so it has to be ensured by a self-made script (see the sketch below).
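A minimal sketch of such a script, assuming the affected ATLAS tasks have been suspended beforehand and that their names can be matched with a simple "ATLAS" pattern (adjust the filter, project URL and delay to your setup); it resumes them one at a time so the first VM can register the multiattach parent disk before the others start:

#!/bin/bash
# Rough sketch, not an official BOINC feature: resume suspended ATLAS vbox tasks
# one at a time so the first VM registers the parent disk before the rest start.
PROJECT_URL="https://lhcathome.cern.ch/lhcathome/"
DELAY=30   # seconds between task starts, adjust as needed

boinccmd --get_tasks \
  | awk '$1 == "name:" && $2 ~ /ATLAS/ {print $2}' \
  | while read -r TASK; do
        boinccmd --task "$PROJECT_URL" "$TASK" resume   # start this task now
        sleep "$DELAY"                                   # give it a head start
    done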
Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494
Testing again once a CentOS9-Stream VM with Squid 5.5 is possible.
This is a new test with Squid 5.5 on a CentOS9-Stream VM:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=362998236
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
This Squid is running inside a Linux VM on a Windows host, right? Hence, a couple of facts have to be taken into account:

1. Is the TCP stack of the Windows host able to deal with the expected huge number of concurrent connections? See:
https://support.solarwinds.com/SuccessCenter/s/article/NETSTAT-A-command-displays-too-many-TCP-IP-connections?language=en_US

2. Is VirtualBox's network driver able to handle the number of concurrent TCP connections (in both directions)? To avoid this kind of error in a heavy-load environment it is recommended not to run Squid on a VM.

3. Is the TCP stack of the Linux VM able to deal with the expected huge number of concurrent connections? (A few illustrative checks are sketched below.)

4. Squid 5.x is still not certified by CERN/Fermilab. There might be issues regarding uploads of huge files.
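For item 3, a few illustrative commands to run inside the Linux VM; this is only a sketch of what to look at, and which limits actually matter depends on the Squid setup:

sysctl net.ipv4.ip_local_port_range   # ephemeral port range for outgoing connections
sysctl net.core.somaxconn             # accept-queue limit for listening sockets such as Squid's
ulimit -n                             # per-process open file descriptor limit (each TCP connection needs one)
ss -s                                 # summary of the TCP connections currently in use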
Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494
The CentOS9-Stream VM is only installed to test Squid 5.5, nothing else. The CentOS8-Stream VM runs Squid 4.15, which worked with no problems until this multiattach ATLAS issue. So I have disconnected that Squid 4.15 from the two Threadrippers. I am currently searching for an idea of where this problem with ATLAS multiattach comes from. Theory and CMS have not yet been transferred to multiattach in production. I know well that Squid 5.5 is experimental. Each update for CentOS9-Stream (one this morning) moves things further along.
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
Your CentOS8 is also a VM, right? Hence, (1.), (2.) and (3.) from my previous post also apply to this VM.

You always write that you test (and sometimes fail at) higher levels (e.g. Squid), but you never mention whether you adjusted the basics (e.g. the number of network connections on the host).

MultiAttach is an attribute that affects the way vbox uses the disk images. It has nothing to do with the network setup a VM does while it boots; the latter is configured long after the disks are set up.
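For illustration only (the VM name, controller name and path are placeholders, not necessarily what vboxwrapper uses): attaching a disk in multiattach mode is purely a storage operation, roughly like

VBoxManage storageattach "ATLAS_example_VM" \
    --storagectl "Hard Disk Controller" --port 0 --device 0 \
    --type hdd --mtype multiattach \
    --medium /path/to/ATLAS_vbox_2.02_image.vdi

whereas the VM's network adapters are configured with separate commands (e.g. VBoxManage modifyvm ... --nic1 nat), which is why the two are unrelated.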
Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494
I know this about multiattach. You can check my configuration if you want.
Joined: 7 May 17 Posts: 6 Credit: 695,132 RAC: 0
entity wrote:
These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
Is this one of your error tasks?

To provide answers to the posted questions:

1. Yes, that looks like one of the error tasks.

2. When the problem first occurred, at least 5 ATLAS tasks were trying to start at the same time. This hasn't been a problem in the past, but I will try to prevent it in the future. BTW, I rebooted the machine and then tried to start one ATLAS task: same error. Just in case it makes a difference, there were 30 Theory tasks, 8 CMS tasks, and about 20 SixTrack tasks running at the same time. Would CMS or Theory have any bearing on this problem?

3. I have considered native, but I'm in a temporarily reduced computing state at the moment before moving to a new location. After the move I may try the native approach; until then I'm kind of stuck with VBox.

Thanks for the responses.
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
entity wrote:
2. When the problem first occurred, at least 5 ATLAS tasks were trying to start at the same time. This hasn't been a problem in the past, but I will try to prevent it in the future. BTW, I rebooted the machine and then tried to start one ATLAS task: same error. Just in case it makes a difference, there were 30 Theory tasks, 8 CMS tasks, and about 20 SixTrack tasks running at the same time. Would CMS or Theory have any bearing on this problem?

From my own experience, 5 concurrently starting ATLAS tasks shouldn't be a problem on a computer like that (but you wrote "at least"). SixTrack doesn't need vbox, and CMS and Theory have not yet switched to 'multiattach'.

entity wrote:
I rebooted the machine and then tried to start one ATLAS task.

This appears to be weird. Can you provide a link to that task log?

You may set the client to 'no new tasks', stop all ATLAS tasks not yet running and ensure no ATLAS task is in progress. Then remove all ATLAS related disk entries from the VirtualBox Media Manager (keep only the parent disk file). Then restart one ATLAS task and, if this succeeds, start the others. At the end resume work fetch. (A possible command sequence is sketched below.)
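A possible command sequence for those steps, run as the user that owns the BOINC client; the UUID is a placeholder taken from the 'list hdds' output, and this is only a headless alternative to the Media Manager GUI, not an official recipe:

boinccmd --project https://lhcathome.cern.ch/lhcathome/ nomorework   # stop work fetch
# suspend or finish the remaining ATLAS tasks and wait until none is in progress, then:
VBoxManage list hdds                    # shows all registered disks, incl. the ATLAS child disks
VBoxManage closemedium disk <UUID>      # repeat for each ATLAS child entry, keep the parent vdi
# start one ATLAS task, let it register the disk, then start the others and finally:
boinccmd --project https://lhcathome.cern.ch/lhcathome/ allowmorework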
Joined: 7 May 17 Posts: 6 Credit: 695,132 RAC: 0
Only 5 were trying to start, as that is what I have set in app_config as the max concurrent. Unfortunately, I can't provide the link, as that task was run under an account that I can't log on to. No ATLAS tasks have been attempted since that one, so it would show up under the ce6931730 ID as the last ATLAS task returned.

Is the VBox Media Manager a GUI tool? If so, it isn't available to me on this server, as there is no GUI interface (no desktop) installed. Is there a CLI tool available that does the same thing?

We may be thinking the same thing: that there might be something amiss in the VBox config. That was the reason for the reboot yesterday.
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
I thought I should mention that not only is Hyper-V incompatible with VirtualBox, but apparently so is WSL2. I have WSL2 installed on my Win10 machine and am running BOINC 7.16.6 just fine under Ubuntu 20.04.4. I also have VirtualBox 6.1.36 installed on the Windows side, along with BOINC 7.20.2, where VirtualBox shows up properly in the log file. But LHC does not see it (neither does Rosetta, for that matter):
"Virtualbox (6.1.36) installed, CPU does not have hardware virtualization support"
I was able to run VBox just fine before enabling WSL2. So you choose one or the other.
Joined: 7 May 17 Posts: 6 Credit: 695,132 RAC: 0
entity wrote:
Only 5 were trying to start, as that is what I have set in app_config as the max concurrent.

I think I may have found the problem. Looking at the VirtualBox config XML files, I can see an ATLAS medium entry with a filename pointing to a slot that doesn't exist. I think I may be able to fix this with the vboxmanage CLI.
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
Probably the most reliable way might be to wait until all running VMs for the affected user account are finished and automatically deregistered.

If the assumption is correct that you only run BOINC VMs (no additional self-created ones), then there shouldn't be any '*.vbox' files left (below BOINC's 'slots' directory).

If no '*.vbox' files are present, there's just one file left to check: 'VirtualBox.xml', usually located in the user's home:
~/.config/VirtualBox/VirtualBox.xml

Edit this file, locate the 'MediaRegistry' and remove the entry (the complete line) for the ATLAS vdi file. It looks like this:

<MediaRegistry>
  <HardDisks>
    <HardDisk uuid="{f888c51e-7603-4495-8794-fd67809dc4e8}" location="/path/to/ATLAS_vbox_2.02_image.vdi" format="VDI" type="Normal"/>
  </HardDisks>
</MediaRegistry>

It's also fine if no MediaRegistry exists.

Then start one fresh ATLAS task. VirtualBox should now write the MediaRegistry to the freshly created '*.vbox' file (slots dir). Those are nearly the same steps vboxwrapper should do automatically.
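If you prefer to do this from the command line, a rough sketch (only while no VirtualBox process is running, otherwise the file may be rewritten behind your back; the pattern simply matches the vdi file name shown above):

grep -n "ATLAS_vbox" ~/.config/VirtualBox/VirtualBox.xml                          # inspect the entry first
sed -i.bak '/ATLAS_vbox_2.02_image.vdi/d' ~/.config/VirtualBox/VirtualBox.xml     # drop the line, keep a .bak copy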
Joined: 7 May 17 Posts: 6 Credit: 695,132 RAC: 0
It was an orphaned snapshot file located under the parent ATLAS vdi file. It had its own UUID assigned to it and was marked as inaccessible. I used the vboxmanage closemedium <UUID> command and it disappeared. Now the only thing left is the parent ATLAS vdi file. Should that parent be removed also?

Update: I had to remove the parent, as the snapshot came back after the closemedium command was issued. Once the parent was closed using the closemedium command, the MediaRegistry in the VirtualBox.xml file disappeared as well. Hopefully ATLAS is clean now.
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
It's not a must, but as long as the parent vdi file is not attached to any VM you can safely close it using "vboxmanage closemedium <parent_file|UUID>" again. It will be re-registered automatically by the next starting VM. You should only remove the file itself if you suspect it's corrupt.
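For reference, a possible invocation; the path to the parent vdi in BOINC's projects directory is just an example, and the UUID shown by 'list hdds' works equally well:

VBoxManage list hdds        # confirm the parent is registered but not attached to any VM
VBoxManage closemedium disk /path/to/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_2.02_image.vdi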
Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494
For the CVMFS probe in vbox v2.02: what are the names of the CERN servers?
Joined: 14 Jan 10 Posts: 1426 Credit: 9,492,631 RAC: 860
I can't explain this computation error:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=363119537
Everything was clean in VirtualBox. The result log says "Could not add storage controller(s) to VM", and because of that it of course could not start a VM. An unknown option --sataportcount is mentioned, although the command given should use --portcount 3. Where does the 'sata' come from?
I tried a new ATLAS task without doing anything myself and that one is now running without a problem. An Easter/Pentecost case?
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
Very strange, see:
https://github.com/BOINC/boinc/blob/76dffd0dda2ee7a7f881e3dd08c35edec497a504/samples/vboxwrapper/vbox_vboxmanage.cpp#L449-L455

For some unknown reason the function "is_virtualbox_version_newer(4, 3, 0)" reported that your vbox version is not newer than 4.3.0, although stderr.txt shows it is 6.1.36. Hence the wrong option "sataportcount" instead of "portcount".

BTW: VirtualBox sometimes allows command option names from older versions to be used even in newer versions; sometimes (like here) it doesn't.
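For illustration, the two command variants the wrapper can emit look roughly like this (the VM and controller names are placeholders, not the exact strings vboxwrapper uses):

VBoxManage storagectl "<vm_name>" --name "Hard Disk Controller" --add sata --portcount 3       # modern form, accepted by 6.1.36
VBoxManage storagectl "<vm_name>" --name "Hard Disk Controller" --add sata --sataportcount 3   # legacy pre-4.3 form, rejected with the "unknown option" error seen in the log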