Message boards : ATLAS application : ATLAS vbox v2.02
Joined: 7 May 17 Posts: 6 Credit: 695,132 RAC: 0
These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494
We need an ATLAS stop for this: a CVMFS connect problem! Are there other users with so many CVMFS connect problems? CP, thank you. You have no proxy. Since yesterday the two Threadrippers have also had no proxy. Testing again once a CentOS9-Stream VM with Squid 5.5 is possible. All is well!
Joined: 14 Jan 10 Posts: 1426 Credit: 9,492,631 RAC: 860
entity wrote:
These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
Is this one of your error tasks?
https://lhcathome.cern.ch/lhcathome/result.php?resultid=362788360
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
entity wrote:
These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
Is this one of your error tasks?

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10802615

On computers like that a race condition may happen if many vbox tasks start concurrently. This is caused by a double workaround required to solve a vbox issue and (very likely) a vbox bug on top of that issue. See:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=578&postid=7708

The vbox developers have refused to correct the issue for years:
"... we would therefore possibly need to bump the global config version. We don't want to do that though because that might make downgrading to pre-4.0 impossible."

What to do?

Option 1: The computer in question is running Linux. Hence, ATLAS native may be used instead of ATLAS vbox.

Option 2: If ATLAS vbox is a must, ensure that at least the first ATLAS task of a fresh series starts a few seconds before all others. This task will prepare the disk entry in vbox for all other tasks. BOINC does not support such a staggered startup sequence out of the box, so it has to be ensured by a self-made script (see the sketch below).
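A minimal sketch of such a script, assuming the affected ATLAS tasks have been suspended beforehand and that their names can be matched with a simple "ATLAS" pattern (adjust the filter, project URL and delay to your setup); it resumes them one at a time so the first VM can register the multiattach parent disk before the others start:

#!/bin/bash
# Rough sketch, not an official BOINC feature: resume suspended ATLAS vbox tasks
# one at a time so the first VM registers the parent disk before the rest start.
PROJECT_URL="https://lhcathome.cern.ch/lhcathome/"
DELAY=30   # seconds between task starts, adjust as needed

boinccmd --get_tasks \
  | awk '$1 == "name:" && $2 ~ /ATLAS/ {print $2}' \
  | while read -r TASK; do
        boinccmd --task "$PROJECT_URL" "$TASK" resume   # start this task now
        sleep "$DELAY"                                   # give it a head start
    done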
Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494
Testing again once a CentOS9-Stream VM with Squid 5.5 is possible.
This is a new test with Squid 5.5 on a CentOS9-Stream VM:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=362998236
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
This Squid is running inside a Linux VM on a Windows host, right? Hence, a couple of facts have to be taken into account:

1. Is the TCP stack of the Windows host able to deal with the expected huge number of concurrent connections? See:
https://support.solarwinds.com/SuccessCenter/s/article/NETSTAT-A-command-displays-too-many-TCP-IP-connections?language=en_US

2. Is VirtualBox's network driver able to handle the number of concurrent TCP connections (in both directions)? To avoid this kind of error in a heavy-load environment it is recommended not to run Squid on a VM.

3. Is the TCP stack of the Linux VM able to deal with the expected huge number of concurrent connections? (A few illustrative checks are sketched below.)

4. Squid 5.x is still not certified by CERN/Fermilab. There might be issues regarding uploads of huge files.
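For item 3, a few illustrative commands to run inside the Linux VM; this is only a sketch of what to look at, and which limits actually matter depends on the Squid setup:

sysctl net.ipv4.ip_local_port_range   # ephemeral port range for outgoing connections
sysctl net.core.somaxconn             # accept-queue limit for listening sockets such as Squid's
ulimit -n                             # per-process open file descriptor limit (each TCP connection needs one)
ss -s                                 # summary of the TCP connections currently in use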
Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494
The CentOS9-Stream VM is only installed to test Squid 5.5, nothing else. The CentOS8-Stream VM runs Squid 4.15, which worked with no problems until this multiattach ATLAS issue. So I have disconnected that Squid 4.15 from the two Threadrippers. I am currently searching for an idea of where this problem with ATLAS multiattach comes from. Theory and CMS have not yet been transferred to multiattach in production. I know well that Squid 5.5 is experimental. Each update for CentOS9-Stream (one this morning) moves things further along.
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
Your CentOS8 is also a VM, right? Hence, (1.), (2.) and (3.) from my previous post also apply to this VM.

You always write that you test (and sometimes fail at) higher levels (e.g. Squid), but you never mention whether you adjusted the basics (e.g. the number of network connections on the host).

MultiAttach is an attribute that affects the way vbox uses the disk images. It has nothing to do with the network setup a VM does while it boots; the latter is configured long after the disks are set up.
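For illustration only (the VM name, controller name and path are placeholders, not necessarily what vboxwrapper uses): attaching a disk in multiattach mode is purely a storage operation, roughly like

VBoxManage storageattach "ATLAS_example_VM" \
    --storagectl "Hard Disk Controller" --port 0 --device 0 \
    --type hdd --mtype multiattach \
    --medium /path/to/ATLAS_vbox_2.02_image.vdi

whereas the VM's network adapters are configured with separate commands (e.g. VBoxManage modifyvm ... --nic1 nat), which is why the two are unrelated.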
Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494
I know this about multiattach. You can check my configuration if you want.
Joined: 7 May 17 Posts: 6 Credit: 695,132 RAC: 0
entity wrote:
These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
Is this one of your error tasks?

To provide answers to the posted questions:

1. Yes, that looks like one of the error tasks.

2. When the problem first occurred, at least 5 ATLAS tasks were trying to start at the same time. This hasn't been a problem in the past, but I will try to prevent it in the future. BTW, I rebooted the machine and then tried to start one ATLAS task: same error. Just in case it makes a difference, there were 30 Theory tasks, 8 CMS tasks, and about 20 SixTrack tasks running at the same time. Would CMS or Theory have any bearing on this problem?

3. I have considered native, but I'm in a temporarily reduced computing state at the moment before moving to a new location. After the move I may try the native approach; until then I'm kind of stuck with VBox.

Thanks for the responses.
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
entity wrote:
2. When the problem first occurred, at least 5 ATLAS tasks were trying to start at the same time. This hasn't been a problem in the past, but I will try to prevent it in the future. BTW, I rebooted the machine and then tried to start one ATLAS task: same error. Just in case it makes a difference, there were 30 Theory tasks, 8 CMS tasks, and about 20 SixTrack tasks running at the same time. Would CMS or Theory have any bearing on this problem?

From my own experience, 5 concurrently starting ATLAS tasks shouldn't be a problem on a computer like that (but you wrote "at least"). SixTrack doesn't need vbox, and CMS and Theory have not yet switched to 'multiattach'.

entity wrote:
I rebooted the machine and then tried to start one ATLAS task.

This appears to be weird. Can you provide a link to that task log?

You may set the client to 'no new tasks', stop all ATLAS tasks not yet running and ensure no ATLAS task is in progress. Then remove all ATLAS related disk entries from the VirtualBox Media Manager (keep only the parent disk file). Then restart one ATLAS task and, if this succeeds, start the others. At the end resume work fetch. (A possible command sequence is sketched below.)
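A possible command sequence for those steps, run as the user that owns the BOINC client; the UUID is a placeholder taken from the 'list hdds' output, and this is only a headless alternative to the Media Manager GUI, not an official recipe:

boinccmd --project https://lhcathome.cern.ch/lhcathome/ nomorework   # stop work fetch
# suspend or finish the remaining ATLAS tasks and wait until none is in progress, then:
VBoxManage list hdds                    # shows all registered disks, incl. the ATLAS child disks
VBoxManage closemedium disk <UUID>      # repeat for each ATLAS child entry, keep the parent vdi
# start one ATLAS task, let it register the disk, then start the others and finally:
boinccmd --project https://lhcathome.cern.ch/lhcathome/ allowmorework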
Joined: 7 May 17 Posts: 6 Credit: 695,132 RAC: 0
Only 5 were trying to start, as that is what I have set in app_config as the max concurrent. Unfortunately, I can't provide the link, as that task was run under an account that I can't log on to. No ATLAS tasks have been attempted since that one, so it would show up under the ce6931730 ID as the last ATLAS task returned.

Is the VBox Media Manager a GUI tool? If so, it isn't available to me on this server, as there is no GUI interface (no desktop) installed. Is there a CLI tool available that does the same thing?

We may be thinking the same thing: that there might be something amiss in the VBox config. That was the reason for the reboot yesterday.
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
I thought I should mention that not only is Hyper-V incompatible with VirtualBox, but apparently so is WSL2. I have WSL2 installed on my Win10 machine and am running BOINC 7.16.6 just fine under Ubuntu 20.04.4. I also have VirtualBox 6.1.36 installed on the Windows side, along with BOINC 7.20.2, where VirtualBox shows up properly in the log file. But LHC does not see it (neither does Rosetta, for that matter):
"Virtualbox (6.1.36) installed, CPU does not have hardware virtualization support"
I was able to run VBox just fine before enabling WSL2. So you choose one or the other.
Joined: 7 May 17 Posts: 6 Credit: 695,132 RAC: 0
entity wrote:
Only 5 were trying to start, as that is what I have set in app_config as the max concurrent.

I think I may have found the problem. Looking at the VirtualBox config XML files, I can see an ATLAS medium entry with a filename pointing to a slot that doesn't exist. I think I may be able to fix this with the vboxmanage CLI.
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
Probably the most reliable way might be to wait until all running VMs for the affected user account are finished and automatically deregistered.

If the assumption is correct that you only run BOINC VMs (no additional self-created ones), then there shouldn't be any '*.vbox' files left (below BOINC's 'slots' directory).

If no '*.vbox' files are present, there's just one file left to check: 'VirtualBox.xml', usually located in the user's home:
~/.config/VirtualBox/VirtualBox.xml

Edit this file, locate the 'MediaRegistry' and remove the entry (the complete line) for the ATLAS vdi file. It looks like this:

<MediaRegistry>
  <HardDisks>
    <HardDisk uuid="{f888c51e-7603-4495-8794-fd67809dc4e8}" location="/path/to/ATLAS_vbox_2.02_image.vdi" format="VDI" type="Normal"/>
  </HardDisks>
</MediaRegistry>

It's also fine if no MediaRegistry exists.

Then start one fresh ATLAS task. VirtualBox should now write the MediaRegistry to the freshly created '*.vbox' file (slots dir). Those are nearly the same steps vboxwrapper should do automatically.
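If you prefer to do this from the command line, a rough sketch (only while no VirtualBox process is running, otherwise the file may be rewritten behind your back; the pattern simply matches the vdi file name shown above):

grep -n "ATLAS_vbox" ~/.config/VirtualBox/VirtualBox.xml                          # inspect the entry first
sed -i.bak '/ATLAS_vbox_2.02_image.vdi/d' ~/.config/VirtualBox/VirtualBox.xml     # drop the line, keep a .bak copy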
Joined: 7 May 17 Posts: 6 Credit: 695,132 RAC: 0
It was an orphaned snapshot file located under the parent ATLAS vdi file. It had its own UUID assigned to it and was marked as inaccessible. I used the vboxmanage closemedium <UUID> command and it disappeared. Now the only thing left is the parent ATLAS vdi file. Should that parent be removed also?

Update: I had to remove the parent, as the snapshot came back after the closemedium command was issued. Once the parent was closed using the closemedium command, the MediaRegistry in the VirtualBox.xml file disappeared as well. Hopefully ATLAS is clean now.
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
It's not a must, but as long as the parent vdi file is not attached to any VM you can safely close it using "vboxmanage closemedium <parent_file|UUID>" again. It will be re-registered automatically by the next starting VM. You should only remove the file itself if you suspect it's corrupt.
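For reference, a possible invocation; the path to the parent vdi in BOINC's projects directory is just an example, and the UUID shown by 'list hdds' works equally well:

VBoxManage list hdds        # confirm the parent is registered but not attached to any VM
VBoxManage closemedium disk /path/to/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_2.02_image.vdi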
Joined: 2 May 07 Posts: 2244 Credit: 173,988,818 RAC: 7,494
For the CVMFS probe in vbox v2.02: what are the names of the CERN servers?
Joined: 14 Jan 10 Posts: 1426 Credit: 9,492,631 RAC: 860
I can't explain this computation error:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=363119537
Everything was clean in VirtualBox. The result log says "Could not add storage controller(s) to VM", and because of that it of course could not start a VM. An unknown option --sataportcount is mentioned, although the command given should use --portcount 3. Where does the 'sata' come from?
I tried a new ATLAS task without doing anything myself and that one is now running without a problem. An Easter/Pentecost case?
Joined: 15 Jun 08 Posts: 2549 Credit: 255,263,337 RAC: 56,858
Very strange, see:
https://github.com/BOINC/boinc/blob/76dffd0dda2ee7a7f881e3dd08c35edec497a504/samples/vboxwrapper/vbox_vboxmanage.cpp#L449-L455

For some unknown reason the function "is_virtualbox_version_newer(4, 3, 0)" reported that your vbox version is not newer than 4.3.0, although stderr.txt shows it is 6.1.36. Hence the wrong option "sataportcount" instead of "portcount".

BTW: VirtualBox sometimes allows command option names from older versions to be used even in newer versions; sometimes (like here) it doesn't.
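For illustration, the two command variants the wrapper can emit look roughly like this (the VM and controller names are placeholders, not the exact strings vboxwrapper uses):

VBoxManage storagectl "<vm_name>" --name "Hard Disk Controller" --add sata --portcount 3       # modern form, accepted by 6.1.36
VBoxManage storagectl "<vm_name>" --name "Hard Disk Controller" --add sata --sataportcount 3   # legacy pre-4.3 form, rejected with the "unknown option" error seen in the log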