Message boards : ATLAS application : ATLAS vbox v2.02
entity

Joined: 7 May 17
Posts: 6
Credit: 695,132
RAC: 0
Message 47117 - Posted: 9 Aug 2022, 19:32:13 UTC - in response to Message 47116.  

These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
ID: 47117
maeax

Joined: 2 May 07
Posts: 2100
Credit: 159,816,975
RAC: 134,993
Message 47118 - Posted: 10 Aug 2022, 4:11:30 UTC - in response to Message 47112.  
Last modified: 10 Aug 2022, 4:32:02 UTC

We need an ATLAS stop for this. A CVMFS connect problem!
Are there other users with so many CVMFS connect problems?
I don't have that many ATLAS tasks running, but none have failed on my side.
All CVMFS response times here are between 3 and at most 8 seconds.
I did not view all your results, but in your valid tasks the response times are between 3 and 81 seconds.
Maybe there is a limit somewhere (90 sec.?) to get a response; otherwise you never get one, or it is rejected by the network because it arrives too late?
To me it seems to be a network issue on your side or on CERN's side. If it were on CERN's side (max # of connections?), more users would suffer from this.

CP, thank you. You have no proxy. Since yesterday those two Threadrippers also have no proxy.
I will test again once a CentOS9-Stream VM running Squid 5.5 is possible.
All is well!
ID: 47118
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,491,903
RAC: 2,069
Message 47119 - Posted: 10 Aug 2022, 5:49:59 UTC - in response to Message 47117.  

entity wrote:
These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
Is this one of your error tasks?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=362788360
ID: 47119
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2411
Credit: 226,399,553
RAC: 131,759
Message 47120 - Posted: 10 Aug 2022, 7:20:15 UTC - in response to Message 47119.  

entity wrote:
These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
Is this one of your error tasks?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=362788360

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10802615

On computers like that a race condition may happen if many vbox tasks start concurrently.
This is caused by a double workaround required to solve a vbox issue and (very likely) a vbox bug on top of that issue.
See:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=578&postid=7708

The vbox developers have refused to correct the issue for years:
"... we would therefore possibly need to bump the global config version. We don't want to do that though because that might make downgrading to pre-4.0 impossible."



What to do?

Option 1:
The computer in question is running Linux.
Hence, ATLAS native may be used instead of ATLAS vbox.


Option 2:
If ATLAS vbox is a must, ensure that at least the 1st ATLAS task of a fresh series starts a few seconds before all others.
This task will prepare the disk entry in vbox for all other tasks.
BOINC does not support such a staggered startup sequence out of the box.
Hence, this has to be ensured by a self-made script.
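
For illustration, a minimal sketch of such a script, assuming the ATLAS task names have been looked up beforehand via 'boinccmd --get_tasks' (the names below are placeholders):

    #!/bin/bash
    # Give the first ATLAS task a head start so it can register the
    # multiattach parent disk before the other VMs try to attach it.
    URL=https://lhcathome.cern.ch/lhcathome/
    TASKS=(atlas_task_1 atlas_task_2 atlas_task_3)   # placeholder names

    for t in "${TASKS[@]}"; do
        boinccmd --task "$URL" "$t" suspend
    done
    boinccmd --task "$URL" "${TASKS[0]}" resume   # 1st task prepares the disk entry
    sleep 30                                      # head start for the vbox setup
    for t in "${TASKS[@]:1}"; do
        boinccmd --task "$URL" "$t" resume
    done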
ID: 47120
maeax

Joined: 2 May 07
Posts: 2100
Credit: 159,816,975
RAC: 134,993
Message 47121 - Posted: 10 Aug 2022, 7:28:46 UTC - in response to Message 47118.  

I will test again once a CentOS9-Stream VM running Squid 5.5 is possible.

Here is a new test with Squid 5.5 on a CentOS9-Stream VM:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=362998236
ID: 47121
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2411
Credit: 226,399,553
RAC: 131,759
Message 47124 - Posted: 10 Aug 2022, 8:09:30 UTC - in response to Message 47121.  

This Squid is running inside a Linux VM on a Windows host, right?
Hence, a couple of facts have to be taken into account:

1. Is the TCP stack of the Windows host able to deal with the expected huge number of concurrent connections (see the sketch after this list)?
See: https://support.solarwinds.com/SuccessCenter/s/article/NETSTAT-A-command-displays-too-many-TCP-IP-connections?language=en_US

2. Is VirtualBox's network driver able to handle the number of concurrent TCP connections (in both directions)?
To avoid those kinds of errors in a heavy-load environment, it is recommended not to run Squid in a VM.

3. Is the TCP stack of the Linux VM able to deal with the expected huge number of concurrent connections?

4. Squid 5.x is still not certified by CERN/Fermilab.
There might be issues regarding uploads of huge files.
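
Regarding (1), a sketch of how to check and, if necessary, widen the Windows dynamic port range (run from an elevated PowerShell prompt; the values are only an example, not a recommendation):

    # show the ephemeral port range available for outgoing TCP connections
    netsh int ipv4 show dynamicport tcp
    # widen it, e.g. to 10000-65535 (start + num must not exceed 65536)
    netsh int ipv4 set dynamicport tcp start=10000 num=55536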
ID: 47124
maeax

Joined: 2 May 07
Posts: 2100
Credit: 159,816,975
RAC: 134,993
Message 47126 - Posted: 10 Aug 2022, 10:02:35 UTC - in response to Message 47124.  

The CentOS9-Stream VM is only installed to test Squid 5.5, nothing else.
The CentOS8-Stream VM runs Squid 4.15, which handled everything up to this multiattach ATLAS without problems.
So I have disconnected that Squid 4.15 from the two Threadrippers.
I am currently searching for an idea of where this problem with ATLAS multiattach comes from.
Theory and CMS have not been transferred to multiattach in production yet.
I know well that Squid 5.5 is experimental. Each update for CentOS9 Stream, including one this morning, moves it further along.
ID: 47126
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2411
Credit: 226,399,553
RAC: 131,759
Message 47127 - Posted: 10 Aug 2022, 10:27:14 UTC - in response to Message 47126.  

Your CentOS8 is also a VM, right?
Hence, (1.), (2.) and (3.) from my previous post also apply to this VM.

You always write that you test (and sometimes fail at) the higher levels (e.g. Squid), but you never mention whether you adjusted the basics (e.g. the number of network connections on the host).


MultiAttach is an attribute that affects the way vbox uses the disk images.
It has nothing to do with the network setup a VM performs while it boots.
The latter is configured long after the disks are set up.
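
For illustration, this is roughly how an image gets that attribute on the VirtualBox CLI (a sketch only; vboxwrapper does this itself, and the file name is the one from the MediaRegistry example later in this thread):

    # flag the parent image so many VMs can attach it read-only, each one
    # writing its changes to its own differencing child disk
    VBoxManage modifymedium disk ATLAS_vbox_2.02_image.vdi --type multiattach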
ID: 47127
maeax

Joined: 2 May 07
Posts: 2100
Credit: 159,816,975
RAC: 134,993
Message 47128 - Posted: 10 Aug 2022, 10:29:00 UTC - in response to Message 47127.  

I know this about multiattach.
You can check my configuration if you want.
ID: 47128
entity

Joined: 7 May 17
Posts: 6
Credit: 695,132
RAC: 0
Message 47129 - Posted: 10 Aug 2022, 11:12:16 UTC - in response to Message 47120.  

entity wrote:
These are not running under the "entity" ID. I am using an account manager that creates an account that I cannot log on to. These are running under ID ce6931730.
Is this one of your error tasks?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=362788360

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10802615

On computers like that a race condition may happen if many vbox tasks start concurrently.
This is caused by a double workaround required to solve a vbox issue and (very likely) a vbox bug on top of that issue.
See:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=578&postid=7708

The vbox developers have refused to correct the issue for years:
"... we would therefore possibly need to bump the global config version. We don't want to do that though because that might make downgrading to pre-4.0 impossible."



What to do?

Option 1:
The computer in question is running Linux.
Hence, ATLAS native may be used instead of ATLAS vbox.


Option 2:
If ATLAS vbox is a must, ensure that at least the 1st ATLAS task of a fresh series starts a few seconds before all others.
This task will prepare the disk entry in vbox for all other tasks.
BOINC does not support such a staggered startup sequence out of the box.
Hence, this has to be ensured by a self-made script.

To provide answers to the posted questions:

1. Yes, that looks like one of the error tasks.

2. When the problem first occurred, at least 5 ATLAS tasks were trying to start at the same time. This hasn't been a problem in the past, but I will try to prevent it in the future. BTW, I rebooted the machine and then tried to start one ATLAS task. Same error. Just in case it makes a difference, there were 30 Theory tasks, 8 CMS tasks, and about 20 SixTrack tasks running at the same time. Would CMS or Theory have any bearing on this problem?

3. I have considered native, but I'm in a temporarily reduced computing state at the moment, before moving to a new location. After the move I may try the native approach. Until then I'm kind of stuck with VBox.

Thanks for the responses.
ID: 47129
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2411
Credit: 226,399,553
RAC: 131,759
Message 47130 - Posted: 10 Aug 2022, 11:28:18 UTC - in response to Message 47129.  

2. When the problem first occurred, at least 5 ATLAS tasks were trying to start at the same time. This hasn't been a problem in the past, but I will try to prevent it in the future. BTW, I rebooted the machine and then tried to start one ATLAS task. Same error. Just in case it makes a difference, there were 30 Theory tasks, 8 CMS tasks, and about 20 SixTrack tasks running at the same time. Would CMS or Theory have any bearing on this problem?


From my own experience, 5 concurrently starting ATLAS tasks shouldn't be a problem on a computer like that (but you wrote "at least").

SixTrack doesn't need vbox.
CMS and Theory have not yet switched to 'multiattach'.

I rebooted the machine and then tried to start one ATLAS task.

That is weird.
Can you provide a link to that task's log?
You may set the client to 'no new tasks', stop all ATLAS tasks not yet running, and ensure no ATLAS task is in progress.
Then remove all ATLAS-related disk entries from the VirtualBox Media Manager (keep only the parent disk file).

Then restart 1 ATLAS task and, if it succeeds, start the others.
At the end, resume work fetch.
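
A sketch of that procedure on the command line (the boinccmd operations and vboxmanage calls are standard; the UUID is a placeholder):

    # 1. stop work fetch for the project
    boinccmd --project https://lhcathome.cern.ch/lhcathome/ nomorework
    # 2. once no ATLAS VM is running, inspect the registered disks
    vboxmanage list hdds
    # 3. deregister each ATLAS child disk, keep only the parent vdi
    vboxmanage closemedium disk <UUID_of_child_disk>
    # 4. after 1 fresh ATLAS task has started successfully, resume work fetch
    boinccmd --project https://lhcathome.cern.ch/lhcathome/ allowmorework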
ID: 47130
entity

Joined: 7 May 17
Posts: 6
Credit: 695,132
RAC: 0
Message 47131 - Posted: 10 Aug 2022, 12:30:37 UTC - in response to Message 47130.  

Only 5 were trying to start, as that is what I have set in app_config as the max concurrent.
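
For reference, a minimal sketch of such an app_config.xml (assuming the app name 'ATLAS'; the file lives in the project's directory under BOINC's data dir):

    <app_config>
      <app>
        <name>ATLAS</name>
        <max_concurrent>5</max_concurrent>
      </app>
    </app_config>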

Unfortunately, I can't provide the link, as that task was run under an account I can't log on to. No ATLAS tasks have been attempted since that one, so it would show up under the ce6931730 ID as the last ATLAS task returned.

Is the VBox Media Manager a GUI tool? If so, it isn't available to me on this server, as there is no GUI (no desktop) installed. Is there a CLI tool that does the same thing? We may be thinking the same thing: that there might be something amiss in the VBox config. That was the reason for the reboot yesterday.
ID: 47131
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 47132 - Posted: 10 Aug 2022, 12:40:00 UTC
Last modified: 10 Aug 2022, 12:44:32 UTC

I thought I should mention that not only is Hyper-V incompatible with VirtualBox, but apparently so is WSL2.

I have WSL2 installed on my Win10 machine and am running BOINC 7.16.6 just fine under Ubuntu 20.04.4.
I also have VirtualBox 6.1.36 installed on the Windows side, along with BOINC 7.20.2, where VBox shows up properly in the log file.
But LHC does not see it (neither does Rosetta, for that matter):
"Virtualbox (6.1.36) installed, CPU does not have hardware virtualization support"

I was able to run VBox just fine before enabling WSL2.
So you have to choose one or the other.
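
For anyone who wants to switch back and forth, a commonly used toggle (an assumption on my part, not something tested here; run from an elevated prompt and reboot afterwards; while it is 'off', WSL2 will not start):

    # give VirtualBox full hardware virtualization (disables Hyper-V and WSL2)
    bcdedit /set hypervisorlaunchtype off
    # restore Hyper-V and WSL2
    bcdedit /set hypervisorlaunchtype auto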
ID: 47132
entity

Joined: 7 May 17
Posts: 6
Credit: 695,132
RAC: 0
Message 47133 - Posted: 10 Aug 2022, 13:17:43 UTC - in response to Message 47131.  

Only 5 were trying to start, as that is what I have set in app_config as the max concurrent.

Unfortunately, I can't provide the link, as that task was run under an account I can't log on to. No ATLAS tasks have been attempted since that one, so it would show up under the ce6931730 ID as the last ATLAS task returned.

Is the VBox Media Manager a GUI tool? If so, it isn't available to me on this server, as there is no GUI (no desktop) installed. Is there a CLI tool that does the same thing? We may be thinking the same thing: that there might be something amiss in the VBox config. That was the reason for the reboot yesterday.

I think I may have found the problem. Looking at the VirtualBox config XML files, I can see an ATLAS medium entry with a filename pointing to a slot that doesn't exist. I think I may be able to fix this with the vboxmanage CLI.
ID: 47133
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2411
Credit: 226,399,553
RAC: 131,759
Message 47135 - Posted: 10 Aug 2022, 13:38:15 UTC - in response to Message 47131.  

Probably the most reliable way is to wait until all running VMs for the affected user account are finished and automatically deregistered.
If the assumption is correct that you only run BOINC VMs (no additional self-created ones), then there shouldn't be any '*.vbox' files left (below BOINC's 'slots' directory).
If no '*.vbox' files are present, there's just 1 file left to check: 'VirtualBox.xml', usually located in the user's home:
~/.config/VirtualBox/VirtualBox.xml

Edit this file, locate the 'MediaRegistry', and remove the entry (the complete line) for the ATLAS vdi file.
It looks like this:
    <MediaRegistry>
      <HardDisks>
        <HardDisk uuid="{f888c51e-7603-4495-8794-fd67809dc4e8}" location="/path/to/ATLAS_vbox_2.02_image.vdi" format="VDI" type="Normal"/>
      </HardDisks>
    </MediaRegistry>


It's also fine if no MediaRegistry exists.
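
If you'd rather not edit by hand, a one-liner sketch (VirtualBox and BOINC must be stopped; adjust the vdi name if yours differs; a .bak copy of the file is kept):

    sed -i.bak '/ATLAS_vbox_2.02_image.vdi/d' ~/.config/VirtualBox/VirtualBox.xml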


Then start 1 fresh ATLAS task.
VirtualBox should now write the MediaRegistry to the freshly created '*.vbox' file (slots dir).


Those are nearly the steps vboxwrapper should perform automatically.
ID: 47135
entity

Joined: 7 May 17
Posts: 6
Credit: 695,132
RAC: 0
Message 47138 - Posted: 10 Aug 2022, 14:33:59 UTC - in response to Message 47135.  
Last modified: 10 Aug 2022, 14:58:04 UTC

It was an orphaned snapshot file located under the parent ATLAS vdi file. It had its own UUID assigned to it and was marked as inaccessible. I used the 'vboxmanage closemedium <UUID>' command and it disappeared. Now the only thing left is the parent ATLAS vdi file. Should that parent be removed as well?

Update: I had to remove the parent, as the snapshot came back after the closemedium command was issued. Once the parent was closed using the closemedium command, the MediaRegistry in the VirtualBox.xml file disappeared as well. Hopefully ATLAS is clean now.
ID: 47138
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2411
Credit: 226,399,553
RAC: 131,759
Message 47140 - Posted: 10 Aug 2022, 15:01:44 UTC - in response to Message 47138.  

It's not a must, but as long as the parent vdi file is not attached to any VM, you can safely close it using 'vboxmanage closemedium <parent_file|UUID>' again. It will be re-registered automatically by the next starting VM.

You should only remove the file itself if you suspect it is corrupt.
ID: 47140
maeax

Joined: 2 May 07
Posts: 2100
Credit: 159,816,975
RAC: 134,993
Message 47141 - Posted: 11 Aug 2022, 2:30:23 UTC - in response to Message 47128.  

For the CVMFS probe in vbox v2.02:
What are the names of the CERN servers?
ID: 47141
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,491,903
RAC: 2,069
Message 47142 - Posted: 11 Aug 2022, 7:34:14 UTC

I can't explain this computation error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=363119537

Everything was clean in VirtualBox. The result log says 'Could not add storage controller(s) to VM', and because of that it of course could not start a VM.
An unknown option --sataportcount is mentioned, although the command given should use --portcount 3.
Where is 'sata' added from?

I tried a new ATLAS task without doing anything myself, and it is now running without a problem.

An Easter/Pentecost miracle?
ID: 47142
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2411
Credit: 226,399,553
RAC: 131,759
Message 47144 - Posted: 11 Aug 2022, 8:53:37 UTC - in response to Message 47142.  

Very strange, see:
https://github.com/BOINC/boinc/blob/76dffd0dda2ee7a7f881e3dd08c35edec497a504/samples/vboxwrapper/vbox_vboxmanage.cpp#L449-L455

For some unknown reason the function is_virtualbox_version_newer(4, 3, 0) decided that your vbox version is not newer than 4.3.0, although stderr.txt shows it is 6.1.36.

Hence the wrong option "sataportcount" instead of "portcount".
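
For comparison, the two command forms (VM and controller names are placeholders):

    # what vboxwrapper should issue for VirtualBox >= 4.3
    VBoxManage storagectl "boinc_vm" --name "Hard Disk Controller" --add sata --portcount 3
    # the pre-4.3 form it wrongly fell back to, which 6.1.36 rejects
    VBoxManage storagectl "boinc_vm" --name "Hard Disk Controller" --add sata --sataportcount 3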


BTW:
VirtualBox sometimes allows command option names from older versions to be used even in newer versions; sometimes (like here) it doesn't.
ID: 47144