Message boards : ATLAS application : Repeated computation errors - Missing Files
tgm

Joined: 5 Sep 09
Posts: 10
Credit: 1,247,559
RAC: 0
Message 45951 - Posted: 27 Dec 2021, 7:12:18 UTC

On my other PC I did a complete removal of VirtualBox, including every remnant I could find in the file system and registry. Reboot. Reinstalled 6.1.30 without extensions. It didn't ask to upgrade extensions either. Reboot. Opened up LHC to processing. Same errors occurring at the same times. It does not appear to be associated with extensions. The long and the short of it is that ATLAS support will need to put in some effort to get it working properly with VirtualBox 6.x. Seeing that 5.x and 6.0 have been out of support for more than a year, waiting on this effort is probably not a good idea, since it's only a matter of time before the BOINC project delivers 6.x as part of upgrades. One thing I also noticed is that the size of the VirtualBox install package has fluctuated a lot, both up and down. I wonder if some included/excluded pieces may be impacting things.
IDENTICAL is only a concept...
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,850,892
RAC: 37,952
Message 45952 - Posted: 27 Dec 2021, 8:00:24 UTC - in response to Message 45951.  

VMs from LHC@home require VirtualBox 64-bit support.
It's not possible to run them as long as certain Windows components are enabled.

Your logfiles show that those components are enabled, although you have already been asked to disable them.
2021-12-26 23:02:39 (23948): Detected: Sandbox Configuration Enabled
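
A log line like the one above can be picked apart mechanically, which helps when scanning many failed tasks. The following is a minimal Python sketch, assuming every entry follows the `timestamp (pid): message` layout seen in the quoted line. The names `parse_log_line` and `detected_features` are illustrative only and are not part of vboxwrapper or BOINC.

```python
import re
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional

# Assumed layout, taken from the single line quoted in this thread:
#   2021-12-26 23:02:39 (23948): Detected: Sandbox Configuration Enabled
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")


class LogEntry(NamedTuple):
    timestamp: datetime
    pid: int
    message: str


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Split one log line into timestamp, process ID and message.

    Returns None for lines that do not match the assumed layout.
    """
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return LogEntry(stamp, int(match.group(2)), match.group(3))


def detected_features(lines: Iterable[str]) -> List[str]:
    """Collect the text after 'Detected: ' from all matching entries."""
    prefix = "Detected: "
    found = []
    for line in lines:
        entry = parse_log_line(line)
        if entry is not None and entry.message.startswith(prefix):
            found.append(entry.message[len(prefix):])
    return found
```

Feeding the task's stderr output line by line into `detected_features` would list everything the wrapper reports as detected, so results from different hosts can be compared side by side.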


Further information:
https://forums.virtualbox.org/viewtopic.php?f=1&t=62339
I suggest carefully reading the complete thread.
tgm
Joined: 5 Sep 09
Posts: 10
Credit: 1,247,559
RAC: 0
Message 45956 - Posted: 28 Dec 2021, 5:19:55 UTC - in response to Message 45952.  

Let me assure you that the Windows Sandbox feature is NOT ENABLED and never has been on this machine (PC1). It was once enabled on PC6 but has not been since VirtualBox was installed. NONE of the Microsoft Hyper-V features have ever been enabled either. All of the required BIOS settings (VT) are enabled and have been all along. NONE of the potentially interfering technologies listed in the referenced VirtualBox post are present in this environment either, and I've run "bcdedit /set hypervisorlaunchtype off" as suggested in that post. I've even gone further and made sure Credential Guard and Device Guard are fully disabled with "DG_Readiness_Tool_v3.6.ps1 -Disable" (they were never enabled either).

Yes, I see the same error messages that you do, but it appears that they are not accurate. A Google search for "Detected: Sandbox Configuration Enabled" brings up some interesting results. First, this is not the first time this error has come up in LHC processing (see: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=95&postid=1254 ). This involved a Mac box though.

Even more curious is a post in the QuChemPedIA@home number crunching boards with very similar error output on a Windows 10 box (see: https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=20&sort_style=&start=20 ). Take note of PHILIPPE's message and the output he received.

But here is some more info that may be related... Both PC1 and PC6 have Windows Pro installed; PC1 is W11 and PC6 is W10. Both of these machines are recent builds and have version 21H2 installed. From what I can tell W11 is more of a look and feel update than anything else. I would bet that most of the code base is the same. Both PC1 and PC6 have Virtualbox 6.1.30 installed with one having VBox extensions installed and the other not. But I also experienced the same error output with VBox 5.2.44 installed (without extensions).

So are the ATLAS work units using a similar code base to QuChemPedIA@home? Is this a BOINC issue? Both of them showing sandbox errors when Sandbox is not installed, and also showing "Error in guest additions for VM: -182" and "Error in host info for VM: -182", is a bit suspect. My guess is that the specific conditions that throw these errors are the culprit. We can be pretty sure that it's not the VBox Extension Pack or Windows Sandbox functionality though. Hyper-V may be related, but systeminfo shows that:
VM Monitor Mode Extensions: Yes
Virtualization Enabled In Firmware: Yes
Second Level Address Translation: Yes
Data Execution Prevention Available: Yes
If Hyper-V were in use, one or more of these would be a "No".
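
The check described above can be scripted. The following is a rough Python sketch, assuming English-language `systeminfo` output with the four field names quoted in this post; `parse_hyperv_requirements` and `hypervisor_reported` are made-up helper names. One caveat to verify on the machine itself: as far as I know, when a hypervisor is already running, `systeminfo` prints a single "A hypervisor has been detected" notice instead of the four Yes/No lines, so the sketch looks for that text as well.

```python
from typing import Dict

# Field names as quoted in the post; other display languages use other labels.
HYPERV_KEYS = (
    "VM Monitor Mode Extensions",
    "Virtualization Enabled In Firmware",
    "Second Level Address Translation",
    "Data Execution Prevention Available",
)


def parse_hyperv_requirements(text: str) -> Dict[str, bool]:
    """Map each Hyper-V requirement found in the text to True (Yes) or False."""
    result = {}
    for raw in text.splitlines():
        for key in HYPERV_KEYS:
            prefix = key + ":"
            # The first field may share a line with a section label,
            # so search inside the line rather than only at its start.
            pos = raw.find(prefix)
            if pos != -1:
                value = raw[pos + len(prefix):].strip().lower()
                result[key] = value == "yes"
    return result


def hypervisor_reported(text: str) -> bool:
    """True if the output contains the 'hypervisor has been detected' notice."""
    return "a hypervisor has been detected" in text.lower()
```

On Windows the input text could come from `subprocess.run(["systeminfo"], capture_output=True, text=True).stdout`. An empty dictionary together with `hypervisor_reported(...)` returning `True` would suggest that some hypervisor is active, whatever the Windows Features dialog shows.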

The only other specific similarities that I can think of is that both machines use NordVPN which twiddles with the routing table and both have TrendMicro's anti-virus/anti-malware product installed.

Where to go from here... My guess is that the group handling the ATLAS codebase needs to look at this.
IDENTICAL is only a concept...
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 45957 - Posted: 28 Dec 2021, 8:37:56 UTC - in response to Message 45956.  

The only other specific similarities that I can think of is that both machines use NordVPN which twiddles with the routing table and both have TrendMicro's anti-virus/anti-malware product installed.

The AV probably has "real-time protection" enabled. That usually operates at a network level to inspect the packets even if the project is excluded. I would disable the AV entirely.
maeax
Joined: 2 May 07
Posts: 2243
Credit: 173,902,375
RAC: 2,013
Message 45960 - Posted: 28 Dec 2021, 9:20:38 UTC

What happens when you stop these hundreds of failing ATLAS tasks and run a test with a FEW Theory or CMS tasks instead?
Do they run well, or are there other messages?
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,850,892
RAC: 37,952
Message 45964 - Posted: 28 Dec 2021, 10:20:11 UTC - in response to Message 45956.  

... not the first time this error has come up in LHC processing ...

The link points to an old thread (Oct 2015) in the LHC development forum.
The affected system was a Mac running vbox 4.3.30.


QuChemPedIA@home number crunching boards ...
Take note of PHILIPPE's message and the output he received.

He got an out of memory message since his computer had only 4 GB RAM.
Besides that, the posts are nearly 2 years old, but the suggestions are nearly the same, e.g. disable Hyper-V ...


My guess is that the specific conditions that throw these errors

Right.
It's caused by the global setup on the affected computers.


Error in host info for VM

This is reported by "VBoxManage", which is part of the VirtualBox suite.
The command is called from within vboxwrapper and does nothing but collect some information about the local hardware.
This indicates that there are major problems running even simple commands.


The only other specific similarities that I can think of is that both machines use NordVPN which twiddles with the routing table and both have TrendMicro's anti-virus/anti-malware product installed.

Regarding NordVPN
The VMs do not even start.
Hence, they do not yet have a network connection.
Once they run, they make tons of internet requests, all going via the NordVPN servers.
This makes the VM's network less efficient and the NordVPN admins might not be happy about that.
Did you ask them for permission?

Regarding AV software
You may follow Jim's advice to exclude BOINC folders and its network activity from your AV checks.
This was already suggested in Yeti's checklist and the threads you cited.


... the group handling the ATLAS codebase needs to look at this ...

They are preparing an update that will distribute a more recent vboxwrapper, but I doubt this would solve the issues here.
As written above the CERN specific VMs do not even start.
tgm
Joined: 5 Sep 09
Posts: 10
Credit: 1,247,559
RAC: 0
Message 45967 - Posted: 31 Dec 2021, 14:12:40 UTC - in response to Message 45964.  

...
As written above the CERN specific VMs do not even start.


I'm not sure what they are doing for 6+ minutes after being invoked. The BOINC task counters show that something is going on. (I've watched them)
IDENTICAL is only a concept...
tgm
Joined: 5 Sep 09
Posts: 10
Credit: 1,247,559
RAC: 0
Message 45968 - Posted: 31 Dec 2021, 17:47:00 UTC

I've moved to the next step... This system (PC6) has been rebuilt from scratch with only the minimal applications installed (on a Dell). Windows 10 Pro 21H2 with all updates.
The only things added:
BOINC (7.16.20)
VirtualBox (6.1.30) NO extensions installed
Microsoft Visual C++ redistributables
Synology backup agent (2.2.0) (NOT installed when original LHC problems encountered)
TaskInfo (10.0.0.336)
NO Anti-Virus software beyond Microsoft Defender (NO others pre-installed by Dell either); C:\ProgramData\BOINC excluded
NO VPN software installed
NO BOINC projects other than LHC@home are defined at this point

Well, this is about as vanilla as you get. So here we go with some processing results...

At first, a number of ATLAS workunits downloaded and tried to run. All of them crashed after about 6+ minutes. I currently have BOINC throttled to 14 CPU's and 50% CPU load. The workunits downloaded showed to be for 8 CPU's. I noticed that the CPU load never increased. I also saw that the VBox Command Line Tool and Console Window Host processes were continually starting and stopping about every 15 seconds. I did some more research to see if I could determine where these processes were actually running. I couldn't find any other locations: only what lies inside the c:\ProgramData\BOINC file structure seems to be involved. As noted above, this directory tree is excluded from Microsoft Defender.

But then the situation got worse... I enabled workunit download in BOINC again and this time the machine received downloads of CMS Simulation and Theory Simulation and NO Atlas. I aborted them and then adjusted my LHC settings to only receive ATLAS workunits. This did NOT work. The machine continued to download CMS and Theory workunits. I performed a number of project updates. I recycled the BOINC service. I even rebooted the machine and waited an hour. It seems clear that there is an issue with the selection of workunit types within LHC.

So I then let the CMS and Theory workunits run, and they too failed with computation errors. Similar to ATLAS, the machine's CPU didn't load up and VirtualBox-related processes were starting and dying off about every 15 seconds. I watched each CPU core and thread in TaskInfo and saw no load at all.

I really don't have any more ideas. Is it VirtualBox, Windows 21H2, BOINC, LHC@home, or some combination? I need to get this machine back to a production state and reinstall lots of stuff again, including a different AV package (Trend Micro was installed prior to the rebuild).
IDENTICAL is only a concept...
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 45969 - Posted: 31 Dec 2021, 18:59:11 UTC - in response to Message 45968.  

I currently have BOINC throttled to 14 CPU's and 50% CPU load. The workunits downloaded showed to be for 8 CPU's. I noticed that the CPU load never increased. I also saw that the VBox Command Line Tool and Console Window Host processes were continually starting and stopping about every 15 seconds.

There is nothing wrong with limiting the number of CPU's, but have you tried 100% load?

Also, in Computing Preferences, uncheck "Suspend when non-BOINC usage is above ...%".
And allow 100% of the memory to be used by BOINC.
maeax
Joined: 2 May 07
Posts: 2243
Credit: 173,902,375
RAC: 2,013
Message 45970 - Posted: 1 Jan 2022, 6:46:54 UTC - in response to Message 45969.  

Do you have the sandbox on?
If yes, how do you use it for LHC@Home?
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,850,892
RAC: 37,952
Message 45971 - Posted: 1 Jan 2022, 11:51:05 UTC - in response to Message 45968.  

Happy New Year to everybody.



This system (PC6)...

Just to mention it: nobody but the owner can see the computer names.
Others see ranking numbers (depending on which computer contacted the project last).
Hence, if a volunteer runs a couple of computers, others have to make a guess.
It would be better to post a link (including the DB ID) and make it a URL.
Example:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10699976



... have BOINC throttled to 14 CPU's and 50% CPU load.

Not relevant here since none of the VMs ever get launched.
(may have posted that last year)


The workunits downloaded showed to be for 8 CPU's.

That is the default (ATLAS vbox) configuration sent by the project for computers having at least 8 cores.
Shown in BOINC but not relevant here since none of the VMs ever ...


All of them crashed after about 6+ minutes.
.
.
.
... continually starting and stopping about every 15 seconds.

BOINC really tries everything to launch the task but it always gets killed.
After 5-6 min vboxwrapper/BOINC reach a watchdog limit that ends non-responding tasks.


I noticed that the CPU load never increased

Sure, since none of ...



... inside the c:\ProgramData\BOINC file structure seems to be involved. ... this directory tree is excluded from Microsoft Defender.

And the user "boinc" (or whatever you use) has all necessary access rights to that folder?


...the situation got worse... I enabled workunit download in BOINC again and this time the machine received downloads of CMS Simulation and Theory Simulation and NO Atlas.

This depends on the settings you made here:
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
I guess "If no work for selected applications is available, accept work from other applications?" is selected.

The "good" thing:
Theory and CMS fail with the very same error pattern, although they have different VMs and use different vboxwrapper versions.



It seems clear that there is an issue with the selection of workunit types within LHC

Sure, it's never the fault of the guy sitting in front of the computer!



What to do?


1. Reboot
2. Don't start BOINC
3. Login using the account BOINC would use
4. Change to an empty "slots\n\" directory BOINC would normally use
5. Open the VirtualBox GUI and manually create a VM in that directory
Define a RedHat 64-bit Linux guest, 2 CPUs, 2048 MB RAM, 40 GB vdi (dynamically growing)
6. Launch that VM

Check for error messages during the creation process and the VM start.

There's no need to install any OS on that VM.
Just let it start until it prints the message that there's no OS installed on its "disk".
Shut down the VM, remove it from the VirtualBox GUI and remove all files below "slots\n\".
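
For anyone who prefers the command line, steps 5 and 6 and the final cleanup can be approximated with `VBoxManage` instead of the GUI. The Python sketch below only builds the argument lists and runs nothing. It assumes the `VBoxManage` subcommands and options as I know them from VirtualBox 6.1 (worth checking against `VBoxManage --help` on the installed version), and the VM name `boinc-test-vm` and the helper names are invented for illustration.

```python
from pathlib import PureWindowsPath
from typing import List


def build_test_vm_commands(slot_dir: str, name: str = "boinc-test-vm") -> List[List[str]]:
    """Return VBoxManage argument lists that create and start an empty test VM.

    slot_dir is the empty "slots\\n" directory from step 4.
    """
    slot = PureWindowsPath(slot_dir)
    # With --basefolder, VirtualBox places the VM in <basefolder>\<name>\.
    disk = str(slot / name / (name + ".vdi"))
    return [
        # RedHat 64-bit Linux guest, registered so later commands can find it
        ["VBoxManage", "createvm", "--name", name, "--ostype", "RedHat_64",
         "--basefolder", str(slot), "--register"],
        # 2 CPUs, 2048 MB RAM
        ["VBoxManage", "modifyvm", name, "--cpus", "2", "--memory", "2048"],
        # 40 GB (40960 MB) VDI; the "Standard" variant grows dynamically
        ["VBoxManage", "createmedium", "disk", "--filename", disk,
         "--size", "40960", "--format", "VDI", "--variant", "Standard"],
        ["VBoxManage", "storagectl", name, "--name", "SATA", "--add", "sata"],
        ["VBoxManage", "storageattach", name, "--storagectl", "SATA",
         "--port", "0", "--device", "0", "--type", "hdd", "--medium", disk],
        # Start with a window so the "no OS installed" message is visible
        ["VBoxManage", "startvm", name, "--type", "gui"],
    ]


def build_cleanup_commands(name: str = "boinc-test-vm") -> List[List[str]]:
    """Return argument lists that power off the test VM and delete its files."""
    return [
        ["VBoxManage", "controlvm", name, "poweroff"],
        ["VBoxManage", "unregistervm", name, "--delete"],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` while logged in as the account BOINC uses. The first command that fails, together with its error text, is the piece of information the manual procedure above is meant to uncover.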
maeax
Joined: 2 May 07
Posts: 2243
Credit: 173,902,375
RAC: 2,013
Message 45972 - Posted: 1 Jan 2022, 12:37:09 UTC - in response to Message 45971.  

Happy New Year to everybody.

Same procedure as every year.