ATLAS application: Repeated computation errors - Missing Files
Author | Message |
---|---|
Send message Joined: 5 Sep 09 Posts: 10 Credit: 1,247,559 RAC: 0 |
On my other PC I completely removed VirtualBox, including every remnant I could find in the file system and registry, then rebooted and reinstalled 6.1.30 without extensions. It didn't ask to upgrade extensions either. After another reboot I opened up LHC to processing. The same errors occur at the same times, so the problem does not appear to be associated with the extensions.

The long and the short of it is that ATLAS support will need to put in some effort to get it working properly with VirtualBox 6.x. Seeing that 5.x and 6.0 have been out of support for more than a year, waiting on this effort is probably not a good idea, since it's only a matter of time before the BOINC project delivers 6.x as part of upgrades.

One thing I also notice is that the size of the VirtualBox install package has fluctuated a lot, both up and down. I wonder if some included/excluded pieces may be impacting things.

IDENTICAL is only a concept... |
Send message Joined: 15 Jun 08 Posts: 2500 Credit: 248,226,270 RAC: 121,147 |
VMs from LHC@home require VirtualBox 64-bit support. It's not possible to run them as long as certain Windows components are enabled. Your logfiles show that those components are enabled, although you have already been asked to disable them:

2021-12-26 23:02:39 (23948): Detected: Sandbox Configuration Enabled

Further information: https://forums.virtualbox.org/viewtopic.php?f=1&t=62339

I suggest carefully reading the complete thread. |
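For anyone who wants to scan a task's stderr output for such entries, here is a minimal sketch. It assumes every log line follows the "timestamp (pid): message" layout of the single line quoted above; the function names are made up for illustration and are not part of vboxwrapper.

```python
import re
from datetime import datetime

# Layout assumed from the one quoted line: "YYYY-MM-DD HH:MM:SS (pid): message"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")


def parse_line(line):
    """Return (timestamp, pid, message), or None if the line has another layout."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    stamp, pid, message = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), int(pid), message


def detected_features(log_text):
    """Collect the text after 'Detected: ' from every matching log line."""
    prefix = "Detected: "
    found = []
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is not None and parsed[2].startswith(prefix):
            found.append(parsed[2][len(prefix):])
    return found


if __name__ == "__main__":
    sample = "2021-12-26 23:02:39 (23948): Detected: Sandbox Configuration Enabled"
    print(detected_features(sample))
```

Pasting the stderr section of a failed task into `detected_features` lists what the wrapper reported at start-up, which makes it easier to compare two hosts side by side.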
Send message Joined: 5 Sep 09 Posts: 10 Credit: 1,247,559 RAC: 0 |
Let me assure you that the Windows Sandbox feature is NOT ENABLED and never has been on this machine (PC1). It was once allowed on PC6 but has not been since VirtualBox was installed. NONE of the Microsoft Hyper-V features have ever been enabled either. All of the required BIOS settings (VT) are enabled and have been all along. NONE of the potentially interfering technologies listed in the referenced VirtualBox post are in this environment either, and I've run "bcdedit /set hypervisorlaunchtype off" as suggested in that post. I've even gone further and made sure Credential Guard and Device Guard are fully disabled with "DG_Readiness_Tool_v3.6.ps1 -Disable" (they were never enabled either).

Yes, I see the same error messages that you do, but it appears that they are not accurate. Performing a Google search on "Detected: Sandbox Configuration Enabled" brings up some interesting results. First, this is not the first time this error has come up in LHC processing (see: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=95&postid=1254 ). That involved a Mac box, though. Even more curious is a post in the QuChemPedIA@home number crunching boards with very similar error output on a Windows 10 box (see: https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=20&sort_style=&start=20 ). Take note of PHILIPPE's message and the output he received.

But here is some more info that may be related... Both PC1 and PC6 have Windows Pro installed; PC1 is W11 and PC6 is W10. Both of these machines are recent builds and have version 21H2 installed. From what I can tell, W11 is more of a look-and-feel update than anything else. I would bet that most of the code base is the same. Both PC1 and PC6 have VirtualBox 6.1.30 installed, one with the VBox extensions and the other without. But I also experienced the same error output with VBox 5.2.44 installed (without extensions).

So are the ATLAS work units using a code base similar to QuChemPedIA@home's? Is this a BOINC issue? Both of them showing sandbox errors when it is not installed, and also showing "Error in guest additions for VM: -182" and "Error in host info for VM: -182", is a bit suspect. My guess is that the specific conditions that throw these errors are the culprit. We can be pretty sure that it's not the VBox Extension Pack or Windows Sandbox functionality, though.

Hyper-V may be related, but systeminfo shows:

VM Monitor Mode Extensions: Yes
Virtualization Enabled In Firmware: Yes
Second Level Address Translation: Yes
Data Execution Prevention Available: Yes

If Hyper-V were in use, one or more of these would be a "No".

The only other specific similarities that I can think of is that both machines use NordVPN which twiddles with the routing table and both have TrendMicro's anti-virus/anti-malware product installed.

Where to go from here... My guess is that the group handling the ATLAS codebase needs to look at this.

IDENTICAL is only a concept... |
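The systeminfo check described above can be repeated on each host with a short script. This is only a sketch, assuming English-language systeminfo output and the four field names quoted in the post; the function name is hypothetical. As far as I know, when a hypervisor is already running Windows replaces those four fields with a single "A hypervisor has been detected" line, so the sketch looks for that as well.

```python
import subprocess

# Field names exactly as quoted in the post (English-language Windows assumed).
FIELDS = (
    "VM Monitor Mode Extensions",
    "Virtualization Enabled In Firmware",
    "Second Level Address Translation",
    "Data Execution Prevention Available",
)


def parse_hyperv_requirements(systeminfo_text):
    """Extract the Hyper-V requirement fields from systeminfo output."""
    result = {"hypervisor_detected": False}
    for line in systeminfo_text.splitlines():
        # Printed instead of the four fields when a hypervisor is active.
        if "A hypervisor has been detected" in line:
            result["hypervisor_detected"] = True
        for name in FIELDS:
            marker = name + ":"
            if marker in line:
                value = line.split(marker, 1)[1].strip()
                result[name] = value.lower() == "yes"
    return result


if __name__ == "__main__":
    try:
        # systeminfo exists only on Windows; elsewhere this raises OSError.
        output = subprocess.run(
            ["systeminfo"], capture_output=True, text=True, check=False
        ).stdout
    except OSError as exc:
        output = ""
        print("could not run systeminfo:", exc)
    print(parse_hyperv_requirements(output))
```

If `hypervisor_detected` comes back true, something is still claiming the virtualization extensions, whatever the individual Windows feature checkboxes say.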
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
> The only other specific similarities that I can think of is that both machines use NordVPN which twiddles with the routing table and both have TrendMicro's anti-virus/anti-malware product installed.

The AV probably has "real-time protection" enabled. That usually operates at a network level to inspect the packets even if the project is excluded. I would disable the AV entirely. |
Send message Joined: 2 May 07 Posts: 2189 Credit: 173,096,611 RAC: 53,412 |
What happens when you stop these hundreds of failing ATLAS tasks and run a test with a FEW Theory or CMS tasks? Do they run well, or are there other messages? |
Send message Joined: 15 Jun 08 Posts: 2500 Credit: 248,226,270 RAC: 121,147 |
> ... not the first time this error has come up in LHC processing ...

The link points to an old thread (Oct 2015) in the LHC development forum. The affected system was a Mac running vbox 4.3.30.

> QuChemPedIA@home number crunching boards ...

He got an out-of-memory message since his computer had only 4 GB RAM. Besides that, the posts are nearly 2 years old, but the suggestions are nearly the same, e.g. disable Hyper-V ...

> My guess is that the specific conditions that throw these errors

Right. It's caused by the global setup on the affected computers.

> Error in host info for VM

This is reported by "VBoxManage", which is part of the VirtualBox suite. The command is called from within vboxwrapper and does nothing but collect some information about the local hardware. This indicates that there are major problems running even simple commands.

> The only other specific similarities that I can think of is that both machines use NordVPN which twiddles with the routing table and both have TrendMicro's anti-virus/anti-malware product installed.

Regarding NordVPN:
The VMs do not even start. Hence, they do not yet have a network connection. Once they run, they make tons of internet requests, all going via the NordVPN servers. This makes the VM's network less efficient, and the NordVPN admins might not be happy about that. Did you ask them for permission?

Regarding AV software:
You may follow Jim's advice to exclude the BOINC folders and their network activity from your AV checks. This was already suggested in Yeti's checklist and the threads you cited.

> ... the group handling the ATLAS codebase needs to look at this ...

They are preparing an update that will distribute a more recent vboxwrapper, but I doubt this would solve the issues here. As written above, the CERN-specific VMs do not even start. |
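To check outside of BOINC whether VBoxManage can run a simple query at all, something like the sketch below may help. The exact subcommand vboxwrapper uses is not stated in this thread; `VBoxManage list hostinfo` is my assumption of an equivalent hardware query, and the helper name is hypothetical. It also assumes VBoxManage is on the PATH, which a default Windows install does not guarantee.

```python
import subprocess
import sys


def run_host_info(command=("VBoxManage", "list", "hostinfo"), timeout=30):
    """Run a host-info query and return (stdout, error); exactly one is None."""
    try:
        proc = subprocess.run(
            list(command), capture_output=True, text=True, timeout=timeout
        )
    except OSError as exc:
        # Typically: executable not found, or no permission to run it.
        return None, "could not start %s: %s" % (command[0], exc)
    except subprocess.TimeoutExpired:
        return None, "timed out after %d s" % timeout
    if proc.returncode != 0:
        return None, "exit code %d: %s" % (proc.returncode, proc.stderr.strip())
    return proc.stdout, None


if __name__ == "__main__":
    output, error = run_host_info()
    if error is not None:
        print("host info failed:", error)
        sys.exit(1)
    print(output)
```

Running it once as a normal user and once under the account the BOINC service uses would show whether the failure depends on the account.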
Send message Joined: 5 Sep 09 Posts: 10 Credit: 1,247,559 RAC: 0 |
... I'm not sure what they are doing for 6+ minutes after being invoked. The BOINC task counters show that something is going on. (I've watched them) IDENTICAL is only a concept... |
Send message Joined: 5 Sep 09 Posts: 10 Credit: 1,247,559 RAC: 0 |
I've moved to the next step... This system (PC6) has been rebuilt from scratch with only the minimal applications installed (on a Dell). Windows 10 Pro 21H2 with all updates. The only things added:

- BOINC (7.16.20)
- VirtualBox (6.1.30), NO extensions installed
- Microsoft Visual C++ redistributables
- Synology backup agent (2.2.0) (NOT installed when the original LHC problems were encountered)
- TaskInfo (10.0.0.336)
- NO anti-virus software beyond Microsoft Defender (NO others pre-installed by Dell either), with C:\ProgramData\BOINC excluded
- NO VPN software installed
- NO BOINC projects other than LHC@home are defined at this point

Well, this is about as vanilla as you get. So here we go with some processing results...

At first, a number of ATLAS workunits downloaded and tried to run. All of them crashed after about 6+ minutes. I currently have BOINC throttled to 14 CPU's and 50% CPU load. The workunits downloaded showed to be for 8 CPU's. I noticed that the CPU load never increased. I also saw that the VBox Command Line Tool and Console Window Host processes were continually starting and stopping about every 15 seconds. I did some more research to see if I could determine where these processes were actually running. I couldn't find any location other than the c:\ProgramData\BOINC file structure that seems to be involved. As noted above, this directory tree is excluded from Microsoft Defender.

But then the situation got worse... I enabled workunit download in BOINC again and this time the machine received downloads of CMS Simulation and Theory Simulation and NO Atlas. I aborted them and then adjusted my LHC settings to only receive ATLAS workunits. This did NOT work. The machine continued to download CMS and Theory workunits. I performed a number of project updates. I restarted the BOINC service. I even rebooted the machine and waited an hour. It seems clear that there is an issue with the selection of workunit types within LHC.

So I then let the CMS and Theory workunits run, and they too failed with computation errors. Similar to ATLAS, the machine's CPU didn't load up and VirtualBox-related processes were starting and dying off about every 15 seconds. I watched each CPU core and thread in TaskInfo and saw no load at all.

I really don't have any more ideas. Is it VirtualBox, Windows 21H2, BOINC, LHC@home, or some combination? I need to get this machine back to a production state and reinstall lots of stuff again, including a different AV package (Trend Micro was installed prior to the rebuild).

IDENTICAL is only a concept... |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
> I currently have BOINC throttled to 14 CPU's and 50% CPU load. The workunits downloaded showed to be for 8 CPU's. I noticed that the CPU load never increased. I also saw that the VBox Command Line Tool and Console Window Host processes were continually starting and stopping about every 15 seconds.

There is nothing wrong with limiting the number of CPUs, but have you tried 100% load? Also, in Computing Preferences, uncheck "Suspend when non-BOINC usage is above ...%", and allow 100% of the memory to be used by BOINC. |
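The same preferences can also be set without the GUI through a local override file. The sketch below generates one. The tag names are what I recall from the BOINC client documentation for `global_prefs_override.xml` and should be checked against your client version before use; the function name is hypothetical. A value of 0 for `suspend_cpu_usage` is assumed to disable the "suspend when non-BOINC usage is above" rule.

```python
import xml.etree.ElementTree as ET


def build_override(cpu_usage_limit=100.0, suspend_cpu_usage=0.0,
                   ram_busy_pct=100.0, ram_idle_pct=100.0):
    """Return the XML text for a local BOINC preferences override."""
    root = ET.Element("global_preferences")
    settings = {
        # "Use at most N % of CPU time"
        "cpu_usage_limit": cpu_usage_limit,
        # "Suspend when non-BOINC CPU usage is above N %"; 0 assumed to mean off
        "suspend_cpu_usage": suspend_cpu_usage,
        # Memory BOINC may use while the computer is in use / not in use
        "ram_max_used_busy_pct": ram_busy_pct,
        "ram_max_used_idle_pct": ram_idle_pct,
    }
    for tag, value in settings.items():
        ET.SubElement(root, tag).text = "%.6f" % value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print only; copying the text into the BOINC data directory is left
    # to the reader once the tag names have been verified.
    print(build_override())
```

The number of CPUs is deliberately left alone here, in line with the remark above that limiting it is fine.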
Send message Joined: 2 May 07 Posts: 2189 Credit: 173,096,611 RAC: 53,412 |
Do you have the sandbox on? If yes, how do you use it for LHC@Home? |
Send message Joined: 15 Jun 08 Posts: 2500 Credit: 248,226,270 RAC: 121,147 |
Happy new year to everybody.

> This system (PC6)...

Just to mention it: nobody but the owner can see the computer names. Others see ranking numbers (depending on which computer contacted the project last). Hence, if a volunteer runs a couple of computers, others have to guess. It would be better to post a link (including the DB ID) and make it a URL. Example: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10699976

> ... have BOINC throttled to 14 CPU's and 50% CPU load.

Not relevant here since none of the VMs ever get launched. (may have posted that last year)

> The workunits downloaded showed to be for 8 CPU's.

That is the default (ATLAS vbox) configuration sent by the project for computers having at least 8 cores. It is shown in BOINC but not relevant here since none of the VMs ever ...

> All of them crashed after about 6+ minutes.

BOINC really tries everything to launch the task, but it always gets killed. After 5-6 min vboxwrapper/BOINC reach a watchdog limit that ends non-responding tasks.

> I noticed that the CPU load never increased

Sure, since none of ...

> ... inside the c:\ProgramData\BOINC file structure seems to be involved. ... this directory tree is excluded from Microsoft Defender.

And the user "boinc" (or whatever you use) has all necessary access rights to that folder?

> ...the situation got worse... I enabled workunit download in BOINC again and this time the machine received downloads of CMS Simulation and Theory Simulation and NO Atlas.

This depends on the settings you made here: https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
I guess "If no work for selected applications is available, accept work from other applications?" is selected.

The "good" thing: Theory and CMS fail with the very same error pattern, although they have different VMs and use different vboxwrapper versions.

> It seems clear that there is an issue with the selection of workunit types within LHC

Sure, it's never the fault of the guy sitting in front of the computer!

What to do?
1. Reboot.
2. Don't start BOINC.
3. Log in using the account BOINC would use.
4. Change to an empty "slots\n\" directory BOINC would normally use.
5. Open the VirtualBox GUI and manually create a VM in that directory. Define a RedHat 64-bit Linux guest, 2 CPUs, 2048 MB RAM, 40 GB vdi (dynamically growing).
6. Launch that VM.

Check for error messages during the creation process and the VM start. There's no need to install any OS on that VM. Just let it start until it prints the message that there's no OS installed on its "disk". Shut down the VM, remove it from the VirtualBox GUI and remove all files below "slots\n\". |
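The steps above use the VirtualBox GUI. For those who prefer the command line, the sketch below builds a roughly equivalent sequence of VBoxManage commands as a dry run; it only prints them and does not execute anything. The VM name, the controller name "SATA" and the slot number in the example path are placeholders chosen for illustration, and the helper names are hypothetical.

```python
from pathlib import PureWindowsPath


def build_test_vm_commands(slot_dir, name="boinc-manual-test"):
    """Return VBoxManage command lines that create and start an empty test VM."""
    slot = PureWindowsPath(slot_dir)
    # createvm with --basefolder places the VM in <slot_dir>\<name>\
    disk = str(slot / name / (name + ".vdi"))
    return [
        ["VBoxManage", "createvm", "--name", name, "--ostype", "RedHat_64",
         "--basefolder", str(slot), "--register"],
        ["VBoxManage", "modifyvm", name, "--cpus", "2", "--memory", "2048"],
        # 40 GB expressed in MB; the Standard variant grows dynamically
        ["VBoxManage", "createmedium", "disk", "--filename", disk,
         "--size", str(40 * 1024), "--variant", "Standard"],
        ["VBoxManage", "storagectl", name, "--name", "SATA", "--add", "sata"],
        ["VBoxManage", "storageattach", name, "--storagectl", "SATA",
         "--port", "0", "--device", "0", "--type", "hdd", "--medium", disk],
        ["VBoxManage", "startvm", name, "--type", "gui"],
    ]


def build_cleanup_commands(name="boinc-manual-test"):
    """Return command lines that power off the test VM and delete its files."""
    return [
        ["VBoxManage", "controlvm", name, "poweroff"],
        ["VBoxManage", "unregistervm", name, "--delete"],
    ]


if __name__ == "__main__":
    # Hypothetical slot path; use an empty slot directory on your own host.
    example_slot = r"C:\ProgramData\BOINC\slots\0"
    for cmd in build_test_vm_commands(example_slot) + build_cleanup_commands():
        print(" ".join(cmd))
```

Run the printed commands one at a time while logged in as the account BOINC uses; the first command that fails, together with its error text, is the interesting result.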
Send message Joined: 2 May 07 Posts: 2189 Credit: 173,096,611 RAC: 53,412 |
> Happy new year to everybody.

Same procedure as every year. |
©2024 CERN