1) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45968)
Posted 31 Dec 2021 by tgm
Post:
I've moved to the next step... This system (PC6) has been rebuilt from scratch with only the minimal applications installed (on a Dell). Windows 10 Pro 21H2 with all updates.
The only things added:
BOINC (7.16.20)
VirtualBox (6.1.30) NO extensions installed
Microsoft Visual C++ redistributables
Synology backup agent (2.2.0) (NOT installed when original LHC problems encountered)
TaskInfo (10.0.0.336)
NO Anti-Virus software beyond Microsoft Defender (NO others pre-installed by Dell either) C:\ProgramData\BOINC excluded
NO VPN software installed
NO BOINC projects other than LHC@home are defined at this point

Well, this is about as vanilla as you get. So here we go with some processing results...

At first, a number of ATLAS workunits downloaded and tried to run. All of them crashed after about 6+ minutes. I currently have BOINC throttled to 14 CPUs and 50% CPU load. The downloaded workunits showed as being for 8 CPUs. I noticed that the CPU load never increased. I also saw that the VBox Command Line Tool and Console Window Host processes were continually starting and stopping about every 15 seconds. I did some more research to see if I could determine where these processes were actually running. I couldn't find any location involved other than the c:\ProgramData\BOINC file structure. As noted above, this directory tree is excluded from Microsoft Defender.

But then the situation got worse... I enabled workunit download in BOINC again and this time the machine received downloads of CMS Simulation and Theory Simulation and NO Atlas. I aborted them and then adjusted my LHC settings to only receive ATLAS workunits. This did NOT work. The machine continued to download CMS and Theory workunits. I performed a number of project updates. I recycled the BOINC service. I even rebooted the machine and waited an hour. It seems clear that there is an issue with the selection of workunit types within LHC.

So I then let the CMS and Theory workunits run, and they too failed with computation errors. As with ATLAS, the machine's CPU didn't load up, and Virtualbox-related processes were starting and dying off about every 15 seconds. I watched each CPU core and thread in TaskInfo and saw no load at all.

I really don't have any more ideas. Is it Virtualbox, Windows 21H2, BOINC, LHC@home, or some combination? I need to get this machine back to a production state and reinstall lots of stuff again, including a different AV package (Trend Micro was installed prior to the rebuild).
2) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45967)
Posted 31 Dec 2021 by tgm
Post:
...
As written above the CERN specific VMs do not even start.


I'm not sure what they are doing for 6+ minutes after being invoked. The BOINC task counters show that something is going on. (I've watched them)
3) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45956)
Posted 28 Dec 2021 by tgm
Post:
Let me assure you that the Windows Sandbox feature is NOT ENABLED and never has been on this machine (PC1). It was once allowed on PC6 but has not been since Virtualbox was installed. NONE of the Microsoft Hyper-V features have ever been enabled either. All of the required BIOS settings (VT) are enabled and have been all along. NONE of the potentially interfering technologies listed in the referenced Virtualbox post are in this environment either, and I've invoked "bcdedit /set hypervisorlaunchtype off" as suggested in that post. I've even gone further and made sure Credential Guard and Device Guard are fully disabled with "DG_Readiness_Tool_v3.6.ps1 -Disable" (they were never enabled either).

Yes, I see the same error messages that you do, but it appears that these are not accurate. Performing a Google search on, "Detected: Sandbox Configuration Enabled", brings up some interesting results. First, this is not the first time this error has come up in LHC processing (see: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwjJ0LWu0oX1AhWvl4kEHZlmDpwQFnoECAQQAQ&url=https%3A%2F%2Flhcathomedev.cern.ch%2Flhcathome-dev%2Fforum_thread.php%3Fid%3D95%26postid%3D1254&usg=AOvVaw1QKAMAnBeZF5Aw4ZB5jZS1 ). This involved a Mac box though.

Even more curious is a post in the QuChemPedIA@home number crunching boards with very similar error output on a Windows 10 box (see: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwjJ0LWu0oX1AhWvl4kEHZlmDpwQFnoECAYQAQ&url=https%3A%2F%2Fquchempedia.univ-angers.fr%2Fathome%2Fforum_thread.php%3Fid%3D20%26sort_style%3D%26start%3D20&usg=AOvVaw0LSI5wNq8lwh0JxZP3UgiG ). Take note of PHILIPPE's message and the output he received.

But here is some more info that may be related... Both PC1 and PC6 have Windows Pro installed; PC1 is W11 and PC6 is W10. Both of these machines are recent builds and have version 21H2 installed. From what I can tell W11 is more of a look and feel update than anything else. I would bet that most of the code base is the same. Both PC1 and PC6 have Virtualbox 6.1.30 installed with one having VBox extensions installed and the other not. But I also experienced the same error output with VBox 5.2.44 installed (without extensions).

So are the ATLAS work units using a similar code base to QuChemPedIA@home? Is this a BOINC issue? Both of them showing sandbox errors when the sandbox is not installed, and also showing "Error in guest additions for VM: -182" and "Error in host info for VM: -182", is a bit suspect. My guess is that the specific conditions that throw these errors are the culprit. We can be pretty sure that it's not the VBox Extension pack or Windows Sandbox functionality though. Hyper-V may be related, but systeminfo shows that:
VM Monitor Mode Extensions: Yes
Virtualization Enabled In Firmware: Yes
Second Level Address Translation: Yes
Data Execution Prevention Available: Yes
If Hyper-V was in use, one or more of these would be a "no"
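For anyone who wants to repeat this check, here is a minimal Python sketch that pulls those four lines out of `systeminfo` output. The names `hyperv_requirements` and `hypervisor_detected` and the sample text are illustrative only, not part of any tool. One assumption worth flagging: when a hypervisor is already running, `systeminfo` typically prints a single "A hypervisor has been detected" line in place of the four fields, so the sketch reports that case separately.

```python
import re

# The four "Hyper-V Requirements" fields quoted in the post above.
FIELDS = (
    "VM Monitor Mode Extensions",
    "Virtualization Enabled In Firmware",
    "Second Level Address Translation",
    "Data Execution Prevention Available",
)


def hyperv_requirements(systeminfo_text):
    """Return {field: True/False} for each requirement line that is present."""
    result = {}
    for field in FIELDS:
        match = re.search(
            re.escape(field) + r":\s*(Yes|No)\b", systeminfo_text, re.IGNORECASE
        )
        if match:
            result[field] = match.group(1).lower() == "yes"
    return result


def hypervisor_detected(systeminfo_text):
    """Assumption: systeminfo prints this phrase when a hypervisor is running."""
    return "a hypervisor has been detected" in systeminfo_text.lower()


# Sample text copied from the values reported in the post.
sample = """Hyper-V Requirements:      VM Monitor Mode Extensions: Yes
                           Virtualization Enabled In Firmware: Yes
                           Second Level Address Translation: Yes
                           Data Execution Prevention Available: Yes"""

report = hyperv_requirements(sample)
for name, ok in report.items():
    print(f"{name}: {'Yes' if ok else 'No'}")
print("Hypervisor already running:", hypervisor_detected(sample))
```

To run it against a live machine, the text could be captured with `subprocess.run(["systeminfo"], capture_output=True, text=True).stdout` and passed to the same two functions.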

The only other specific similarities that I can think of are that both machines use NordVPN, which twiddles with the routing table, and both have TrendMicro's anti-virus/anti-malware product installed.

Where to go from here... My guess is that the group handling the ATLAS codebase needs to look at this.
4) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45951)
Posted 27 Dec 2021 by tgm
Post:
On my other PC I did a complete removal of Virtualbox, including every remnant I could find in the file system and registry. Reboot. Reinstall 6.1.30 without extensions. It didn't ask to upgrade extensions either. Reboot. Opened up LHC to processing. Same errors occurring at the same times. It does not appear to be associated with extensions. The long and the short of it is that ATLAS support will need to put in some effort to get it working properly with Virtualbox 6.x. Seeing that 5.x and 6.0 have been out of support for more than a year, waiting on this effort is probably not a good idea, since it's only a matter of time before the BOINC project delivers 6.x as part of upgrades. One thing I also notice is that the size of the Virtualbox install package has fluctuated a lot, both up and down. I wonder if some included/excluded pieces may be impacting things.
5) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45907)
Posted 21 Dec 2021 by tgm
Post:
Well, I downgraded to Virtualbox 5.2.44 and the results changed, but they still have errors. The work units ran for hours instead of 6 minutes each. It looked like it was working but failed in the end. The install was with the defaults, so whatever Virtualbox throws in there for sandbox is what's there. Virtualbox will not even initialize if Hyper-V is installed. I've spent too much time on this, so I'm just disabling ATLAS units for now. I probably will need to run a newer Virtualbox anyway, since the 5.2.x and 6.0 versions have been out of support for more than a year. I have other environments to be concerned with that are the priority.

Thanks, Tim
6) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45900)
Posted 20 Dec 2021 by tgm
Post:
Well, I opened up things so you can view my computers... The issues are with Atlas on PC6. Guess what, I've provided FAR, FAR, FAR more diagnostic info than any of those records show. Another tidbit... I see that PC1 is having the same problem. This too has the current Virtualbox, but it's running under Windows 11 (with far fewer resources).

In an attempt to repair the Virtualbox install, I ended up trashing the mouse drivers and had to go a bit crazy going through a recovery without a mouse. Don't buy into some of the fixes for the -182 error that are out there! I just love Windows!!!!! On Linux, Unix, etc. you can do just about anything from the command line. In any case, my next move is to go back to an earlier version of Virtualbox.

Tim

BTW... is there a way to shut off only Atlas work units from LHC?

IDENTICAL is only a concept...
7) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45897)
Posted 19 Dec 2021 by tgm
Post:
Your Computer is not visable for us.

What is it that you are trying to see that I have not provided? I certainly don't want my machine to be visible outside of my network. Or is it just a data payload sent back to the server? I sure would like to know a lot more regarding what specific info is in that payload if this is the case.
8) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45893)
Posted 18 Dec 2021 by tgm
Post:
I had to wait for some more Atlas work units, so now I have some data to look at...

First, the work units seem to start and process but all of them crash after about 6 minutes. This is consistent and they all show computation error. Some of the data I was looking at from the slots suggest that Virtualbox might not be installed properly, so I stopped BOINC, uninstalled Virtualbox, rebooted, downloaded Vbox and the extension pack, and reinstalled Vbox and the extensions again. The results were the same.

Here is a copy of the output in the Atlas slot (before it was replaced by another project):

stderr.txt

2021-12-18 16:38:53 (24508): Detected: vboxwrapper 26197
2021-12-18 16:38:53 (24508): Detected: BOINC client v7.7
2021-12-18 16:38:53 (24508): Status Report: Launching vboxsvc.exe. (PID = '8')
2021-12-18 16:40:34 (24508): Error in guest additions for VM: -182
Command:
VBoxManage -q list systemproperties
Output:

2021-12-18 16:40:34 (24508): Detected: VirtualBox VboxManage Interface (Version: 6.1.30)
2021-12-18 16:40:34 (24508): Detected: Sandbox Configuration Enabled

vboxreplay.txt

"VBoxSVC.exe" --logrotate 1
VBoxManage -q --version
VBoxManage -q list systemproperties
VBoxManage -q list systemproperties
VBoxManage -q list systemproperties
VBoxManage -q list systemproperties
VBoxManage -q list systemproperties
VBoxManage -q list systemproperties
VBoxManage -q list hostinfo
VBoxManage -q list hostinfo
VBoxManage -q list hostinfo
VBoxManage -q list hostinfo

init_data.xml

<app_init_data>
<major_version>7</major_version>
<minor_version>16</minor_version>
<release>20</release>
<app_version>200</app_version>
<userid>163767</userid>
<teamid>0</teamid>
<hostid>10699976</hostid>
<app_name>ATLAS</app_name>
<project_preferences>

<apps_selected>
<app_id>1</app_id>
<app_id>11</app_id>
<app_id>13</app_id>
<app_id>14</app_id>
</apps_selected>
<allow_non_preferred_apps>1</allow_non_preferred_apps>
<max_jobs>0</max_jobs>
<max_cpus>0</max_cpus>
</project_preferences>
<user_name>tgm</user_name>
<project_dir>C:\ProgramData\BOINC/projects/lhcathome.cern.ch_lhcathome</project_dir>
<boinc_dir>C:\ProgramData\BOINC</boinc_dir>
<authenticator>6cdf82f197597539c1f4d644cbc8e49c</authenticator>
<wu_name>13lMDmn74C0n9Rq4apoT9bVoABFKDmABFKDmkUTQDmABFKDmB3IJin</wu_name>
<result_name>13lMDmn74C0n9Rq4apoT9bVoABFKDmABFKDmkUTQDmABFKDmB3IJin_3</result_name>
<comm_obj_name>boinc_4</comm_obj_name>
<slot>4</slot>
<client_pid>7464</client_pid>
<wu_cpu_time>0.000000</wu_cpu_time>
<starting_elapsed_time>0.000000</starting_elapsed_time>
<using_sandbox>1</using_sandbox>
<vm_extensions_disabled>0</vm_extensions_disabled>
<user_total_credit>1242355.377128</user_total_credit>
<user_expavg_credit>1.707136</user_expavg_credit>
<host_total_credit>46.036917</host_total_credit>
<host_expavg_credit>1.177390</host_expavg_credit>
<resource_share_fraction>0.526316</resource_share_fraction>
<checkpoint_period>360.000000</checkpoint_period>
<fraction_done_start>0.000000</fraction_done_start>
<fraction_done_end>1.000000</fraction_done_end>
<gpu_type></gpu_type>
<gpu_device_num>-1</gpu_device_num>
<gpu_opencl_dev_index>-1</gpu_opencl_dev_index>
<gpu_usage>0.000000</gpu_usage>
<ncpus>8.000000</ncpus>
<rsc_fpops_est>43200000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>6000000000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>10695475200.000000</rsc_memory_bound>
<rsc_disk_bound>10000000000.000000</rsc_disk_bound>
<computation_deadline>1640467162.000000</computation_deadline>
<vbox_window>0</vbox_window>
<no_priority_change>0</no_priority_change>
<process_priority>-1</process_priority>
<process_priority_special>-1</process_priority_special>
<host_info>
<timezone>-18000</timezone>
<domain_name>PC6</domain_name>
<ip_addr>192.168.56.1</ip_addr>
<host_cpid>2621baeceb17c00bfdc599f9f6163fa8</host_cpid>
<p_ncpus>28</p_ncpus>
<p_vendor>GenuineIntel</p_vendor>
<p_model>Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz [Family 6 Model 85 Stepping 7]</p_model>
<p_features>fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx tm2 dca pbe fsgsbase bmi1 smep bmi2</p_features>
<p_fpops>4952372366.009323</p_fpops>
<p_iops>17043145481.827120</p_iops>
<p_membw>71428571.428571</p_membw>
<p_calculated>1638181882.457922</p_calculated>
<p_vm_extensions_disabled>0</p_vm_extensions_disabled>
<m_nbytes>68414869504.000000</m_nbytes>
<m_cache>262144.000000</m_cache>
<m_swap>78615416832.000000</m_swap>
<d_total>1999068200960.000000</d_total>
<d_free>1398867492864.000000</d_free>
<os_name>Microsoft Windows 10</os_name>
<os_version>Professional x64 Edition, (10.00.19044.00)</os_version>
<n_usable_coprocs>0</n_usable_coprocs>
<wsl_available>0</wsl_available>
<virtualbox_version>6.1.30</virtualbox_version>
<coprocs>
</coprocs>
</host_info>
<proxy_info>
<socks_server_name></socks_server_name>
<socks_server_port>80</socks_server_port>
<http_server_name></http_server_name>
<http_server_port>80</http_server_port>
<socks5_user_name></socks5_user_name>
<socks5_user_passwd></socks5_user_passwd>
<socks5_remote_dns>0</socks5_remote_dns>
<http_user_name></http_user_name>
<http_user_passwd></http_user_passwd>
<no_proxy></no_proxy>
<no_autodetect>0</no_autodetect>
</proxy_info>
<global_preferences>
<source_project>http://www.worldcommunitygrid.org/</source_project>
<mod_time>1591503487.000000</mod_time>
<battery_charge_min_pct>90.000000</battery_charge_min_pct>
<battery_max_temperature>40.000000</battery_max_temperature>
<run_on_batteries>0</run_on_batteries>
<run_if_user_active>1</run_if_user_active>
<run_gpu_if_user_active>0</run_gpu_if_user_active>
<suspend_if_no_recent_input>0.000000</suspend_if_no_recent_input>
<suspend_cpu_usage>25.000000</suspend_cpu_usage>
<start_hour>0.000000</start_hour>
<end_hour>0.000000</end_hour>
<net_start_hour>0.000000</net_start_hour>
<net_end_hour>0.000000</net_end_hour>
<leave_apps_in_memory>1</leave_apps_in_memory>
<confirm_before_connecting>0</confirm_before_connecting>
<hangup_if_dialed>0</hangup_if_dialed>
<dont_verify_images>0</dont_verify_images>
<work_buf_min_days>0.010000</work_buf_min_days>
<work_buf_additional_days>0.100000</work_buf_additional_days>
<max_ncpus_pct>50.000000</max_ncpus_pct>
<cpu_scheduling_period_minutes>60.000000</cpu_scheduling_period_minutes>
<disk_interval>360.000000</disk_interval>
<disk_max_used_gb>100.000000</disk_max_used_gb>
<disk_max_used_pct>10.000000</disk_max_used_pct>
<disk_min_free_gb>0.000000</disk_min_free_gb>
<vm_max_used_pct>75.000000</vm_max_used_pct>
<ram_max_used_busy_pct>50.000000</ram_max_used_busy_pct>
<ram_max_used_idle_pct>70.000000</ram_max_used_idle_pct>
<idle_time_to_run>3.000000</idle_time_to_run>
<max_bytes_sec_up>0.000000</max_bytes_sec_up>
<max_bytes_sec_down>0.000000</max_bytes_sec_down>
<cpu_usage_limit>5.000000</cpu_usage_limit>
<daily_xfer_limit_mb>0.000000</daily_xfer_limit_mb>
<daily_xfer_period_days>0</daily_xfer_period_days>
<override_file_present>1</override_file_present>
<network_wifi_only>1</network_wifi_only>
</global_preferences>
<app_file>vboxwrapper_26198ab7_windows_x86_64.exe</app_file>
<app_file>ATLAS_vbox_2.00_job.xml</app_file>
<app_file>ATLAS_vbox_2.00_image.vdi</app_file>
</app_init_data>
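
For comparing slots, the resource-related fields can be pulled out of `init_data.xml` with a few lines of Python. This is only a sketch: the helper name `summarize_init_data` is hypothetical, the element names are simply the ones that appear in the file above, and the sample is a trimmed copy of it.

```python
import xml.etree.ElementTree as ET

# Element paths as they appear in the init_data.xml pasted above.
PATHS = {
    "ncpus": "ncpus",
    "rsc_memory_bound": "rsc_memory_bound",
    "p_ncpus": "host_info/p_ncpus",
    "max_ncpus_pct": "global_preferences/max_ncpus_pct",
    "cpu_usage_limit": "global_preferences/cpu_usage_limit",
}


def summarize_init_data(xml_text):
    """Return the selected fields as floats (None when a field is missing)."""
    root = ET.fromstring(xml_text)
    summary = {}
    for name, path in PATHS.items():
        text = root.findtext(path)
        summary[name] = float(text) if text else None
    return summary


# Trimmed sample using values from the file above.
sample_xml = """<app_init_data>
<ncpus>8.000000</ncpus>
<rsc_memory_bound>10695475200.000000</rsc_memory_bound>
<host_info>
<p_ncpus>28</p_ncpus>
</host_info>
<global_preferences>
<max_ncpus_pct>50.000000</max_ncpus_pct>
<cpu_usage_limit>5.000000</cpu_usage_limit>
</global_preferences>
</app_init_data>"""

summary = summarize_init_data(sample_xml)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With the values from this slot, the task asks for 8 of the host's 28 logical CPUs and has a memory bound of about 10.7 GB.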

vboxtrace.txt

2021-12-18 16:38:53 (24508):
Command: "VBoxSVC.exe" --logrotate 1
Exit Code: 0
Output:

2021-12-18 16:38:54 (24508):
Command: VBoxManage -q --version
Exit Code: 0
Output:
6.1.30r148432

2021-12-18 16:38:54 (24508):
Command: VBoxManage -q list systemproperties
Exit Code: -2147024891
Output:
VBoxManage.exe: error: Failed to create the VirtualBox object!
VBoxManage.exe: error: The object is not ready
VBoxManage.exe: error: Details: code E_ACCESSDENIED (0x80070005), component VirtualBoxClientWrap, interface IVirtualBoxClient

2021-12-18 16:38:55 (24508):
Command: VBoxManage -q list systemproperties
Exit Code: -2147024891
Output:
VBoxManage.exe: error: Failed to create the VirtualBox object!
VBoxManage.exe: error: The object is not ready
VBoxManage.exe: error: Details: code E_ACCESSDENIED (0x80070005), component VirtualBoxClientWrap, interface IVirtualBoxClient

2021-12-18 16:38:56 (24508):
Command: VBoxManage -q list systemproperties
Exit Code: -2147024891
Output:
VBoxManage.exe: error: Failed to create the VirtualBox object!
VBoxManage.exe: error: The object is not ready
VBoxManage.exe: error: Details: code E_ACCESSDENIED (0x80070005), component VirtualBoxClientWrap, interface IVirtualBoxClient

2021-12-18 16:38:58 (24508):
Command: VBoxManage -q list systemproperties
Exit Code: -2147024891
Output:
VBoxManage.exe: error: Failed to create the VirtualBox object!
VBoxManage.exe: error: The object is not ready
VBoxManage.exe: error: Details: code E_ACCESSDENIED (0x80070005), component VirtualBoxClientWrap, interface IVirtualBoxClient

2021-12-18 16:39:46 (24508):
Command: VBoxManage -q list systemproperties
Exit Code: -182
Output:

2021-12-18 16:40:34 (24508):
Command: VBoxManage -q list systemproperties
Exit Code: -182
Output:

2021-12-18 16:41:21 (24508):
Command: VBoxManage -q list hostinfo
Exit Code: -182
Output:

2021-12-18 16:42:09 (24508):
Command: VBoxManage -q list hostinfo
Exit Code: -182
Output:

2021-12-18 16:42:57 (24508):
Command: VBoxManage -q list hostinfo
Exit Code: -182
Output:

2021-12-18 16:43:45 (24508):
Command: VBoxManage -q list hostinfo
Exit Code: -182
Output:

Hopefully this makes sense to somebody... From the info I saw in BOINCmgr, all work units were trying to process with 8 CPU's. I have 50% allocated in BOINC preferences (of 14 CPU's w/ 28 threads), so there are plenty of resources available to BOINC. I have some additional logfiles from some other crashes but they seem to be essentially the same.
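
To make traces from different crashes easier to compare, a short Python sketch like the following could tally each command with its exit code. The names `summarize_trace` and `to_hresult` are made up for illustration, and the regular expression assumes the `Command:` / `Exit Code:` layout shown in vboxtrace.txt above. Reading -2147024891 as an unsigned 32-bit value gives 0x80070005, which matches the E_ACCESSDENIED code in the error text.

```python
import re
from collections import Counter

# Matches one "Command: ..." line followed by its "Exit Code: ..." line.
ENTRY = re.compile(
    r"Command:\s*(?P<command>.+?)\s*\n\s*Exit Code:\s*(?P<code>-?\d+)"
)


def to_hresult(exit_code):
    """Show a signed 32-bit exit code as an unsigned hexadecimal value."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


def summarize_trace(trace_text):
    """Count how often each (command, exit code) pair occurs in the trace."""
    counts = Counter()
    for match in ENTRY.finditer(trace_text):
        counts[(match.group("command"), int(match.group("code")))] += 1
    return counts


# Three entries copied from the trace above.
sample_trace = """2021-12-18 16:38:54 (24508):
Command: VBoxManage -q list systemproperties
Exit Code: -2147024891
Output:
VBoxManage.exe: error: Failed to create the VirtualBox object!

2021-12-18 16:39:46 (24508):
Command: VBoxManage -q list systemproperties
Exit Code: -182
Output:

2021-12-18 16:41:21 (24508):
Command: VBoxManage -q list hostinfo
Exit Code: -182
Output:
"""

counts = summarize_trace(sample_trace)
for (command, code), n in counts.items():
    print(f"{n}x  {command}  exit={code} ({to_hresult(code)})")
```

Run over the full trace, this would show the four E_ACCESSDENIED failures followed by the run of -182 exits, without scrolling through each entry.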

Thanks, Tim

IDENTICAL is only a concept...
9) Message boards : ATLAS application : Repeated computation errors - Missing Files (Message 45881)
Posted 17 Dec 2021 by tgm
Post:
Well, my issue is a bit different... The event log shows that a result file is missing and throws a Computation Error. For example:

12/16/2021 8:40:41 PM | LHC@home | Computation for task EThLDmiK0E0np2BDcpmwOghnABFKDmABFKDmt5BNDmABFKDmSg9dqm_2 finished
12/16/2021 8:40:41 PM | LHC@home | Output file EThLDmiK0E0np2BDcpmwOghnABFKDmABFKDmt5BNDmABFKDmSg9dqm_2_r23588790_ATLAS_result for task EThLDmiK0E0np2BDcpmwOghnABFKDmABFKDmt5BNDmABFKDmSg9dqm_2 absent
12/16/2021 8:40:41 PM | LHC@home | Starting task fUVKDm1i1E0nfZGDcpSWOuwoABFKDmABFKDmTTVVDmABFKDmVgv8mm_2

This is occurring on ALL Atlas work units on this machine. The file missing error is similar in each work unit error log entry.
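
To list every affected task without reading the whole event log, something like the following Python sketch could be used. It assumes the `timestamp | project | message` layout and the "Output file ... for task ... absent" wording shown in the lines above; the function name `tasks_with_missing_output` is hypothetical.

```python
import re

# Wording taken from the event log lines quoted above.
ABSENT = re.compile(r"Output file (?P<file>\S+) for task (?P<task>\S+) absent")


def tasks_with_missing_output(log_lines):
    """Return (timestamp, project, task, file) for each 'absent' message."""
    missing = []
    for line in log_lines:
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3:
            continue  # not a "timestamp | project | message" line
        timestamp, project, message = parts
        match = ABSENT.search(message)
        if match:
            missing.append(
                (timestamp, project, match.group("task"), match.group("file"))
            )
    return missing


# The three event log lines from the post.
sample_log = [
    "12/16/2021 8:40:41 PM | LHC@home | Computation for task EThLDmiK0E0np2BDcpmwOghnABFKDmABFKDmt5BNDmABFKDmSg9dqm_2 finished",
    "12/16/2021 8:40:41 PM | LHC@home | Output file EThLDmiK0E0np2BDcpmwOghnABFKDmABFKDmt5BNDmABFKDmSg9dqm_2_r23588790_ATLAS_result for task EThLDmiK0E0np2BDcpmwOghnABFKDmABFKDmt5BNDmABFKDmSg9dqm_2 absent",
    "12/16/2021 8:40:41 PM | LHC@home | Starting task fUVKDm1i1E0nfZGDcpSWOuwoABFKDmABFKDmTTVVDmABFKDmVgv8mm_2",
]

missing = tasks_with_missing_output(sample_log)
for timestamp, project, task, filename in missing:
    print(f"{timestamp}  {project}  {task}  missing {filename}")
```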

This machine is relatively new with BOINC and Virtualbox software downloaded recently too (latest). The Virtualbox matching extension pack is also installed.

I have seen some BOINC <--> Virtualbox issues with other unrelated apps that seem to be timing issues. It seems that BOINC may be having problems sequencing priorities when an app that is downloaded to use ALL available CPU cores finds that they only infrequently all become available (Cosmology@Home). Other LHC@home apps and Rosetta@Home apps are downloading work units and using Virtualbox just fine. I have not been using Virtualbox for other (non-BOINC) stuff yet.

Any thoughts?

Thanks, Tim

IDENTICAL is only a concept...
10) Message boards : LHCb Application : VM Hypervisor failed to enter an online state in a timely fashion. (Message 32956)
Posted 1 Nov 2017 by tgm
Post:
It seems that this issue has come back again with the latest version of VirtualBox (5.1.30). It also seems to cause machine crashes on one of my machines. Seeing it on both Windows 7 and 10 (no 8 here). I'll be downgrading VirtualBox back to 5.1.26.


