Thread 'vBox could not find machine - ERROR [COM]: aRC=VBOX_E_OBJECT_NOT

Author	Message
skydivingnerd Send message Joined: 8 Apr 21 Posts: 23 Credit: 65,518,838 RAC: 18,091	Message 45663 - Posted: 11 Nov 2021, 17:17:00 UTC My Win10 host is running BOINC 7.16.20 with vBox 6.1.16 plus the 6.1.16 Extension pack. I've been encountering this issue where my CMS tasks are failing because vBox is unable to find the VM to run the task. I've uninstalled and reinstalled BOINC and vBox several times with the issue persisting. I'm unsure how to continue troubleshooting this. Most recent failed task https://lhcathome.cern.ch/lhcathome/result.php?resultid=332187349 Tasks page for my host https://lhcathome.cern.ch/lhcathome/results.php?hostid=10687301&offset=0&show_names=0&state=6&appid=11 I can see in the BOINC logs that the VM is started in a project slot 2021-11-10 14:25:35 (15216): forwarding host port 52026 to guest port 80 2021-11-10 14:25:35 (15216): Enabling remote desktop for VM. 2021-11-10 14:25:36 (15216): Enabling shared directory for VM. 2021-11-10 14:25:36 (15216): Starting VM using VBoxManage interface. (boinc_62fd0aaede4ddcf6, slot#16) 2021-11-10 14:25:41 (15216): Successfully started VM. (PID = '2380') 2021-11-10 14:25:41 (15216): Reporting VM Process ID to BOINC. 2021-11-10 14:25:41 (15216): Guest Log: BIOS: VirtualBox 6.1.16 2021-11-10 14:25:41 (15216): Guest Log: CPUID EDX: 0x178bfbff 2021-11-10 14:25:41 (15216): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63 2021-11-10 14:25:41 (15216): VM state change detected. (old = 'poweredoff', new = 'running') But then a short time later, in the vBox hypervisor logs show that the registered VM cannot be found. 17:35:01.292412 ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001) aIID={d0a0163f-e254-4e5b-a1f2-011cf991c38d} aComponent={VirtualBoxWrap} aText={Could not find a registered machine named 'boinc_62fd0aaede4ddcf6'}, preserve=false aResultDetail=0 The VM trace logs show that it takes about 20 minutes for it to error out and initiate a shutdown of the VM even though it logs it starting. 2021-11-10 14:25:36 (15216): Command: VBoxManage -q sharedfolder add "boinc_62fd0aaede4ddcf6" --name "shared" --hostpath "C:\ProgramData\BOINC\slots\16/shared" Exit Code: 0 Output: 2021-11-10 14:25:40 (15216): Command: VBoxManage -q startvm "boinc_62fd0aaede4ddcf6" --type headless Exit Code: 0 Output: Waiting for VM "boinc_62fd0aaede4ddcf6" to power on... VM "boinc_62fd0aaede4ddcf6" has been successfully started. 2021-11-10 14:25:42 (15216): Command: VBoxManage -q controlvm "boinc_62fd0aaede4ddcf6" cpuexecutioncap 100 Exit Code: 0 Output: 2021-11-10 14:45:43 (15216): Command: VBoxManage -q controlvm "boinc_62fd0aaede4ddcf6" poweroff Exit Code: 0 Output: 0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100% 2021-11-10 14:45:43 (15216): Command: VBoxManage -q snapshot "boinc_62fd0aaede4ddcf6" list Exit Code: -108 Output: This machine does not have any snapshots 2021-11-10 14:45:44 (15216): Command: VBoxManage -q bandwidthctl "boinc_62fd0aaede4ddcf6" remove "boinc_62fd0aaede4ddcf6_net" Exit Code: 0 Output: What can be looked at next to nail this error down? R/S Scott ID: 45663 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2726 Credit: 300,763,464 RAC: 64,031	Message 45664 - Posted: 11 Nov 2021, 19:53:51 UTC - in response to Message 45663. sks fail very early. This can be seen looking at the CPU time (only 35 seconds). After that the VM remains idle until the heartbeat watchdog shuts down the VM (hence the long total runtime): [pre]2021-11-10 14:45:43 (15216): VM Heartbeat file specified, but missing. 2021-11-10 14:45:43 (15216): VM Heartbeat file specified, but missing file system status. (errno = '2')[/pre] Errors reported after that are caused by some kind of a race condition: There are subroutines that try to check/report the VM's status while others are working on the shutdown. To solve the issue it's necessary to find out why the tasks fail early. Possible reasons: - the vdi got damaged -> reset the project to get a fresh download - wrong/incomplete access rights in \slots\ -> the BOINC user/group requires write access to create/update the vdi and the watchdog file - overloaded disk I/O; How many tasks are writing to the disk concurrently? Is there enough RAM left for the disk cache? ID: 45664 · Reply Quote

skydivingnerd Send message Joined: 8 Apr 21 Posts: 23 Credit: 65,518,838 RAC: 18,091	Message 45665 - Posted: 11 Nov 2021, 21:26:18 UTC - in response to Message 45664. I just reset the project and will know what that does tomorrow afternoon. On the permissions front, my user account runs the BOINC process and and Full Control rights to the C:\ProgramData\BOINC and C:\ProgramData\BOINC\slots directories. I can see my user account owns the processes for BOINC Mgr and the running tasks via Process Explorer. My host has a Samsung 860 EVO 1TB SSD and resource monitor shows disk I/O from 1-4% with BOINC running 16 threads of Rosetta@home tasks. In Use memory is at ~17GB out of 32GB in the system. When I was attempting to run more than two CMS tasks a while back, I would get notices in BOINC that a CMS task was paused, waiting for memory. That said, I've had 14 Rosetta and 2 CMS tasks running concurrently on this host. The past few days LHC@home has only assigned a single CMS task to this host due to the number of failures. R/S Scott ID: 45665 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2726 Credit: 300,763,464 RAC: 64,031	Message 45666 - Posted: 12 Nov 2021, 8:44:08 UTC - in response to Message 45665. This CPU (AMD Ryzen 7 3700X) has 8 floating point units which are under extreme load if you run 14 Rosetta tasks and 2 CMS tasks concurrently. Linux provides a monitoring value called "load average" which gives an impression of how many commands are waiting for execution in the CPU queue. Unfortunately Windows doesn't provide a comparable monitoring value. In reality the correlation is much more complex but it doesn't help to get lost deeper in any details. What you see is the result of running too many heavy processes concurrently. At a certain point they start to switch into a panic mode that makes the situation even worse. To stay below that point you may try to run an project mix that puts less load on the floating point units. ID: 45666 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2291 Credit: 178,881,263 RAC: 2,696	Message 45667 - Posted: 12 Nov 2021, 9:53:52 UTC - in response to Message 45666. Now is 19043 active. Your PC show: Microsoft Windows 10 Professional x64 Edition, (10.00.19042.00) As computezrmle said, reduce your work in Boinc to see when RAM is running without swapping. ID: 45667 · Reply Quote

skydivingnerd Send message Joined: 8 Apr 21 Posts: 23 Credit: 65,518,838 RAC: 18,091	Message 45669 - Posted: 12 Nov 2021, 17:47:47 UTC - in response to Message 45666. Thanks. I've set Rosetta to not fetch new tasks so only LHC CMS tasks will run. I'll see if that does it. If it does, is this an issue of needing more system memory so everything can run in memory without drive swapping? Or is this a limitation on floating point calculations of the processor? R/S Scott ID: 45669 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2726 Credit: 300,763,464 RAC: 64,031	Message 45672 - Posted: 12 Nov 2021, 20:49:07 UTC - in response to Message 45669. A while ago you wrote: ... In Use memory is at ~17GB out of 32GB ... Now you wrote: ... everything can run in memory without drive swapping? Sure you see (heavy) swapping activity? Looks like half of your RAM is unused or used by the disk cache. ID: 45672 · Reply Quote

skydivingnerd Send message Joined: 8 Apr 21 Posts: 23 Credit: 65,518,838 RAC: 18,091	Message 45675 - Posted: 13 Nov 2021, 14:06:02 UTC - in response to Message 45672. That's just it, no I do not see it. I'm going down the path of troubleshooting the vbox.log and vboxhardening.log of the individual VMs when they are running. I've also uninstalled my AntiVirus, rebooted, and ensured its cleared out. The last CMS task my host got yesturday was during a time when Rosetta@home was not sending tasks and my host only had a couple of them running when the CMS task started. https://lhcathome.cern.ch/lhcathome/result.php?resultid=332568799 The CMS task stderr log shows that it there was 26374 MB of memory available when it started. Memory size: 32673 MByte Memory available: 26374 MByte This task still failed in the same way as the past several days despite a much lower CPU and memory load. Note: I have a companion thread to this one on vbox.log and vboxhardening.log capture. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5754 ID: 45675 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2291 Credit: 178,881,263 RAC: 2,696	Message 45676 - Posted: 13 Nov 2021, 14:29:25 UTC - in response to Message 45675. That is from your CMS-task with no success: 2021-11-11 12:57:41 (15744): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 600 seconds)) That is from a successful CMS-Task: 2021-11-06 07:18:56 (12680): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds) ID: 45676 · Reply Quote

skydivingnerd Send message Joined: 8 Apr 21 Posts: 23 Credit: 65,518,838 RAC: 18,091	Message 45679 - Posted: 13 Nov 2021, 15:10:50 UTC - in response to Message 45676. Last modified: 13 Nov 2021, 15:11:33 UTC That is from your CMS-task with no success: 2021-11-11 12:57:41 (15744): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 600 seconds)) That is from a successful CMS-Task: 2021-11-06 07:18:56 (12680): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds) I'm not understanding how that can help. The heartbeat time frame is a variable of the project/task. It's not a user configurable setting that would need to be "corrected" to ensure tasks complete. ID: 45679 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2291 Credit: 178,881,263 RAC: 2,696	Message 45682 - Posted: 13 Nov 2021, 17:13:35 UTC - in response to Message 45679. Last modified: 13 Nov 2021, 17:44:32 UTC (Vbox_job.xml: 600 seconds) Don't find it in my CMS-Task. In a Theory-task is this line shown: 2021-11-12 14:11:49 (5584): Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 1200 seconds) or (Vbox_job.xml: 600 seconds)) ID: 45682 · Reply Quote

skydivingnerd Send message Joined: 8 Apr 21 Posts: 23 Credit: 65,518,838 RAC: 18,091	Message 45688 - Posted: 13 Nov 2021, 19:34:52 UTC So removing Anti-virus did not solve the issue. This task just failed at 1349 EST. https://lhcathome.cern.ch/lhcathome/result.php?resultid=332726303 ID: 45688 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2726 Credit: 300,763,464 RAC: 64,031	Message 45691 - Posted: 13 Nov 2021, 21:35:07 UTC - in response to Message 45688. ll looks like a credential issue. Did you see this? [pre] Hypervisor System Log: 00:00:00.262874 ExtPack: Created cloud provider 'OCI' (hrc=ERROR_SUCCESS)[/pre] Did you check this? https://www.virtualbox.org/manual/ch01.html#cloud-integration https://forums.virtualbox.org/viewtopic.php?f=6&t=93953 ID: 45691 · Reply Quote

skydivingnerd Send message Joined: 8 Apr 21 Posts: 23 Credit: 65,518,838 RAC: 18,091	Message 45705 - Posted: 15 Nov 2021, 23:18:08 UTC - in response to Message 45691. I made a guestimate and was waiting on my host when it received a CMS task. I've captured the VBox, VBoxHardening, VBoxUI, and vbox_trace logs from the ./slots/ folder the task was running in. I'm posting to the VirtualBox forums for help in troubleshooting the logs. https://forums.virtualbox.org/viewtopic.php?f=3&t=104465 If anyone knows how to review and troubleshoot VBox logs, I've uploaded them to a workdrive space https://workdrive.zoho.com/folder/pgoec33ffeb6d9e36461c9a953c076976b93c R/S Scott ID: 45705 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 925 Credit: 780,386,966 RAC: 121,550	Message 45708 - Posted: 16 Nov 2021, 20:05:56 UTC Maybe there is a lot of junk media left over from crashed VM's, since it looks like the hard disk was lost. In regards to the comments on the VirtualBox thread, I think it by design that there is 8MB of VRAM there is no GUI in the VM image so no need for 128MB of graphics memory. its also by design that there is one CPU core and 2 GB of ram to no swamp the host computers with tons of ram usage. Sometimes i get the hung one with exactly the same error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=332767550 but I ignore them since they are only 0.3% of total. Sometime I got stuck in many errors I just abort everything, ensure everything is empty in BOINC and VirtualBox, limit it just to one WU at a time to see if it cleaned out then ramp up from there. ID: 45708 · Reply Quote

skydivingnerd Send message Joined: 8 Apr 21 Posts: 23 Credit: 65,518,838 RAC: 18,091	Message 45727 - Posted: 19 Nov 2021, 23:12:46 UTC - in response to Message 45708. Last modified: 19 Nov 2021, 23:13:31 UTC Maybe there is a lot of junk media left over from crashed VM's, since it looks like the hard disk was lost. I experienced issues that left junk in the .\slots\ folder and it was one of the first troubleshooting steps I did when I started digging into this on my Win10 client. The VM crash persists through multiple VirtualBox versions, 6.1.12, 6.1.16, 6.1.28 and ensuring the BOINC data directory .\slots\ folders and the VBox manager are clean. So far, I've not gotten far on the Virtual Box forums. There are some suggestions on the underlying configuration of the LHC@Home VM, but I'm not in anyway qualified to raise these to the project. As far as I can search on the LHC@Home forums, I can't find anything like what I'm experiencing. If it goes on much longer with no appreciable suggestions on troubleshooting on the contents of the log files, I may have to just remove the project from my Win10 host. I don't want to do that. But I won't have a choice if I can't troubleshoot the issue and just keep sending failing work-units back. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10687301&offset=0&show_names=0&state=6&appid=11 ID: 45727 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2291 Credit: 178,881,263 RAC: 2,696	Message 45728 - Posted: 19 Nov 2021, 23:41:22 UTC - in response to Message 45727. What's about running Atlas or Theory instead of CMS, to see what happens. They need also Virtualbox. ID: 45728 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2726 Credit: 300,763,464 RAC: 64,031	Message 45729 - Posted: 21 Nov 2021, 8:48:45 UTC - in response to Message 45727. s a relevant change in the logfile entries starting with this task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=333033783 Before the errors started with a long delay after the VM jumped into the bootstrap script: [pre]2021-11-15 10:14:24 (13320): Guest Log: 00:00:00.001316 main 5.2.6 r120293 started. Verbose level = 0 ### VM hangs here 2021-11-15 10:28:48 (13320): Guest Log: [INFO] Mounting the shared directory 2021-11-15 10:32:54 (13320): VM Heartbeat file specified, but missing. 2021-11-15 10:32:54 (13320): VM Heartbeat file specified, but missing file system status. (errno = '2')[/pre] This finally results in a heartbeat shutdown. The new logfiles show that the VM successfully completes the bootstrap script and starts "glidein" which is the main wrapper of the scientific payload. [pre]2021-11-16 10:12:32 (3888): Guest Log: [INFO] CMS application starting. Check log files. 2021-11-16 10:47:43 (3888): Guest Log: [ERROR] glidein exited with return value 1.[/pre] The major difference is that newer tasks use a local Squid which allows to download additional scripts required within bootstrap. This also affects scripts that are essential to make the logfile more verbose or test important network connections. Older tasks do not use the Squid. Hence, I suspect direct HTTP request from that computer to CERN might be blocked by a local firewall. I also suspect this to be the reason why glidein fails. The VM requires a couple of TCP ports to be open. Beside the ports that are tested within bootstrap CMS requires 1094 and 8000. ID: 45729 · Reply Quote

skydivingnerd Send message Joined: 8 Apr 21 Posts: 23 Credit: 65,518,838 RAC: 18,091	Message 45730 - Posted: 21 Nov 2021, 23:24:46 UTC - in response to Message 45729. I went and looked that over and see what you've found. I believe that is when I updated VBox to 6.1.28. I did see the Glidin error a few days ago and began looking into that, wondering how to troubleshoot port connectivity from the VM perspective. I also suspect this to be the reason why glidein fails. The VM requires a couple of TCP ports to be open. Beside the ports that are tested within bootstrap CMS requires 1094 and 8000. I've had port 8000 open since I started the project, but TCP/1094 was removed as I was told it was no longer needed from the LHC@Home FAQ port list https://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use I've added both TCP/9094 and TCP/1094 back into my BOINC aliases list and now have a running CMS VBox task on my Win10 host. The task has been running for about an hour now and is using ~86% of a CPU. Much different from before where the task would end at <21 minutes and consume no more than 3-4% CPU. Can the LHC@Home port listing be re-validated to ensure all used ports are on the FAQ? ID: 45730 · Reply Quote

skydivingnerd Send message Joined: 8 Apr 21 Posts: 23 Credit: 65,518,838 RAC: 18,091	Message 45733 - Posted: 22 Nov 2021, 12:04:23 UTC - in response to Message 45730. I have a completed work-unit and more in progress! https://lhcathome.cern.ch/lhcathome/result.php?resultid=334074794 ID: 45733 · Reply Quote