Message boards :
CMS Application :
Updated VirtualBox from 5.X to 7.X, CMS tasks keep failing.
Message board moderation
Author | Message |
---|---|
Send message Joined: 11 Sep 08 Posts: 3 Credit: 6,870,714 RAC: 1,375 |
Hello all, recently I noticed that I was using a very old version of VirtualBox (5.x), and so I updated to 7.x. Since then, CMS tasks have been routinely failing, although not 100% of the time. The problem appears to be that the VM, once stopped, isn't restarting. One sample std_err log: 2024-04-03 12:36:51 (20768): Stopping VM. 2024-04-03 12:36:58 (20768): Successfully stopped VM. 2024-04-03 13:06:37 (19596): Detected: vboxwrapper 26206 2024-04-03 13:06:37 (19596): Detected: BOINC client v7.24.1 2024-04-03 13:06:38 (19596): Detected: VirtualBox VboxManage Interface (Version: 7.0.14) 2024-04-03 13:06:38 (19596): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds) 2024-04-03 13:06:40 (19596): Starting VM using VBoxManage interface. (boinc_780e3e19cb3e9bf9, slot#1) 2024-04-03 13:16:02 (24640): Detected: vboxwrapper 26206 2024-04-03 13:16:02 (24640): Detected: BOINC client v7.24.1 2024-04-03 13:16:03 (24640): Detected: VirtualBox VboxManage Interface (Version: 7.0.14) 2024-04-03 13:16:03 (24640): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds) 2024-04-03 13:16:04 (24640): Starting VM using VBoxManage interface. (boinc_780e3e19cb3e9bf9, slot#1) 2024-04-03 13:16:10 (24640): Error in start VM for VM: -2147467259 Command: VBoxManage -q startvm "boinc_780e3e19cb3e9bf9" --type headless Output: VBoxManage.exe: error: Generic invalid state error. (VERR_INVALID_STATE) VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole Waiting for VM "boinc_780e3e19cb3e9bf9" to power on... 2024-04-03 13:16:10 (24640): VM failed to start. 2024-04-03 13:16:10 (24640): Could not start 2024-04-03 13:16:10 (24640): ERROR: VM failed to start 2024-04-03 13:16:10 (24640): Powering off VM. 2024-04-03 13:16:10 (24640): Deregistering VM. (boinc_780e3e19cb3e9bf9, slot#1) 2024-04-03 13:16:11 (24640): Removing network bandwidth throttle group from VM. 2024-04-03 13:16:11 (24640): Removing VM from VirtualBox. Another, slightly different example: 2024-04-03 15:38:25 (6724): Stopping VM. 2024-04-03 15:38:33 (6724): Successfully stopped VM. 2024-04-03 15:52:15 (22312): Detected: vboxwrapper 26206 2024-04-03 15:52:15 (22312): Detected: BOINC client v7.24.1 2024-04-03 15:52:15 (22312): Detected: VirtualBox VboxManage Interface (Version: 7.0.14) 2024-04-03 15:52:16 (22312): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds) 2024-04-03 15:52:17 (22312): Starting VM using VBoxManage interface. (boinc_2db81241638a5c14, slot#9) 2024-04-03 16:23:12 (29740): Detected: vboxwrapper 26206 2024-04-03 16:23:12 (29740): Detected: BOINC client v7.24.1 2024-04-03 16:23:12 (29740): Detected: VirtualBox VboxManage Interface (Version: 7.0.14) 2024-04-03 16:23:13 (29740): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds) 2024-04-03 16:23:19 (29740): VM was running 2024-04-03 16:23:19 (29740): Powering off VM. 2024-04-03 16:23:20 (29740): Error in poweroff VM for VM: -2135228414 Command: VBoxManage -q controlvm "boinc_2db81241638a5c14" poweroff Output: VBoxManage.exe: error: Invalid machine state: Restoring (must be Running, Paused or Stuck) VBoxManage.exe: error: Details: code VBOX_E_INVALID_VM_STATE (0x80bb0002), component ConsoleWrap, interface IConsole, callee IUnknown VBoxManage.exe: error: Context: "PowerDown(progress.asOutParam())" at line 604 of file VBoxManageControlVM.cpp 2024-04-03 16:23:20 (29740): VM did not power off when requested. 2024-04-03 16:23:20 (29740): VM was successfully terminated. 2024-04-03 16:23:20 (29740): Starting VM using VBoxManage interface. (boinc_2db81241638a5c14, slot#9) 2024-04-03 16:23:51 (29740): Error in start VM for VM: -2135228409 Command: VBoxManage -q startvm "boinc_2db81241638a5c14" --type headless Output: VBoxManage.exe: error: The machine 'boinc_2db81241638a5c14' is already locked by a session (or being locked or unlocked) VBoxManage.exe: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component MachineWrap, interface IMachine, callee IUnknown VBoxManage.exe: error: Context: "LaunchVMProcess(a->session, sessionType.raw(), ComSafeArrayAsInParam(aBstrEnv), progress.asOutParam())" at line 881 of file VBoxManageMisc.cpp The VBox.log typically shows the following assertion failed error: 00:00:00.919694 Oracle VM VirtualBox Extension Pack (Version: 7.0.14 r161095; VRDE Module: VBoxVRDP; Crypto Module: VBoxPuelCrypto) 00:00:00.920728 Console: Machine state changed to 'Restoring' 00:00:01.016011 VRDP: Statistics created: [full], enabled: 0. 00:00:01.016813 VRDP: VRDP: VD: Frames=10 MinMS=15 MaxMS=300 HistoryMS=2000 VideoMS=300 00:00:01.016909 VRDP: Failed to query a property: 9, rc = VERR_NOT_SUPPORTED 00:00:01.036775 VRDP: TCP server listening on port 13675 (IPv4 and IPv6). 00:00:01.037352 VRDP: Audio: rate correction mode 0x3. 00:00:01.037412 VRDE: loaded version 4 of the server. 00:00:01.037438 VRDE: [IMAGE] 00:00:01.037446 VRDE: [MOUSEPTR] 00:00:01.037453 VRDE: [SCARD] 00:00:01.037458 VRDE: [TSMFRAW] 00:00:01.037462 VRDE: [VIDEOIN] 00:00:01.037467 VRDE: [VRDE::INPUT] 00:00:01.125367 SUP: Opened VMMR0.r0 (C:\Program Files\Oracle\VirtualBox/VMMR0.r0) at 0xXXXXXXXXXXXXXXXX loaded by the native ring-0 loader. 00:00:01.125442 SUP: windbg> .reload /f C:\Program Files\Oracle\VirtualBox/VMMR0.r0=0xXXXXXXXXXXXXXXXX 00:00:01.171470 00:00:01.171474 !!Assertion Failed!! 00:00:01.171475 Expression: RT_VALID_PTR(pVM) 00:00:01.171477 Location : D:\tinderboxa\win-7.0\src\VBox\VMM\VMMR3\VM.cpp(602) int __cdecl vmR3CreateU(struct UVM *,unsigned int,int (__cdecl *)(struct UVM *,struct VM *,const struct VMMR3VTABLE *,void *) noexcept,void *) 00:00:01.171479 Stack : 00:00:01.171479 00007ffe2af3f020 VBoxRT.dll!RTAssertMsg1+0xe0 (rva:0x13f020) 00:00:01.171480 00007ffe29344e24 VBoxVMM.DLL!TRPMR3ResetCpu+0x164 (rva:0xd4e24) 00:00:01.171480 00007ffe2934d425 VBoxVMM.DLL!VMR3WaitForDeviceReady+0x635 (rva:0xdd425) 00:00:01.171480 00007ffe2934e50c VBoxVMM.DLL!VMR3ReqPriorityCallWaitU+0x15c (rva:0xde50c) 00:00:01.171480 00007ffe2934c622 VBoxVMM.DLL!VMR3Teleport+0x18b2 (rva:0xdc622) 00:00:01.171480 00007ffe2af56876 VBoxRT.dll!RTThreadWaitNoResume+0x336 (rva:0x156876) 00:00:01.171480 00007ffe2b004434 VBoxRT.dll!RTSystemGetNtVersion+0x1b4 (rva:0x204434) 00:00:01.171481 00007fff245f9363 ucrtbase.dll!_recalloc+0xa3 (rva:0x29363) 00:00:01.171481 00007fff2606257d KERNEL32.DLL!BaseThreadInitThunk+0x1d (rva:0x1257d) 00:00:01.171481 00007fff270aaa58 ntdll.dll!RtlUserThreadStart+0x28 (rva:0x5aa58) 00:00:01.171481 0000000000000000 00:00:01.171482 00:00:01.196530 pVM=0000000000000000 pVMR0=000000000000000c 00:00:01.197802 00:00:01.197803 !!! rtR3WinUnhandledXcptFilter caught an exception on thread 0000000000003f20 in 29924 !!! 00:00:01.197814 00:00:01.197814 ExceptionCode=0x80000003 ExceptionFlags=0x00000000 ExceptionAddress=00007ffe29344e3f 00:00:01.197820 ExceptionInformation[0]=0000000000000000 00:00:01.197826 00:00:01.197826 cs:rip=0033:00007ffe29344e3f 00:00:01.197832 ss:rsp=002b:000000ffda8ffa00 rbp=0000000000000001 00:00:01.197837 rax=0000000000000001 rcx=4ae69112c0280000 rdx=0000000000000000 rbx=0000000000000000 00:00:01.197843 rsi=0000000000000000 rdi=000002b55e1a3000 rsp=000000ffda8ffa00 rbp=0000000000000001 00:00:01.197848 r8 =0000000000000003 r9 =000000ffda8ff958 r10=00000fffe4b964cf r11=0000000000088000 00:00:01.197852 r12=0000000000000000 r13=000002b55e0bdae0 r14=00007ffe29688f40 r15=0000000000000001 00:00:01.197857 ds=002b es=002b fs=0053 gs=002b eflags=00000202 00:00:01.197861 p1home=0000000000000001 p2home=00007fff00000002 pe3home=0000000000000000 00:00:01.197937 p4home=0000000000000000 p5home=0000000000000001 pe6home=00007fff25153c90 00:00:01.197944 LastBranchToRip=0000000000000000 LastBranchFromRip=0000000000000000 00:00:01.197949 LastExceptionToRip=0000000000000000 LastExceptionFromRip=0000000000000000 00:00:01.197956 The VBoxHardening.log for the above failure ends with this: 74e4.737c: supR3HardenedWinVerifyCacheProcessImportTodos: 'ws2_32.dll' -> '\Device\HarddiskVolume2\Windows\System32\ws2_32.dll' [rcNtRedir=0xc0150008] 74e4.737c: supR3HardenedScreenImage/Imports: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\ws2_32.dll 74e4.737c: supR3HardenedMonitor_LdrLoadDll: pName=C:\WINDOWS\system32\mswsock.dll (rcNtResolve=0xc0150008) *pfFlags=0x0 pwszSearchPath=0000000000000001:<flags> [calling] 74e4.737c: supR3HardenedScreenImage/NtCreateSection: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\mswsock.dll 74e4.737c: supR3HardenedDllNotificationCallback: load 00007fff23940000 LB 0x00069000 C:\WINDOWS\system32\mswsock.dll [fFlags=0x0] 74e4.737c: supR3HardenedScreenImage/LdrLoadDll: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\mswsock.dll 74e4.737c: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff23940000 'C:\WINDOWS\system32\mswsock.dll' 74e4.737c: supR3HardenedScreenImage/LdrLoadDll: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\mswsock.dll 74e4.737c: supR3HardenedMonitor_LdrLoadDll: pName=C:\WINDOWS\system32\mswsock.dll (rcNtResolve=0xc0150008) *pfFlags=0x0 pwszSearchPath=0000000000000001:<flags> [calling] 74e4.737c: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff23940000 'C:\WINDOWS\system32\mswsock.dll' 74e4.3f20: supR3HardenedScreenImage/LdrLoadDll: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\ole32.dll 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: pName=C:\WINDOWS\system32\ole32.dll (rcNtResolve=0xc0150008) *pfFlags=0x0 pwszSearchPath=0000000000000001:<flags> [calling] 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff26b80000 'C:\WINDOWS\system32\ole32.dll' 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff23420000 'C:\WINDOWS\system32\rsaenh.dll' 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff246f0000 'C:\WINDOWS\System32\crypt32.dll' 74e4.3f20: KiUserExceptionDispatcher: 0x80000003 (0000000000000000) @ 00007ffe29344e3f (flags=0x0) rax=0000000000000001 rbx=0000000000000000 rcx=4ae69112c0280000 rdx=0000000000000000 rsi=0000000000000000 rdi=000002b55e1a3000 r8 =0000000000000003 r9 =000000ffda8ff958 r10=00000fffe4b964cf r11=0000000000088000 r12=0000000000000000 r13=000002b55e0bdae0 r14=00007ffe29688f40 r15=0000000000000001 P1=0000000000000001 P2=00007fff00000002 rip=00007ffe29344e3f rsp=000000ffda8ffa00 rbp=0000000000000001 ctxflags=0010005f cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b eflags=00000202 mxcrx=00001fa0 P3=0000000000000001 P4=0000000000000000 P5=0000000000000001 P6=00007fff25153c90 dr0=0000000000000000 dr1=0000000000000000 dr2=0000000000000000 dr3=0000000000000000 dr6=0000000000000000 dr7=0000000000000000 vcr=00004a194b9d3568 dcr=0000000000000000 lbt=0000000000000000 lbf=0000000000000000 lxt=0000000000000000 lxf=0000000000000000 74e4.3f20: supR3HardenedIsApiSetDll: ApiSetQueryApiSetPresence(ext-ms-win-kernel32-errorhandling-l1-1-0.dll) -> 0x0, fPresent=1 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: pName=ext-ms-win-kernel32-errorhandling-l1-1-0.dll (rcNtResolve=0x0) *pfFlags=0x0 pwszSearchPath=0000000000000001:<flags> [calling] 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff26050000 'ext-ms-win-kernel32-errorhandling-l1-1-0.dll' 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff23420000 'C:\WINDOWS\system32\rsaenh.dll' 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff246f0000 'C:\WINDOWS\System32\crypt32.dll' 74e4.3f20: '\Device\HarddiskVolume2\Windows\System32\ntdll.dll' has no imports 74e4.3f20: supHardenedWinVerifyImageByHandle: -> 0 (\Device\HarddiskVolume2\Windows\SystVem32\ntdll.dll) WinVerifyTrust 74e4.3f20: supR3HardenedWinVerifyCacheInsert: \Device\HarddiskVolume2\Windows\System32\ntdll.dll 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: pName=C:\WINDOWS\System32\ntdll.dll (Input=ntdll.dll, rcNtResolve=0xc0150008) *pfFlags=0x0 pwszSearchPath=0000000000000801:<flags> [calling] 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff27050000 'C:\WINDOWS\System32\ntdll.dll' 74e4.3f20: supR3HardenedScreenImage/LdrLoadDll: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\KernelBase.dll 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: pName=C:\WINDOWS\System32\KernelBase.dll (Input=KernelBase, rcNtResolve=0xc0150008) *pfFlags=0x0 pwszSearchPath=0000000000000801:<flags> [calling] 74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff24950000 'C:\WINDOWS\System32\KernelBase.dll' What I've done so far
Tried to fix the media registry by following this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6127
|
Send message Joined: 2 May 07 Posts: 2129 Credit: 160,203,932 RAC: 30,883 |
Have you those CMS-Tasks finished before upgrading Virtualbox? This is important. |
Send message Joined: 11 Sep 08 Posts: 3 Credit: 6,870,714 RAC: 1,375 |
If you mean abort all outstanding tasks, I did when I did the complete reset/uninstall/reinstall procedure. |
Send message Joined: 2 May 07 Posts: 2129 Credit: 160,203,932 RAC: 30,883 |
Yes, this was my question. Now you have successful CMS-Tasks, but the Project have some trouble atm. |
Send message Joined: 11 Sep 08 Posts: 3 Credit: 6,870,714 RAC: 1,375 |
One thing I came across in my research is this thread that talks about this bug when multiple VMs are started simultaneously: https://forums.virtualbox.org/viewtopic.php?p=540029#p540029 I do have multiple CMS tasks in BOINC when this occurs. Maybe the client needs to have a way to stagger the restarts? |
Send message Joined: 27 Sep 08 Posts: 811 Credit: 657,448,757 RAC: 223,379 |
I don't seem to have too many issues, one of my computer has 125 VMs running at the same time at the moment. Gradually ramping up the WUs would probally be a BOINC feature, I'm not sure LHC could manage that. The only trigger i can see is rebooting, this often seems to cause Theroy but not CMS to have a problem. In regards to a messed up config of virtualbox I can not see what cause it but when its broke the the hanging disk images then it breaks all the work for that project. I made my own clean up scripts for Linux, I could put them on Github and port to windows if they are helpful for other. I made my own scripts since, for the LHC project I think the failure rate is acceptable to them, looking at my BOINC stats I have 0.98 % error rate at the moment from 14035 WUs. I can imagine that the error rate on the CERN side is less since the work inside the VM might have worked. If I was to make the upgrade from 5.x to 7.x I would let all the work drain out then check the settings to make sure there was nothing hanging around then make the upgrade and restart the work. For a .x release I just install over the top after stopping the work. |
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,343,457 RAC: 3,497 |
Toby, what kind of data storage do you have to be able to run that many concurrent VMs? Mine seemed to have more trouble with a few tasks running with the data on a hard drive. I recently moved my data to a SSD that I wasn't using. |
Send message Joined: 27 Sep 08 Posts: 811 Credit: 657,448,757 RAC: 223,379 |
|
©2024 CERN