Message boards : CMS Application : Updated VirtualBox from 5.X to 7.X, CMS tasks keep failing.
Message board moderation

To post messages, you must log in.

AuthorMessage
doctorgonzo

Send message
Joined: 11 Sep 08
Posts: 3
Credit: 6,870,714
RAC: 1,375
Message 49886 - Posted: 3 Apr 2024, 22:21:45 UTC

Hello all, recently I noticed that I was using a very old version of VirtualBox (5.x), and so I updated to 7.x. Since then, CMS tasks have been routinely failing, although not 100% of the time.

The problem appears to be that the VM, once stopped, isn't restarting. One sample std_err log:

2024-04-03 12:36:51 (20768): Stopping VM.
2024-04-03 12:36:58 (20768): Successfully stopped VM.
2024-04-03 13:06:37 (19596): Detected: vboxwrapper 26206
2024-04-03 13:06:37 (19596): Detected: BOINC client v7.24.1
2024-04-03 13:06:38 (19596): Detected: VirtualBox VboxManage Interface (Version: 7.0.14)
2024-04-03 13:06:38 (19596): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2024-04-03 13:06:40 (19596): Starting VM using VBoxManage interface. (boinc_780e3e19cb3e9bf9, slot#1)
2024-04-03 13:16:02 (24640): Detected: vboxwrapper 26206
2024-04-03 13:16:02 (24640): Detected: BOINC client v7.24.1
2024-04-03 13:16:03 (24640): Detected: VirtualBox VboxManage Interface (Version: 7.0.14)
2024-04-03 13:16:03 (24640): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2024-04-03 13:16:04 (24640): Starting VM using VBoxManage interface. (boinc_780e3e19cb3e9bf9, slot#1)
2024-04-03 13:16:10 (24640): Error in start VM for VM: -2147467259
Command:
VBoxManage -q startvm "boinc_780e3e19cb3e9bf9" --type headless
Output:
VBoxManage.exe: error: Generic invalid state error. (VERR_INVALID_STATE)
VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole
Waiting for VM "boinc_780e3e19cb3e9bf9" to power on...

2024-04-03 13:16:10 (24640): VM failed to start.
2024-04-03 13:16:10 (24640): Could not start 
2024-04-03 13:16:10 (24640): ERROR: VM failed to start
2024-04-03 13:16:10 (24640): Powering off VM.
2024-04-03 13:16:10 (24640): Deregistering VM. (boinc_780e3e19cb3e9bf9, slot#1)
2024-04-03 13:16:11 (24640): Removing network bandwidth throttle group from VM.
2024-04-03 13:16:11 (24640): Removing VM from VirtualBox.


Another, slightly different example:
2024-04-03 15:38:25 (6724): Stopping VM.
2024-04-03 15:38:33 (6724): Successfully stopped VM.
2024-04-03 15:52:15 (22312): Detected: vboxwrapper 26206
2024-04-03 15:52:15 (22312): Detected: BOINC client v7.24.1
2024-04-03 15:52:15 (22312): Detected: VirtualBox VboxManage Interface (Version: 7.0.14)
2024-04-03 15:52:16 (22312): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2024-04-03 15:52:17 (22312): Starting VM using VBoxManage interface. (boinc_2db81241638a5c14, slot#9)
2024-04-03 16:23:12 (29740): Detected: vboxwrapper 26206
2024-04-03 16:23:12 (29740): Detected: BOINC client v7.24.1
2024-04-03 16:23:12 (29740): Detected: VirtualBox VboxManage Interface (Version: 7.0.14)
2024-04-03 16:23:13 (29740): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2024-04-03 16:23:19 (29740): VM was running
2024-04-03 16:23:19 (29740): Powering off VM.
2024-04-03 16:23:20 (29740): Error in poweroff VM for VM: -2135228414
Command:
VBoxManage -q controlvm "boinc_2db81241638a5c14" poweroff
Output:
VBoxManage.exe: error: Invalid machine state: Restoring (must be Running, Paused or Stuck)
VBoxManage.exe: error: Details: code VBOX_E_INVALID_VM_STATE (0x80bb0002), component ConsoleWrap, interface IConsole, callee IUnknown
VBoxManage.exe: error: Context: "PowerDown(progress.asOutParam())" at line 604 of file VBoxManageControlVM.cpp

2024-04-03 16:23:20 (29740): VM did not power off when requested.
2024-04-03 16:23:20 (29740): VM was successfully terminated.
2024-04-03 16:23:20 (29740): Starting VM using VBoxManage interface. (boinc_2db81241638a5c14, slot#9)
2024-04-03 16:23:51 (29740): Error in start VM for VM: -2135228409
Command:
VBoxManage -q startvm "boinc_2db81241638a5c14" --type headless
Output:
VBoxManage.exe: error: The machine 'boinc_2db81241638a5c14' is already locked by a session (or being locked or unlocked)
VBoxManage.exe: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component MachineWrap, interface IMachine, callee IUnknown
VBoxManage.exe: error: Context: "LaunchVMProcess(a->session, sessionType.raw(), ComSafeArrayAsInParam(aBstrEnv), progress.asOutParam())" at line 881 of file VBoxManageMisc.cpp


The VBox.log typically shows the following assertion failed error:
00:00:00.919694   Oracle VM VirtualBox Extension Pack (Version: 7.0.14 r161095; VRDE Module: VBoxVRDP; Crypto Module: VBoxPuelCrypto)
00:00:00.920728 Console: Machine state changed to 'Restoring'
00:00:01.016011 VRDP: Statistics created: [full], enabled: 0.
00:00:01.016813 VRDP: VRDP: VD: Frames=10 MinMS=15 MaxMS=300 HistoryMS=2000 VideoMS=300
00:00:01.016909 VRDP: Failed to query a property: 9, rc = VERR_NOT_SUPPORTED
00:00:01.036775 VRDP: TCP server listening on port 13675 (IPv4 and IPv6).
00:00:01.037352 VRDP: Audio: rate correction mode 0x3.
00:00:01.037412 VRDE: loaded version 4 of the server.
00:00:01.037438 VRDE: [IMAGE]
00:00:01.037446 VRDE: [MOUSEPTR]
00:00:01.037453 VRDE: [SCARD]
00:00:01.037458 VRDE: [TSMFRAW]
00:00:01.037462 VRDE: [VIDEOIN]
00:00:01.037467 VRDE: [VRDE::INPUT]
00:00:01.125367 SUP: Opened VMMR0.r0 (C:\Program Files\Oracle\VirtualBox/VMMR0.r0) at 0xXXXXXXXXXXXXXXXX loaded by the native ring-0 loader.
00:00:01.125442 SUP: windbg> .reload /f C:\Program Files\Oracle\VirtualBox/VMMR0.r0=0xXXXXXXXXXXXXXXXX
00:00:01.171470 
00:00:01.171474 !!Assertion Failed!!
00:00:01.171475 Expression: RT_VALID_PTR(pVM)
00:00:01.171477 Location  : D:\tinderboxa\win-7.0\src\VBox\VMM\VMMR3\VM.cpp(602) int __cdecl vmR3CreateU(struct UVM *,unsigned int,int (__cdecl *)(struct UVM *,struct VM *,const struct VMMR3VTABLE *,void *) noexcept,void *)
00:00:01.171479 Stack     :
00:00:01.171479 00007ffe2af3f020 VBoxRT.dll!RTAssertMsg1+0xe0 (rva:0x13f020)
00:00:01.171480 00007ffe29344e24 VBoxVMM.DLL!TRPMR3ResetCpu+0x164 (rva:0xd4e24)
00:00:01.171480 00007ffe2934d425 VBoxVMM.DLL!VMR3WaitForDeviceReady+0x635 (rva:0xdd425)
00:00:01.171480 00007ffe2934e50c VBoxVMM.DLL!VMR3ReqPriorityCallWaitU+0x15c (rva:0xde50c)
00:00:01.171480 00007ffe2934c622 VBoxVMM.DLL!VMR3Teleport+0x18b2 (rva:0xdc622)
00:00:01.171480 00007ffe2af56876 VBoxRT.dll!RTThreadWaitNoResume+0x336 (rva:0x156876)
00:00:01.171480 00007ffe2b004434 VBoxRT.dll!RTSystemGetNtVersion+0x1b4 (rva:0x204434)
00:00:01.171481 00007fff245f9363 ucrtbase.dll!_recalloc+0xa3 (rva:0x29363)
00:00:01.171481 00007fff2606257d KERNEL32.DLL!BaseThreadInitThunk+0x1d (rva:0x1257d)
00:00:01.171481 00007fff270aaa58 ntdll.dll!RtlUserThreadStart+0x28 (rva:0x5aa58)
00:00:01.171481 0000000000000000
00:00:01.171482 
00:00:01.196530 pVM=0000000000000000 pVMR0=000000000000000c
00:00:01.197802 
00:00:01.197803 !!! rtR3WinUnhandledXcptFilter caught an exception on thread 0000000000003f20 in 29924 !!!
00:00:01.197814 
00:00:01.197814 ExceptionCode=0x80000003 ExceptionFlags=0x00000000 ExceptionAddress=00007ffe29344e3f
00:00:01.197820 ExceptionInformation[0]=0000000000000000
00:00:01.197826 
00:00:01.197826 cs:rip=0033:00007ffe29344e3f
00:00:01.197832 ss:rsp=002b:000000ffda8ffa00 rbp=0000000000000001
00:00:01.197837 rax=0000000000000001 rcx=4ae69112c0280000 rdx=0000000000000000 rbx=0000000000000000
00:00:01.197843 rsi=0000000000000000 rdi=000002b55e1a3000 rsp=000000ffda8ffa00 rbp=0000000000000001
00:00:01.197848 r8 =0000000000000003 r9 =000000ffda8ff958 r10=00000fffe4b964cf r11=0000000000088000
00:00:01.197852 r12=0000000000000000 r13=000002b55e0bdae0 r14=00007ffe29688f40 r15=0000000000000001
00:00:01.197857 ds=002b es=002b fs=0053 gs=002b eflags=00000202
00:00:01.197861 p1home=0000000000000001 p2home=00007fff00000002 pe3home=0000000000000000
00:00:01.197937 p4home=0000000000000000 p5home=0000000000000001 pe6home=00007fff25153c90
00:00:01.197944    LastBranchToRip=0000000000000000    LastBranchFromRip=0000000000000000
00:00:01.197949 LastExceptionToRip=0000000000000000 LastExceptionFromRip=0000000000000000
00:00:01.197956 


The VBoxHardening.log for the above failure ends with this:
74e4.737c: supR3HardenedWinVerifyCacheProcessImportTodos: 'ws2_32.dll' -> '\Device\HarddiskVolume2\Windows\System32\ws2_32.dll' [rcNtRedir=0xc0150008]
74e4.737c: supR3HardenedScreenImage/Imports: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\ws2_32.dll
74e4.737c: supR3HardenedMonitor_LdrLoadDll: pName=C:\WINDOWS\system32\mswsock.dll (rcNtResolve=0xc0150008) *pfFlags=0x0 pwszSearchPath=0000000000000001:<flags> [calling]
74e4.737c: supR3HardenedScreenImage/NtCreateSection: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\mswsock.dll
74e4.737c: supR3HardenedDllNotificationCallback: load   00007fff23940000 LB 0x00069000 C:\WINDOWS\system32\mswsock.dll [fFlags=0x0]
74e4.737c: supR3HardenedScreenImage/LdrLoadDll: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\mswsock.dll
74e4.737c: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff23940000 'C:\WINDOWS\system32\mswsock.dll'
74e4.737c: supR3HardenedScreenImage/LdrLoadDll: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\mswsock.dll
74e4.737c: supR3HardenedMonitor_LdrLoadDll: pName=C:\WINDOWS\system32\mswsock.dll (rcNtResolve=0xc0150008) *pfFlags=0x0 pwszSearchPath=0000000000000001:<flags> [calling]
74e4.737c: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff23940000 'C:\WINDOWS\system32\mswsock.dll'
74e4.3f20: supR3HardenedScreenImage/LdrLoadDll: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\ole32.dll
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: pName=C:\WINDOWS\system32\ole32.dll (rcNtResolve=0xc0150008) *pfFlags=0x0 pwszSearchPath=0000000000000001:<flags> [calling]
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff26b80000 'C:\WINDOWS\system32\ole32.dll'
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff23420000 'C:\WINDOWS\system32\rsaenh.dll'
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff246f0000 'C:\WINDOWS\System32\crypt32.dll'
74e4.3f20: KiUserExceptionDispatcher: 0x80000003 (0000000000000000) @ 00007ffe29344e3f (flags=0x0)
  rax=0000000000000001 rbx=0000000000000000 rcx=4ae69112c0280000 rdx=0000000000000000
  rsi=0000000000000000 rdi=000002b55e1a3000 r8 =0000000000000003 r9 =000000ffda8ff958
  r10=00000fffe4b964cf r11=0000000000088000 r12=0000000000000000 r13=000002b55e0bdae0
  r14=00007ffe29688f40 r15=0000000000000001  P1=0000000000000001  P2=00007fff00000002
  rip=00007ffe29344e3f rsp=000000ffda8ffa00 rbp=0000000000000001    ctxflags=0010005f
  cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b    eflags=00000202   mxcrx=00001fa0
   P3=0000000000000001  P4=0000000000000000  P5=0000000000000001  P6=00007fff25153c90
  dr0=0000000000000000 dr1=0000000000000000 dr2=0000000000000000 dr3=0000000000000000
  dr6=0000000000000000 dr7=0000000000000000 vcr=00004a194b9d3568 dcr=0000000000000000
  lbt=0000000000000000 lbf=0000000000000000 lxt=0000000000000000 lxf=0000000000000000
74e4.3f20: supR3HardenedIsApiSetDll: ApiSetQueryApiSetPresence(ext-ms-win-kernel32-errorhandling-l1-1-0.dll) -> 0x0, fPresent=1
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: pName=ext-ms-win-kernel32-errorhandling-l1-1-0.dll (rcNtResolve=0x0) *pfFlags=0x0 pwszSearchPath=0000000000000001:<flags> [calling]
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff26050000 'ext-ms-win-kernel32-errorhandling-l1-1-0.dll'
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff23420000 'C:\WINDOWS\system32\rsaenh.dll'
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff246f0000 'C:\WINDOWS\System32\crypt32.dll'
74e4.3f20: '\Device\HarddiskVolume2\Windows\System32\ntdll.dll' has no imports
74e4.3f20: supHardenedWinVerifyImageByHandle: -> 0 (\Device\HarddiskVolume2\Windows\SystVem32\ntdll.dll) WinVerifyTrust
74e4.3f20: supR3HardenedWinVerifyCacheInsert: \Device\HarddiskVolume2\Windows\System32\ntdll.dll
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: pName=C:\WINDOWS\System32\ntdll.dll (Input=ntdll.dll, rcNtResolve=0xc0150008) *pfFlags=0x0 pwszSearchPath=0000000000000801:<flags> [calling]
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff27050000 'C:\WINDOWS\System32\ntdll.dll'
74e4.3f20: supR3HardenedScreenImage/LdrLoadDll: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\KernelBase.dll
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: pName=C:\WINDOWS\System32\KernelBase.dll (Input=KernelBase, rcNtResolve=0xc0150008) *pfFlags=0x0 pwszSearchPath=0000000000000801:<flags> [calling]
74e4.3f20: supR3HardenedMonitor_LdrLoadDll: returns rcNt=0x0 hMod=00007fff24950000 'C:\WINDOWS\System32\KernelBase.dll'


What I've done so far
    Did a complete uninstall/reinstall following this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6014#48271
    Tried to fix the media registry by following this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6127


However, nothing seems to work. Anything else I can try?

ID: 49886 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2129
Credit: 160,203,932
RAC: 30,883
Message 49889 - Posted: 4 Apr 2024, 5:41:35 UTC - in response to Message 49886.  

Have you those CMS-Tasks finished before upgrading Virtualbox?
This is important.
ID: 49889 · Report as offensive     Reply Quote
doctorgonzo

Send message
Joined: 11 Sep 08
Posts: 3
Credit: 6,870,714
RAC: 1,375
Message 49894 - Posted: 4 Apr 2024, 12:48:43 UTC - in response to Message 49889.  

If you mean abort all outstanding tasks, I did when I did the complete reset/uninstall/reinstall procedure.
ID: 49894 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2129
Credit: 160,203,932
RAC: 30,883
Message 49895 - Posted: 4 Apr 2024, 13:00:30 UTC - in response to Message 49894.  

Yes, this was my question.
Now you have successful CMS-Tasks, but the Project have some trouble atm.
ID: 49895 · Report as offensive     Reply Quote
doctorgonzo

Send message
Joined: 11 Sep 08
Posts: 3
Credit: 6,870,714
RAC: 1,375
Message 49897 - Posted: 6 Apr 2024, 1:04:10 UTC

One thing I came across in my research is this thread that talks about this bug when multiple VMs are started simultaneously: https://forums.virtualbox.org/viewtopic.php?p=540029#p540029

I do have multiple CMS tasks in BOINC when this occurs. Maybe the client needs to have a way to stagger the restarts?
ID: 49897 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 811
Credit: 657,448,757
RAC: 223,379
Message 49900 - Posted: 6 Apr 2024, 11:59:40 UTC - in response to Message 49897.  
Last modified: 6 Apr 2024, 12:00:31 UTC

I don't seem to have too many issues, one of my computer has 125 VMs running at the same time at the moment. Gradually ramping up the WUs would probally be a BOINC feature, I'm not sure LHC could manage that.

The only trigger i can see is rebooting, this often seems to cause Theroy but not CMS to have a problem.

In regards to a messed up config of virtualbox I can not see what cause it but when its broke the the hanging disk images then it breaks all the work for that project.

I made my own clean up scripts for Linux, I could put them on Github and port to windows if they are helpful for other.

I made my own scripts since, for the LHC project I think the failure rate is acceptable to them, looking at my BOINC stats I have 0.98 % error rate at the moment from 14035 WUs. I can imagine that the error rate on the CERN side is less since the work inside the VM might have worked.

If I was to make the upgrade from 5.x to 7.x I would let all the work drain out then check the settings to make sure there was nothing hanging around then make the upgrade and restart the work. For a .x release I just install over the top after stopping the work.
ID: 49900 · Report as offensive     Reply Quote
Jonathan

Send message
Joined: 25 Sep 17
Posts: 99
Credit: 3,343,457
RAC: 3,497
Message 49904 - Posted: 6 Apr 2024, 20:52:24 UTC - in response to Message 49900.  

Toby, what kind of data storage do you have to be able to run that many concurrent VMs? Mine seemed to have more trouble with a few tasks running with the data on a hard drive. I recently moved my data to a SSD that I wasn't using.
ID: 49904 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 811
Credit: 657,448,757
RAC: 223,379
Message 49905 - Posted: 7 Apr 2024, 9:22:21 UTC - in response to Message 49904.  

I have nvme SSDs, nothing specical just what came in the Lenovo from the factory, its like the OEM version of the 980 PRO. It could be, however, after startup they don't normally all finish/start at the same time?

ID: 49905 · Report as offensive     Reply Quote

Message boards : CMS Application : Updated VirtualBox from 5.X to 7.X, CMS tasks keep failing.


©2024 CERN