Message boards :
ATLAS application :
VM did not power off when requested
Message board moderation
Author | Message |
---|---|
Send message Joined: 14 Sep 08 Posts: 52 Credit: 70,121,231 RAC: 126,178 ![]() ![]() ![]() |
It seems that all my ATLAS tasks waste 5 minutes at the end waiting for VM to power off. For example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=289365725 2020-11-23 15:07:03 (16632): VM Completion File Detected. 2020-11-23 15:07:03 (16632): Powering off VM. 2020-11-23 15:12:05 (16632): VM did not power off when requested. 2020-11-23 15:12:05 (16632): VM was successfully terminated. 2020-11-23 15:12:05 (16632): Deregistering VM. (boinc_683247da3d5f3b86, slot#32) I've checked a dozen results so far and they all have same message at the end. How do I debug why VM isn't powering off when requested? Given this doesn't seem to affect results at all, why can't we just terminate the VM directly? PS: I am on Windows 10 64-bit Version 20H2 (OS Build 19042.630). Virtualbox version is 6.1.16 r140961. BOINC version 7.16.11. ![]() |
Send message Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 295 ![]() ![]() |
Every Atlas-Task need 10 Minutes at the beginning and 10 Minutes at the end. In this time only one CPU is used. This is normal. |
Send message Joined: 14 Sep 08 Posts: 52 Credit: 70,121,231 RAC: 126,178 ![]() ![]() ![]() |
Yes, I realized the start and shutdown time is always this, but I am curious what the VM is doing during that 5 minutes. I mean if my computer always takes more than 5 minutes to shut down and I have to pull the plug each time, I probably would want to figure out why. Despite semi-related discussion in many threads, I am not able to find details for this specific 5 minutes. If it's not doing anything useful and given we are terminating VM at the end anyway, I wonder if we can just shorten the timeout, clean up the VM sooner and we can all get more work done. ![]() |
![]() Send message Joined: 15 Jun 08 Posts: 2628 Credit: 267,513,519 RAC: 130,623 ![]() ![]() |
The 5 min delay is hardwired in the poweroff function of BOINC's vboxwrapper. There's also a comment why: https://github.com/BOINC/boinc/blob/master/samples/vboxwrapper/vbox_vboxmanage.cpp#L1235 The delay can only be seen on Windows hosts. VMs on Linux hosts shut down immediately. Since the VDI files used by Windows and Linux are the same it might be a Windows issue, either caused by VirtualBox, by Windows' write cache or ... |
Send message Joined: 14 Sep 08 Posts: 52 Credit: 70,121,231 RAC: 126,178 ![]() ![]() ![]() |
Thanks for the code pointer and explanation. Is the kernel log of VM kept somewhere in output directory? Since this only happens on Windows, it's probably an issue with VirtualBox. I kinda want to verify whether the VM received the shutdown command at least. Given the wrapper is BOINC code, I guess there is nothing a project can do. I haven't tried other VM projects in Windows yet. Let me see if this can be reproduced for VMs from other projects too. If so, it's probably something could be discussed in the BOINC github. ![]() |
![]() Send message Joined: 15 Jun 08 Posts: 2628 Credit: 267,513,519 RAC: 130,623 ![]() ![]() |
To check whether the shutdown command is forwarded to the VM you may examine the vbox logfile. Example (Linux like pathname): <boinc_working_dir>/slots/1/boinc_d12ca06c8a326d3e/Logs/VBox.log Since the whole tree below <boinc_working_dir>/slots/1/ is removed when BOINC gets a "boinc_finish(0)" the file should be "tailed" to a save location. On Linux this would be done using the "tail" command. On Windows there's an equivalent command in Power Shell. This logfile is exactly the same that vboxwrapper polls to check for any VM status change. |
Send message Joined: 14 Sep 08 Posts: 52 Credit: 70,121,231 RAC: 126,178 ![]() ![]() ![]() |
Thanks. I grabbed output of https://lhcathome.cern.ch/lhcathome/result.php?resultid=289380496. VBox.log output: https://pastebin.com/YZyasHCf The power off actually executed successfully right away, but the task output still came after 5 minutes saying the VM failed to power off. The task log also had first state change: 2020-11-24 14:49:26 (13272): VM state change detected. (old = 'PoweredOff', new = 'Running') Clearly the poll on https://github.com/BOINC/boinc/blob/master/samples/vboxwrapper/vbox_vboxmanage.cpp#L1253 failed to see the state change to "poweredoff", and it doesn't log state change. From the log "VM did not power off when requested", it's clear "online" was still set to true after 5 minutes. I think the bug is in https://github.com/BOINC/boinc/blob/master/samples/vboxwrapper/vbox_common.cpp#L450, read_vm_log function itself. There is no more "Guest Log" in VBox.log after "02:32:35.088629 VMMDev: Guest Log: *** Success! Shutting down the machine. ***". Then if we look at the body of while loop, the first line.find(console) would have found Console: Machine state changed to 'Stopping', which still sets online to true. Then second line_pos = line.find("Guest Log:") would seek to the end of file, saving the cursor and miss out all state changes in between. I could be wrong, but if not, how is this working in Linux? Is this wrapper code here only specific to windows or the output of VBox.log slightly different? ![]() |
Send message Joined: 14 Sep 08 Posts: 52 Credit: 70,121,231 RAC: 126,178 ![]() ![]() ![]() |
|
Send message Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 295 ![]() ![]() |
Do you have the same problem of disconnection with the Virtualbox 6.1.12 from the Boinc-Homepage instead of the 6.1.16? |
![]() Send message Joined: 15 Jun 08 Posts: 2628 Credit: 267,513,519 RAC: 130,623 ![]() ![]() |
The vboxwrapper source code is used for both, Windows and Linux. I don't see an obvious reason why they behave different. Some snippets from your logfile (ATLAS task): 02:32:35.088629 VMMDev: Guest Log: *** Success! Shutting down the machine. *** 02:32:35.632242 Console: Machine state changed to 'Stopping' 02:32:35.633348 Console::powerDown(): A request to power off the VM has been issued (mMachineState=Stopping, InUninit=0) 02:32:35.633743 Changing the VM state from 'RUNNING' to 'POWERING_OFF' . . . 02:32:35.726881 GIM: KVM: Resetting MSRs 02:32:35.731356 Changing the VM state from 'DESTROYING' to 'TERMINATED' 02:32:35.732887 Console: Machine state changed to 'PoweredOff' The corresponding lines from mine (CMS task): 12:27:27.777923 VMMDev: Guest Log: [INFO] Condor exited with return value N/A. 12:27:27.847800 VMMDev: Guest Log: [INFO] Shutting Down. 12:27:28.928486 Console: Machine state changed to 'Stopping' 12:27:28.963166 Console::powerDown(): A request to power off the VM has been issued (mMachineState=Stopping, InUninit=0) 12:27:29.064137 Changing the VM state from 'RUNNING' to 'POWERING_OFF' . . . 12:27:29.160787 GIM: KVM: Resetting MSRs 12:27:29.162977 Changing the VM state from 'DESTROYING' to 'TERMINATED' 12:27:29.173701 Console: Machine state changed to 'PoweredOff' Both have the same final lines. The corresponding lines from my stderr.txt: 2020-11-25 10:36:20 (10183): Guest Log: [INFO] Condor exited with return value N/A. 2020-11-25 10:36:20 (10183): Guest Log: [INFO] Shutting Down. 2020-11-25 10:36:20 (10183): VM Completion File Detected. 2020-11-25 10:36:20 (10183): VM Completion Message: Condor exited with return value N/A. . 2020-11-25 10:36:20 (10183): Powering off VM. 2020-11-25 10:36:21 (10183): Successfully stopped VM. 2020-11-25 10:36:21 (10183): Deregistering VM. (boinc_6f590d3471c48274, slot#3) 2020-11-25 10:36:21 (10183): Removing network bandwidth throttle group from VM. 2020-11-25 10:36:21 (10183): Removing VM from VirtualBox. 10:36:27 (10183): called boinc_finish(0) |
Send message Joined: 14 Sep 08 Posts: 52 Credit: 70,121,231 RAC: 126,178 ![]() ![]() ![]() |
|
©2025 CERN