Message boards : ATLAS application : VM did not power off when requested
Message board moderation

To post messages, you must log in.

AuthorMessage
wujj123456

Send message
Joined: 14 Sep 08
Posts: 43
Credit: 48,521,452
RAC: 42,260
Message 43692 - Posted: 24 Nov 2020, 3:52:04 UTC
Last modified: 24 Nov 2020, 3:55:05 UTC

It seems that all my ATLAS tasks waste 5 minutes at the end waiting for VM to power off. For example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=289365725

2020-11-23 15:07:03 (16632): VM Completion File Detected.
2020-11-23 15:07:03 (16632): Powering off VM.
2020-11-23 15:12:05 (16632): VM did not power off when requested.
2020-11-23 15:12:05 (16632): VM was successfully terminated.
2020-11-23 15:12:05 (16632): Deregistering VM. (boinc_683247da3d5f3b86, slot#32)

I've checked a dozen results so far and they all have same message at the end. How do I debug why VM isn't powering off when requested?

Given this doesn't seem to affect results at all, why can't we just terminate the VM directly?

PS: I am on Windows 10 64-bit Version 20H2 (OS Build 19042.630). Virtualbox version is 6.1.16 r140961. BOINC version 7.16.11.
ID: 43692 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,095,661
RAC: 103,406
Message 43693 - Posted: 24 Nov 2020, 6:26:43 UTC

Every Atlas-Task need 10 Minutes at the beginning and 10 Minutes at the end.
In this time only one CPU is used. This is normal.
ID: 43693 · Report as offensive     Reply Quote
wujj123456

Send message
Joined: 14 Sep 08
Posts: 43
Credit: 48,521,452
RAC: 42,260
Message 43694 - Posted: 24 Nov 2020, 6:56:10 UTC - in response to Message 43693.  
Last modified: 24 Nov 2020, 6:57:28 UTC

Yes, I realized the start and shutdown time is always this, but I am curious what the VM is doing during that 5 minutes. I mean if my computer always takes more than 5 minutes to shut down and I have to pull the plug each time, I probably would want to figure out why.

Despite semi-related discussion in many threads, I am not able to find details for this specific 5 minutes. If it's not doing anything useful and given we are terminating VM at the end anyway, I wonder if we can just shorten the timeout, clean up the VM sooner and we can all get more work done.
ID: 43694 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,927,459
RAC: 137,774
Message 43695 - Posted: 24 Nov 2020, 7:25:52 UTC - in response to Message 43694.  

The 5 min delay is hardwired in the poweroff function of BOINC's vboxwrapper.
There's also a comment why:
https://github.com/BOINC/boinc/blob/master/samples/vboxwrapper/vbox_vboxmanage.cpp#L1235


The delay can only be seen on Windows hosts.
VMs on Linux hosts shut down immediately.
Since the VDI files used by Windows and Linux are the same it might be a Windows issue,
either caused by VirtualBox, by Windows' write cache or ...
ID: 43695 · Report as offensive     Reply Quote
wujj123456

Send message
Joined: 14 Sep 08
Posts: 43
Credit: 48,521,452
RAC: 42,260
Message 43696 - Posted: 24 Nov 2020, 17:08:04 UTC - in response to Message 43695.  

Thanks for the code pointer and explanation. Is the kernel log of VM kept somewhere in output directory? Since this only happens on Windows, it's probably an issue with VirtualBox. I kinda want to verify whether the VM received the shutdown command at least.

Given the wrapper is BOINC code, I guess there is nothing a project can do. I haven't tried other VM projects in Windows yet. Let me see if this can be reproduced for VMs from other projects too. If so, it's probably something could be discussed in the BOINC github.
ID: 43696 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,927,459
RAC: 137,774
Message 43697 - Posted: 24 Nov 2020, 18:38:42 UTC - in response to Message 43696.  

To check whether the shutdown command is forwarded to the VM you may examine the vbox logfile.
Example (Linux like pathname):
<boinc_working_dir>/slots/1/boinc_d12ca06c8a326d3e/Logs/VBox.log

Since the whole tree below <boinc_working_dir>/slots/1/ is removed when BOINC gets a "boinc_finish(0)" the file should be "tailed" to a save location.
On Linux this would be done using the "tail" command.
On Windows there's an equivalent command in Power Shell.

This logfile is exactly the same that vboxwrapper polls to check for any VM status change.
ID: 43697 · Report as offensive     Reply Quote
wujj123456

Send message
Joined: 14 Sep 08
Posts: 43
Credit: 48,521,452
RAC: 42,260
Message 43698 - Posted: 25 Nov 2020, 2:11:00 UTC - in response to Message 43697.  
Last modified: 25 Nov 2020, 2:12:16 UTC

Thanks. I grabbed output of https://lhcathome.cern.ch/lhcathome/result.php?resultid=289380496. VBox.log output: https://pastebin.com/YZyasHCf

The power off actually executed successfully right away, but the task output still came after 5 minutes saying the VM failed to power off. The task log also had first state change:
2020-11-24 14:49:26 (13272): VM state change detected. (old = 'PoweredOff', new = 'Running')


Clearly the poll on https://github.com/BOINC/boinc/blob/master/samples/vboxwrapper/vbox_vboxmanage.cpp#L1253 failed to see the state change to "poweredoff", and it doesn't log state change. From the log "VM did not power off when requested", it's clear "online" was still set to true after 5 minutes.

I think the bug is in https://github.com/BOINC/boinc/blob/master/samples/vboxwrapper/vbox_common.cpp#L450, read_vm_log function itself.

There is no more "Guest Log" in VBox.log after "02:32:35.088629 VMMDev: Guest Log: *** Success! Shutting down the machine. ***". Then if we look at the body of while loop, the first line.find(console) would have found Console: Machine state changed to 'Stopping', which still sets online to true. Then second line_pos = line.find("Guest Log:") would seek to the end of file, saving the cursor and miss out all state changes in between.

I could be wrong, but if not, how is this working in Linux? Is this wrapper code here only specific to windows or the output of VBox.log slightly different?
ID: 43698 · Report as offensive     Reply Quote
wujj123456

Send message
Joined: 14 Sep 08
Posts: 43
Credit: 48,521,452
RAC: 42,260
Message 43699 - Posted: 25 Nov 2020, 2:18:32 UTC - in response to Message 43698.  

Nvm, that parsing is just for that line, so it's not wrong. It's not clear how it's not finding all the state transition logs then. It should have parsed line-by-line and return last state... :-(
ID: 43699 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,095,661
RAC: 103,406
Message 43700 - Posted: 25 Nov 2020, 8:10:24 UTC - in response to Message 43699.  

Do you have the same problem of disconnection with the Virtualbox 6.1.12 from the Boinc-Homepage instead of the 6.1.16?
ID: 43700 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,927,459
RAC: 137,774
Message 43701 - Posted: 25 Nov 2020, 10:32:26 UTC - in response to Message 43698.  

The vboxwrapper source code is used for both, Windows and Linux.
I don't see an obvious reason why they behave different.


Some snippets from your logfile (ATLAS task):
02:32:35.088629 VMMDev: Guest Log:  *** Success! Shutting down the machine. ***
02:32:35.632242 Console: Machine state changed to 'Stopping'
02:32:35.633348 Console::powerDown(): A request to power off the VM has been issued (mMachineState=Stopping, InUninit=0)
02:32:35.633743 Changing the VM state from 'RUNNING' to 'POWERING_OFF'
.
.
.
02:32:35.726881 GIM: KVM: Resetting MSRs
02:32:35.731356 Changing the VM state from 'DESTROYING' to 'TERMINATED'
02:32:35.732887 Console: Machine state changed to 'PoweredOff'



The corresponding lines from mine (CMS task):
12:27:27.777923 VMMDev: Guest Log: [INFO] Condor exited with return value N/A.
12:27:27.847800 VMMDev: Guest Log: [INFO] Shutting Down.
12:27:28.928486 Console: Machine state changed to 'Stopping'
12:27:28.963166 Console::powerDown(): A request to power off the VM has been issued (mMachineState=Stopping, InUninit=0)
12:27:29.064137 Changing the VM state from 'RUNNING' to 'POWERING_OFF'
.
.
.
12:27:29.160787 GIM: KVM: Resetting MSRs
12:27:29.162977 Changing the VM state from 'DESTROYING' to 'TERMINATED'
12:27:29.173701 Console: Machine state changed to 'PoweredOff'

Both have the same final lines.


The corresponding lines from my stderr.txt:
2020-11-25 10:36:20 (10183): Guest Log: [INFO] Condor exited with return value N/A.
2020-11-25 10:36:20 (10183): Guest Log: [INFO] Shutting Down.
2020-11-25 10:36:20 (10183): VM Completion File Detected.
2020-11-25 10:36:20 (10183): VM Completion Message: Condor exited with return value N/A.
.
2020-11-25 10:36:20 (10183): Powering off VM.
2020-11-25 10:36:21 (10183): Successfully stopped VM.
2020-11-25 10:36:21 (10183): Deregistering VM. (boinc_6f590d3471c48274, slot#3)
2020-11-25 10:36:21 (10183): Removing network bandwidth throttle group from VM.
2020-11-25 10:36:21 (10183): Removing VM from VirtualBox.
10:36:27 (10183): called boinc_finish(0)
ID: 43701 · Report as offensive     Reply Quote
wujj123456

Send message
Joined: 14 Sep 08
Posts: 43
Credit: 48,521,452
RAC: 42,260
Message 43712 - Posted: 25 Nov 2020, 16:55:26 UTC - in response to Message 43700.  

Do you have the same problem of disconnection with the Virtualbox 6.1.12 from the Boinc-Homepage instead of the 6.1.16?

I haven't tried. Once the current pending work finishes I can switch to 6.1.12 and check if it helps.
ID: 43712 · Report as offensive     Reply Quote

Message boards : ATLAS application : VM did not power off when requested


©2024 CERN