Message boards : ATLAS application : ATLAS using VirtualBox with snapshots

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,800
RAC: 1,930
Message 44768 - Posted: 19 Apr 2021, 14:03:59 UTC
Last modified: 20 Apr 2021, 6:46:51 UTC

ATLAS (and CMS) tasks are generally expected to run without interruption.
When a task is not kept in memory and gets suspended (by the user, a BOINC restart, or a system shutdown/reboot),
saving the VM state to disk often leaves the VM in a stopped or aborted state.
In those cases the VM cannot be restored from the saved state when the task is resumed or restarted.

I adjusted the ATLAS vbox_job.xml (the same can be done with the CMS one).
I removed the lines that disable checkpoints and enable VM savestate, plus the 2 heartbeat lines (the heartbeat check mostly does more harm than good).

Example of the vbox_job.xml file
<vbox_job>
  <os_name>Linux26_64</os_name>
  <memory_size_mb>2241</memory_size_mb>
  <enable_network/>
  <enable_remotedesktop/>
  <enable_shared_directory/>
  <copy_to_shared>init_data.xml</copy_to_shared>
  <completion_trigger_file>atlas_done</completion_trigger_file>
  <minimum_checkpoint_interval>1200</minimum_checkpoint_interval>
  <pf_guest_port>80</pf_guest_port>
</vbox_job>

Note that I added a minimum checkpoint interval line (1200 seconds, i.e. 20 minutes). In my example a snapshot is written to disk every 20 minutes.
On my slow system this write takes about 30 seconds. Previous snapshots are deleted by vbox_wrapper.
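For anyone who prefers to script the change instead of editing by hand, here is a minimal Python sketch of the edit described above. The function name `enable_snapshots` is hypothetical, and the tag names in `TAGS_TO_REMOVE` are my assumption of what the removed lines are called; check them against your own vbox_job.xml before using this.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: these are the tag names of the lines the post describes removing
# (disable checkpoints, enable VM savestate, two heartbeat lines).
# Verify them against your own vbox_job.xml.
TAGS_TO_REMOVE = (
    "disable_automatic_checkpoints",
    "enable_vm_savestate_usage",
    "heartbeat_filename",
    "minimum_heartbeat_interval",
)


def enable_snapshots(xml_text, interval_s=1200, tags_to_remove=TAGS_TO_REMOVE):
    """Return vbox_job XML with the given tags removed and a checkpoint interval set."""
    root = ET.fromstring(xml_text)

    # Drop every direct child of <vbox_job> whose tag is in the removal list.
    for tag in tags_to_remove:
        for elem in root.findall(tag):
            root.remove(elem)

    # Add the checkpoint interval line, or update it if it already exists.
    node = root.find("minimum_checkpoint_interval")
    if node is None:
        node = ET.SubElement(root, "minimum_checkpoint_interval")
    node.text = str(interval_s)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<vbox_job><os_name>Linux26_64</os_name>"
        "<disable_automatic_checkpoints/>"
        "<heartbeat_filename>heartbeat</heartbeat_filename></vbox_job>"
    )
    print(enable_snapshots(sample))
```

The sketch works on a string, so you can test it on a copy of the file first and only then write the result back.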

When a task is suspended for whatever reason, the VM is simply powered off.
After a resume, the VM is restored from the last (and only) snapshot.

You can study my ATLAS example task to see the creation of snapshots, the long interruptions, the successful VM restores and finally a successful, valid task.

Edit: When a CMS task is paused for a longer period, the resumed CMS job will be stopped after 20 minutes and a new cmsrun will start.
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 44774 - Posted: 20 Apr 2021, 20:11:45 UTC - in response to Message 44768.  

Thanks a lot for this, I will try it out on the dev project soon. I agree that suspend/resume should work as intended for ATLAS tasks.

I do worry however that the snapshots every 20 mins may have an impact on the task efficiency. Saving state to disk can take quite a long time as several GB are written to disk. Do you have some numbers for similar tasks with and without the checkpoints to show the difference?
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,800
RAC: 1,930
Message 44775 - Posted: 20 Apr 2021, 20:51:32 UTC - in response to Message 44774.  

Hello David,

On my slow system the checkpoint sequence takes about 30 seconds. Most systems will be faster.
The user can increase the checkpoint interval by setting the BOINC preference 'request tasks to checkpoint at most every ... seconds' to a higher value.

In the result output you find
Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 120 seconds) or (Vbox_job.xml: 1200 seconds))

The 120 s is a BOINC setting (changeable by the user) for the minimum interval and is superseded by the 1200 s from vbox_job.xml. You could also change the 1200 s to 1800 s in vbox_job.xml; the value is arbitrary.
The biggest advantage for the average user would be that ATLAS can be suspended for longer periods (overnight shutdowns) without keeping the task in memory, and that a task is not killed by sudden system restarts such as those caused by Windows Update.
BOINC can also suspend an ATLAS task when a high-priority task (short deadline) needs the core(s).
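To put a rough number on the efficiency question, here is a small Python sketch of the two rules discussed above. The function names are hypothetical. The overhead estimate assumes the VM makes no progress while a snapshot is written and that one snapshot is taken per interval; the 30 s figure comes from a single slow system, so treat the result as an upper-end guess rather than a measurement.

```python
def effective_checkpoint_interval(preference_s, vbox_job_s):
    """Interval actually used: the higher of the BOINC preference and the
    vbox_job.xml value, as stated in the result output quoted above."""
    return max(preference_s, vbox_job_s)


def snapshot_overhead_fraction(write_s, interval_s):
    """Rough share of wall time spent writing snapshots.

    ASSUMPTION: the VM is stalled for the whole write and exactly one
    snapshot is written per interval.
    """
    return write_s / interval_s


if __name__ == "__main__":
    interval = effective_checkpoint_interval(120, 1200)
    overhead = snapshot_overhead_fraction(30, interval)
    print(f"interval: {interval} s, estimated overhead: {overhead:.1%}")
```

With the values from this thread (30 s write, 1200 s interval) the estimate is about 2.5% of wall time, and raising the interval to 1800 s would bring it to roughly 1.7% under the same assumptions.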


