Thread 'ATLAS using VirtualBox with snapshots'

Crystal Pellet · Volunteer moderator · Volunteer tester · Joined: 14 Jan 10 · Posts: 1531 · Credit: 10,031,735 · RAC: 1,241
Message 44768 - Posted: 19 Apr 2021, 14:03:59 UTC - Last modified: 20 Apr 2021, 6:46:51 UTC

There is a conviction that ATLAS (and CMS) tasks should run without interruption. When a task is not kept in memory and gets suspended (by the user, a BOINC restart, or a system shutdown/reboot), saving the VM state to disk often results in a stopped / aborted VM state. In those cases the VM cannot be restored from the saved state when the task is resumed or restarted.

I adjusted the ATLAS vbox_job.xml (or the CMS one). I removed the lines about disabling checkpoints and enabling vm-savestate, and the 2 lines about heartbeat (the heartbeat check mostly hurts more than it helps).

Example of the vbox_job.xml file:

    <vbox_job>
      <os_name>Linux26_64</os_name>
      <memory_size_mb>2241</memory_size_mb>
      <enable_network/>
      <enable_remotedesktop/>
      <enable_shared_directory/>
      <copy_to_shared>init_data.xml</copy_to_shared>
      <completion_trigger_file>atlas_done</completion_trigger_file>
      <minimum_checkpoint_interval>1200</minimum_checkpoint_interval>
      <pf_guest_port>80</pf_guest_port>
    </vbox_job>

You can see that I added a checkpoint interval line (1200 seconds = 20 minutes). In my example a snapshot is written to disk every 20 minutes. On my slow system this write takes about 30 seconds. Previous snapshots are deleted by vbox_wrapper, so only one snapshot exists at a time.

When a task is suspended for whatever reason, the VM is simply powered off. After a resume the VM is restored from the last (and only) snapshot.

You may study my ATLAS example task and see the creation of snapshots, the long interruptions, the successful VM restores and finally a successful, valid task.

Edit: When a CMS task is paused for a longer period, the resumed cms-job will be stopped after 20 minutes and a new cmsrun will start.
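For anyone who wants to check which settings an edited vbox_job.xml actually carries, here is a minimal Python sketch. It only parses the example file quoted in the post above; the helper names `read_checkpoint_interval` and `enabled_flags` are made up for illustration and are not part of BOINC or vbox_wrapper.

```python
import xml.etree.ElementTree as ET

# The example vbox_job.xml from the post, embedded as a string so the
# sketch is self-contained. In practice you would read the file from disk.
VBOX_JOB_XML = """<vbox_job>
  <os_name>Linux26_64</os_name>
  <memory_size_mb>2241</memory_size_mb>
  <enable_network/>
  <enable_remotedesktop/>
  <enable_shared_directory/>
  <copy_to_shared>init_data.xml</copy_to_shared>
  <completion_trigger_file>atlas_done</completion_trigger_file>
  <minimum_checkpoint_interval>1200</minimum_checkpoint_interval>
  <pf_guest_port>80</pf_guest_port>
</vbox_job>"""


def read_checkpoint_interval(xml_text):
    """Return <minimum_checkpoint_interval> in seconds, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find("minimum_checkpoint_interval")
    if node is None or node.text is None:
        return None
    return int(node.text.strip())


def enabled_flags(xml_text):
    """Return the sorted names of empty flag elements such as <enable_network/>."""
    root = ET.fromstring(xml_text)
    return sorted(
        child.tag
        for child in root
        if len(child) == 0 and not (child.text or "").strip()
    )


if __name__ == "__main__":
    interval = read_checkpoint_interval(VBOX_JOB_XML)
    print(f"checkpoint interval: {interval} s")
    print("flags:", ", ".join(enabled_flags(VBOX_JOB_XML)))
```

A file without the `<minimum_checkpoint_interval>` line makes `read_checkpoint_interval` return `None`, which is a quick way to spot a job file that has not been edited yet.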

David Cameron · Project administrator · Project developer · Project scientist · Joined: 13 May 14 · Posts: 387 · Credit: 15,314,184 · RAC: 0
Message 44774 - Posted: 20 Apr 2021, 20:11:45 UTC - in response to Message 44768.

Thanks a lot for this, I will try it out on the dev project soon. I agree that suspend/resume should work as intended for ATLAS tasks.

I do worry, however, that snapshots every 20 minutes may have an impact on task efficiency. Saving state to disk can take quite a long time, as several GB are written to disk. Do you have some numbers for similar tasks with and without the checkpoints to show the difference?

Crystal Pellet · Volunteer moderator · Volunteer tester
Message 44775 - Posted: 20 Apr 2021, 20:51:32 UTC - in response to Message 44774.

Hello David,

On my slow system the checkpoint sequence lasts about 30 seconds. Most systems will be faster. The user may increase the checkpoint interval by setting BOINC's 'request tasks to checkpoint at most every ... seconds' preference to a higher value.

In the result output you find:

    Setting checkpoint interval to 1200 seconds. (Higher value of (Preference: 120 seconds) or (Vbox_job.xml: 1200 seconds))

The 120 s is BOINC's setting (changeable by the user) for the minimum interval, and it is superseded by the 1200 s from vbox_job.xml. You could also set the 1200 s to 1800 s in vbox_job.xml. That's arbitrary.

The biggest advantage for the average user would be that ATLAS can be suspended for longer periods (overnight shutdowns) without keeping the task in memory, and that a task is not killed by sudden system restarts such as Windows Update. BOINC can also suspend an ATLAS task when a high-priority task (short deadline) needs the core(s).
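To put rough numbers on this, here is a small Python sketch of the "higher value wins" rule from the log line, plus a back-of-envelope overhead estimate. The function names are hypothetical. The overhead formula is an assumption (one snapshot per interval, with the VM making no progress while it is written), not a measurement. The only inputs taken from this thread are 120 s, 1200 s, 1800 s and the roughly 30 s write time.

```python
def effective_checkpoint_interval(preference_s, vbox_job_s):
    """Pick the higher of the BOINC preference and the vbox_job.xml value,
    mirroring the 'Higher value of ...' message in the result output."""
    return max(preference_s, vbox_job_s)


def snapshot_overhead_fraction(snapshot_write_s, interval_s):
    """Rough share of run time spent writing snapshots.

    Assumption: one snapshot per interval and no useful work is done
    while the snapshot is being written.
    """
    return snapshot_write_s / interval_s


if __name__ == "__main__":
    interval = effective_checkpoint_interval(preference_s=120, vbox_job_s=1200)
    print(f"effective interval: {interval} s")

    # Compare the 1200 s setting with the 1800 s alternative mentioned above.
    for candidate_s in (interval, 1800):
        share = snapshot_overhead_fraction(snapshot_write_s=30, interval_s=candidate_s)
        print(f"interval {candidate_s} s -> about {share:.1%} of run time in snapshots")
```

Under these assumptions a 30-second write every 1200 seconds costs about 2.5% of run time. A user who raises the BOINC preference above 1200 s gets that higher value as the effective interval, which lowers the share further.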