Message boards : ATLAS application : Checkpointing?

keputnam

Joined: 27 Sep 04
Posts: 94
Credit: 3,758,301
RAC: 4,865
Message 43393 - Posted: 22 Sep 2020, 19:22:22 UTC
Last modified: 22 Sep 2020, 19:22:36 UTC

Can anybody tell me why a program would take regular checkpoints and then ignore them on restart and start from scratch? For ATLAS, BOINC manager/properties shows "time since last checkpoint," which resets every two minutes.

I shut down BOINC to apply some Windows service

On restart, as soon as VBox manager has initialized and BOINC monitoring starts, it correctly shows a certain number of events completed

Then at some later point, it resets to zero and starts at event 1 again


???
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 1966
Credit: 140,116,625
RAC: 87,017
Message 43394 - Posted: 22 Sep 2020, 19:53:35 UTC - in response to Message 43393.  
Last modified: 22 Sep 2020, 19:55:18 UTC

Checkpointing is a BOINC functionality and requires special communication between the BOINC client and the app.
In this case the app is the vboxwrapper which controls the VM but does no calculation.
Hence, checkpointing the vboxwrapper would be useless.

What you request is to checkpoint the processes inside the VM.
Depending on the internal state of the VM, it does some kind of checkpointing, but that's not the same as what you would expect from the BOINC perspective.
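
To illustrate what checkpointing means from the BOINC perspective, here is a minimal Python sketch of an app that saves its own progress at intervals and resumes from it after a restart. This is not the BOINC API, the vboxwrapper or ATLAS code; the file layout, the function names and the event counts are all assumptions made up for illustration.

```python
import json
import os


def load_checkpoint(path):
    """Return the index of the next event to process, or 0 if there is no usable checkpoint."""
    try:
        with open(path) as f:
            return json.load(f)["next_event"]
    except (FileNotFoundError, ValueError, KeyError):
        return 0


def save_checkpoint(path, next_event):
    """Write the checkpoint to a temp file and rename it, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_event": next_event}, f)
    os.replace(tmp, path)


def run(path, total_events, checkpoint_every, stop_after=None):
    """Process events, checkpointing after every `checkpoint_every` events.

    `stop_after` simulates a shutdown after that many events in this run.
    Returns the list of event indices processed in this run.
    """
    start = load_checkpoint(path)
    processed = []
    for event in range(start, total_events):
        if stop_after is not None and len(processed) >= stop_after:
            break
        processed.append(event)  # placeholder for the real calculation
        if (event + 1) % checkpoint_every == 0:
            save_checkpoint(path, event + 1)
    return processed
```

Calling `run` once with `stop_after` set and then again without it shows the idea: the second call continues from the last saved event instead of event 0, and only the events done after that checkpoint are repeated. An app that does no calculation itself has no such state to save.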


If you are referring to the ATLAS monitoring at console 2 (ALT-F2):
This is a process inside the VM that gets its data from a copy of the original ATLAS logs.
After a VM restart or resume the old logfile copy may still be present for a while.
To restart the monitoring from scratch, press CTRL-C at console 2.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1094
Credit: 6,832,351
RAC: 605
Message 43396 - Posted: 23 Sep 2020, 11:16:00 UTC - in response to Message 43393.  
Last modified: 23 Sep 2020, 11:21:50 UTC

For ATLAS, BOINC manager/properties shows "time since last checkpoint," which resets every two minutes
Only the checkpoint file is updated: "writing the progress and used cpu-seconds"
To checkpoint the science, the VM would have to be paused and its state written to disk.
This is still possible with the vbox_wrapper, but LHC decided not to use it because it causes a lot of disk reads and writes, especially when you are running several vbox tasks.
Instead, the save to disk is only used when BOINC's client is stopped.
This can take a while, and the interruption should not last too long, because ATLAS (and CMS) want a continuous connection to the network and BOINC expects the save to be done within a minute (otherwise the VM state is "aborted").
For Theory this works OK. You may see this in a Theory result of mine, where you discover 'creating snapshots' (checkpoints). https://lhcathome.cern.ch/lhcathome/result.php?resultid=283464723

On restart, as soon as VBox manager has initialized and BOINC monitoring starts, it correctly shows a certain number of events completed

Then at some later point, it resets to zero and starts at event 1 again

???

After startup, the old log lines are shown at first, before the job restarts.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1094
Credit: 6,832,351
RAC: 605
Message 43399 - Posted: 24 Sep 2020, 5:58:39 UTC - in response to Message 43396.  

For Theory this works OK. You may see this in a Theory result of mine, where you discover 'creating snapshots' (checkpoints).
I did the same for an ATLAS-task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=284558244 with a checkpoint interval of 20 minutes.
However, for this to work without my intervention, the rsc_disk_bound has to be increased server-side, because an old snapshot is only deleted after the new one exists.
For a short moment the slot then holds 3 vdi files next to its other files, and all files in that slot together will exceed the 10000000000 bytes allowed now.
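
The disk arithmetic can be sketched roughly as follows in Python. Only the 10000000000-byte limit comes from the post above; the individual file sizes are made-up placeholders, not measured values, and treating the three vdi files as a base image plus an old and a new snapshot is an assumption.

```python
RSC_DISK_BOUND = 10_000_000_000  # bytes, the limit quoted in the post

# Hypothetical sizes in bytes, chosen only to illustrate the effect.
BASE_VDI = 3_500_000_000
SNAPSHOT_VDI = 3_000_000_000
OTHER_FILES = 800_000_000


def slot_usage(vdi_sizes, other_files):
    """Total bytes used in the slot: all vdi files plus everything else."""
    return sum(vdi_sizes) + other_files


def exceeds_bound(vdi_sizes, other_files, bound=RSC_DISK_BOUND):
    """True if the slot as a whole is over the allowed disk bound."""
    return slot_usage(vdi_sizes, other_files) > bound


# Normal case: base image plus one snapshot.
steady = [BASE_VDI, SNAPSHOT_VDI]
# Short overlap: the new snapshot exists before the old one is deleted.
overlap = [BASE_VDI, SNAPSHOT_VDI, SNAPSHOT_VDI]
```

With these placeholder numbers the steady state stays under the bound, while the short overlap with three vdi files goes over it, which is why a higher server-side limit would be needed.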
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1094
Credit: 6,832,351
RAC: 605
Message 43400 - Posted: 24 Sep 2020, 7:22:22 UTC - in response to Message 43399.  

With the next ATLAS task, I tested it the hard way.
After the 2nd snapshot (10 events done) I rebooted the system with the ATLAS task running and without stopping BOINC.
After the reboot the status of the VM was aborted, so without the snapshot the task would have started from scratch.
Now, without any intervention, the snapshot was restored and the task resumed running from the 2nd snapshot without much loss.
I'll report the status of the result after the task has finished. This will take another ~8 hours (now 15 events done). Result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=284624356
