Thread 'Checkpointing?'

Author	Message
keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,602,669 RAC: 0	Message 43393 - Posted: 22 Sep 2020, 19:22:22 UTC Last modified: 22 Sep 2020, 19:22:36 UTC
Can anybody tell me why a program would take regular checkpoints and then ignore them on restart and start from scratch?
For ATLAS, BOINC Manager/Properties shows "time since last checkpoint," which resets every two minutes.
I shut down BOINC to apply some Windows service.
On restart, as soon as VBox Manager has initialized and BOINC monitoring starts, it correctly shows a certain number of events completed.
Then at some later point, it resets to zero and starts at event 1 again???

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,083,482 RAC: 10,513	Message 43394 - Posted: 22 Sep 2020, 19:53:35 UTC - in response to Message 43393. Last modified: 22 Sep 2020, 19:55:18 UTC
Checkpointing is a BOINC function that requires special communication between the BOINC client and the app. In this case the app is the vboxwrapper, which controls the VM but does no calculation itself. Hence, checkpointing the vboxwrapper would be useless.
What you are asking for is a checkpoint of the processes inside the VM. Depending on the internal state of the VM, it does some kind of checkpointing, but that is not what you would expect from the BOINC perspective.
If you are referring to the ATLAS monitoring on console 2 (ALT-F2): this is a process inside the VM that gets its data from a copy of the original ATLAS logs. After a VM restart or resume, the old logfile copy may still be present for a while. To restart the monitoring from scratch, press CTRL-C on console 2.
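
As a rough illustration of what checkpointing means when the app itself does the calculation, here is a minimal Python sketch. It is not the BOINC API and not vboxwrapper code; the file name, function names and the event loop are all hypothetical. The point is only that the app writes its own progress to disk and reads it back on restart, which the vboxwrapper cannot do for processes running inside the VM.

```python
import json
import os
import tempfile

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name


def load_checkpoint(path):
    """Return the last checkpointed event number, or 0 if there is none."""
    try:
        with open(path) as f:
            return json.load(f)["events_done"]
    except (FileNotFoundError, KeyError, ValueError):
        return 0


def save_checkpoint(path, events_done):
    """Write to a temp file first, then rename, so a crash cannot leave
    a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"events_done": events_done}, f)
    os.replace(tmp, path)


def run(path, total_events, checkpoint_every=5, stop_after=None):
    """Process events, resuming after the last checkpoint.

    stop_after simulates a shutdown in the middle of the task.
    Returns the list of event numbers processed in this run.
    """
    processed = []
    start = load_checkpoint(path)
    for event in range(start + 1, total_events + 1):
        processed.append(event)  # placeholder for the real work
        if event % checkpoint_every == 0:
            save_checkpoint(path, event)
        if stop_after is not None and event == stop_after:
            break
    return processed
```

If the run is interrupted between two checkpoints, only the events since the last checkpoint are repeated on restart, instead of the whole task.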

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1530 Credit: 10,024,978 RAC: 1,494	Message 43396 - Posted: 23 Sep 2020, 11:16:00 UTC - in response to Message 43393. Last modified: 23 Sep 2020, 11:21:50 UTC
> For ATLAS, BOINC manager/properties shows "time since last checkpoint," which resets every two minutes
Only the checkpoint file is updated ("writing the progress and used cpu-seconds"). To checkpoint the science, the VM would have to be paused and its state written to disk. This is still possible with the vboxwrapper, but LHC decided not to use it because it causes a lot of disk reads/writes, especially when several vbox tasks are running. Instead, the save to disk is only used when the BOINC client is stopped. This can take a while, and the interruption should not take too long, because ATLAS (and CMS) want a continuous connection to the network and BOINC expects the save to be done within a minute (otherwise the VM state is "aborted").
For Theory this works OK. You may see this in a Theory result of mine, where you will find 'creating snapshots' (checkpoints). https://lhcathome.cern.ch/lhcathome/result.php?resultid=283464723
> On restart as soon as VBox manager initialized and BOINC monitoring starts, it correctly shows a certain number of events completed Then at some later point, it resets to zero and starts at event 1 again ???
After the startup, the old log lines are shown at first, before the job restarts.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1530 Credit: 10,024,978 RAC: 1,494	Message 43399 - Posted: 24 Sep 2020, 5:58:39 UTC - in response to Message 43396.
> For Theory this works OK. You may see this in a Theory result of mine, where you discover 'creating snapshots' (checkpoints).
I did the same for an ATLAS task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=284558244 with a checkpoint interval of 20 minutes.
However, for this to work without my intervention, rsc_disk_bound has to be increased server-side, because a snapshot is only deleted after a new one has been created. At some point, for a short moment, the slot contains 3 vdi files next to the other files. All files in that slot together then exceed the 10000000000 bytes currently allowed.
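
To see whether a slot is getting close to that limit, one could add up the sizes of all files in the slot directory and compare the total with the bound. The following is only a sketch under assumptions: the function names are hypothetical, the slot path has to be supplied by the user, and the way the BOINC client itself measures disk usage may differ.

```python
import os

RSC_DISK_BOUND = 10_000_000_000  # bytes, the limit quoted in the post above


def slot_usage(slot_dir):
    """Sum the sizes in bytes of all files below slot_dir."""
    total = 0
    for root, _dirs, files in os.walk(slot_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def exceeds_bound(slot_dir, bound=RSC_DISK_BOUND):
    """True if the files in the slot together are larger than the bound."""
    return slot_usage(slot_dir) > bound
```

Calling exceeds_bound() on the slot directory while the third vdi file exists would show whether the short overlap of old and new snapshot is what pushes the task over the limit.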

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1530 Credit: 10,024,978 RAC: 1,494	Message 43400 - Posted: 24 Sep 2020, 7:22:22 UTC - in response to Message 43399.
With the next ATLAS task, I tested it the hard way. After the 2nd snapshot (10 events done) I rebooted the system while the ATLAS task was running and without stopping BOINC. After the reboot the status of the VM is aborted, so without the snapshot the task would start from scratch. Now, without any intervention, the snapshot was restored and the task resumed running from the 2nd snapshot without much loss.
I'll report the status of the result after the task has finished. This will take another ~8 hours (now 15 events done).
Result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=284624356
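
The behaviour described in this thread can be summarised as a small decision rule. This is only a sketch of the observations above, not the vboxwrapper's actual logic; the function name, the returned strings and the snapshot names are hypothetical.

```python
def restart_action(vm_state, snapshots):
    """Decide what happens to a task when BOINC starts again.

    vm_state:  VM state found on startup, e.g. "saved" or "aborted"
    snapshots: list of snapshot names, oldest first (may be empty)
    """
    if vm_state == "saved":
        # BOINC was stopped cleanly and the VM state was written to disk.
        return "resume saved state"
    if snapshots:
        # No usable saved state (e.g. "aborted" after a hard reboot),
        # but a snapshot exists: continue from the most recent one.
        return "restore snapshot " + snapshots[-1]
    # Neither a saved state nor a snapshot: all progress is lost.
    return "start from scratch"
```

In the reboot test above, the VM state was "aborted" and two snapshots existed, so this rule gives a restore from the 2nd snapshot; with no snapshots it gives a start from scratch.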