Message boards : ATLAS application : Handle Vbox resumes
Guiri-One[Andalucia]

Joined: 1 Feb 06
Posts: 66
Credit: 9,723
RAC: 0
Message 44110 - Posted: 15 Jan 2021, 8:25:50 UTC

Hi team,

I am running ATLAS on Windows with a real VirtualBox installation. I can see the monitoring screen (Ctrl-Alt-F2) and it progresses as it should.

However, I have no clue how to handle the situation when I need to restart my system. I need to do it "often", but I don't want to lose already computed events.

Which is the safest way? I have tried "Suspend" from BOINC Manager, but when the task comes back I see the events restarted (or at least the "already finished" count starts from 0).

Thanks in advance!

Javi F
maeax

Joined: 2 May 07
Posts: 1557
Credit: 57,671,115
RAC: 203,264
Message 44111 - Posted: 15 Jan 2021, 12:15:52 UTC - in response to Message 44110.  

Guiri-One[Andalucia]

Joined: 1 Feb 06
Posts: 66
Credit: 9,723
RAC: 0
Message 44112 - Posted: 15 Jan 2021, 12:26:53 UTC - in response to Message 44111.  

djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 44113 - Posted: 15 Jan 2021, 14:42:15 UTC - in response to Message 44110.  

It is not possible to pause ATLAS tasks. One task runs 200 events. If it is interrupted and resumed it will restart at zero.
Harri Liljeroos
Joined: 28 Sep 04
Posts: 585
Credit: 33,206,113
RAC: 18,479
Message 44114 - Posted: 15 Jan 2021, 15:56:13 UTC

My experience is that you can resume ATLAS tasks after a computer reboot, at least on Windows. First I suspend all tasks that are not yet running, so that they are not started when I begin suspending the running tasks. Then I suspend the running tasks. In preferences I have selected 'Leave non-GPU tasks in memory while suspended', so suspending them does not save them to disk. You can view the task status in VBox Manager; after the running tasks are suspended they will show 'Paused'. Now you can exit and stop BOINC. When all running tasks are in the 'Saved' state you should be safe. On my faster computer 'ProgramData/Boinc' is on an SSD, so it is fast enough to save the tasks before BOINC actually shuts down (saving them all may take up to a minute). After that, when I am ready to reboot, I manually kill the 'Virtualbox Interface' process from Task Manager, as it seems to hang and prevent the restart even after I have closed VBox Manager. It may close by itself if you wait long enough. Then I just restart the computer.

After restarting the computer my BOINC autostarts, but all my tasks are suspended. Now I resume them one by one, starting with the one that had progressed furthest. I monitor the disk activity and VBox Manager to see when a task is safely running again before resuming another task. When all tasks that were running before the restart are safely running, I resume the rest of the tasks and everything is back to normal.

I followed the above just this week when Win 10 got an update, and all tasks (ATLAS, Theory and even CMS) seem to have survived. But if for some reason they do restart from the beginning, remember that you get credit according to task runtime, and you get increased credit if the task finishes successfully. Here is an example of a CMS task that was restarted after about 9000 seconds of runtime: https://lhcathome.cern.ch/lhcathome/result.php?resultid=292888006 and here is one ATLAS task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=294511025
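
For anyone who would rather script the "is everything saved yet?" check than watch VBox Manager by hand, here is a rough Python sketch of the idea. It assumes VBoxManage is on the PATH and relies on the VMState line that `VBoxManage showvminfo <vm> --machinereadable` prints. The function names (parse_vm_state, query_vm_state, safe_to_reboot, resume_order) are made up for this illustration and are not part of BOINC or VirtualBox, and the (name, fraction done) pairs are something you would have to fill in yourself from BOINC Manager.

```python
import re
import subprocess
from typing import Iterable, List, Optional, Tuple


def parse_vm_state(showvminfo_output: str) -> Optional[str]:
    """Extract the VMState value from machine-readable showvminfo text."""
    match = re.search(r'^VMState="([^"]+)"', showvminfo_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def query_vm_state(vm_name: str) -> Optional[str]:
    """Ask VirtualBox for one VM's state (assumes VBoxManage is on the PATH)."""
    result = subprocess.run(
        ["VBoxManage", "showvminfo", vm_name, "--machinereadable"],
        capture_output=True, text=True, check=True,
    )
    return parse_vm_state(result.stdout)


def safe_to_reboot(states: Iterable[Optional[str]]) -> bool:
    """True only when every VM that was running now reports 'saved'."""
    return all(state == "saved" for state in states)


def resume_order(tasks: List[Tuple[str, float]]) -> List[str]:
    """Order task names so the most progressed task is resumed first.

    Each entry is a hypothetical (task name, fraction done) pair.
    """
    return [name for name, _ in sorted(tasks, key=lambda t: t[1], reverse=True)]
```

For example, resume_order([("task_a", 0.2), ("task_b", 0.9)]) puts task_b first, which matches resuming the most progressed task before the others. Treat this as a convenience on top of the manual procedure, not a replacement for watching the disk activity.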
Guiri-One[Andalucia]

Joined: 1 Feb 06
Posts: 66
Credit: 9,723
RAC: 0
Message 44132 - Posted: 18 Jan 2021, 8:53:06 UTC - in response to Message 44114.  

Thanks for this useful and constructive comment.

I will definitely try once again, doing as you suggested, and check the results. My PC has a slow HD... anyway, let's see.

It is not a matter of credits but of wasted time and helplessness: I had to reboot my machine once a day for 3 days in a row, so I could not reach 200 events on any day... and the job started from scratch the day after :(


Meanwhile, I am crunching Theory, which seems to be shorter, so if resume does not work, not that many hours are wasted.

Have a nice day!
Harri Liljeroos
Joined: 28 Sep 04
Posts: 585
Credit: 33,206,113
RAC: 18,479
Message 44146 - Posted: 18 Jan 2021, 15:13:31 UTC - in response to Message 44132.  

You can make your ATLAS tasks run faster if you give them more CPU cores. It is a multicore application. You can first try 2 CPUs and see how it goes. Note that it needs more RAM, but less than running two single-core tasks simultaneously.
Guiri-One[Andalucia]

Joined: 1 Feb 06
Posts: 66
Credit: 9,723
RAC: 0
Message 44147 - Posted: 18 Jan 2021, 15:22:27 UTC - in response to Message 44146.  

I have only 2 CPUs assigned. Normally that is not enough for ATLAS to finish in a reasonable time :)

Not sure if I can assign one more CPU once a task has started. Will try.
Harri Liljeroos
Joined: 28 Sep 04
Posts: 585
Credit: 33,206,113
RAC: 18,479
Message 44148 - Posted: 18 Jan 2021, 18:56:35 UTC - in response to Message 44147.  

The formula for the required memory is 3000 MB + n * 900 MB, where n is the number of CPU cores. So a single-core task requires 3900 MB, a 2-core task requires 4800 MB, a 3-core task requires 5700 MB, etc. I don't think you can change the number of cores after a task has started.
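
As a quick way to check whether a given core count fits into your RAM, the formula above can be written as a few lines of Python. This is only a sketch of the arithmetic quoted in this post; the function name atlas_memory_mb is made up for the example, and the 3000 MB and 900 MB constants are the ones stated above.

```python
def atlas_memory_mb(cores: int) -> int:
    """Memory needed by one ATLAS task, using 3000 MB + n * 900 MB from the post."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 3000 + 900 * cores


# Print the requirement for the core counts mentioned in the post.
for n in (1, 2, 3):
    print(f"{n} core(s): {atlas_memory_mb(n)} MB")
```

This prints 3900 MB, 4800 MB and 5700 MB for 1, 2 and 3 cores, the same values as listed above.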
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1103
Credit: 6,876,678
RAC: 849
Message 44154 - Posted: 19 Jan 2021, 10:02:54 UTC - in response to Message 44132.  

I will definitely try once again, doing as you suggested, and check the results. My PC has a slow HD... anyway, let's see.

It is not a matter of credits but of wasted time and helplessness: I had to reboot my machine once a day for 3 days in a row, so I could not reach 200 events on any day... and the job started from scratch the day after :(


Meanwhile, I am crunching Theory, which seems to be shorter, so if resume does not work, not that many hours are wasted.

Harri already gave a good explanation of how you can save your work when you have to reboot.

Since you have a slow disk and maybe several VMs running, before stopping BOINC you could suspend the VMs one by one with 'Leave in memory' off.
Meanwhile you can watch each job being saved to disk in VirtualBox Manager. Of course, first suspend all tasks that have not yet started.

Suspending ATLAS overnight is not a good idea. I think the maximum interruption for ATLAS and CMS is about 1 hour. They need a network connection almost all the time.
For Theory it is not a problem to suspend the task for longer periods.
Guiri-One[Andalucia]

Joined: 1 Feb 06
Posts: 66
Credit: 9,723
RAC: 0
Message 44198 - Posted: 26 Jan 2021, 8:50:56 UTC - in response to Message 44154.  

Hi,

I did as suggested. I simply "suspend" the task via BOINC Manager, go to VBox and watch the running image change its status to "saved", then let it be for a couple of minutes until I do the same with the next task, monitoring disk I/O to be sure it is done.
When I turn on my machine I resume the tasks one by one, following the same approach, always allowing an extra minute, just in case.

It worked with ATLAS perfectly :)

Thanks!

Javi


