Atlas Simulation tasks stuck
Author | Message |
---|---|
Send message Joined: 30 Dec 15 Posts: 5 Credit: 3,895,568 RAC: 0 |
Good evening, I keep finding that ATLAS Simulation tasks never finish. They reach 100% after, let's say, 8 h, but 4 days later they are still at the same stage and count as "running" until I manually abort them. Any tips on how to troubleshoot? Example of a task that sat at 100% for days: Application: ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas); Name: ADlNDm6BAsyn9Rq4apoT9bVoABFKDmABFKDmw8qYDmABFKDmVnJmnm; State: Aborted by project; Received: 18/04/2021 04:31:59; Report deadline: 25/04/2021 04:31:59; Resources: 8 CPUs; Estimated computation size: 43,200 GFLOPs; CPU time: ---; Elapsed time: ---; Executable: vboxwrapper_26198ab7_windows_x86_64.exe. Thanks, Rene |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
It looks like you don't have enough memory. CMS tasks take about 3 GB each, and ATLAS probably takes almost as much on VBox. Even native ATLAS (on Linux) takes 2 GB. |
Send message Joined: 2 May 07 Posts: 2189 Credit: 173,343,512 RAC: 62,427 |
VBox projects (ATLAS, CMS and Theory) need some tuning before tasks complete reliably. Yeti's Checklist is very useful. For a first experience you can start with Theory and SixTrack. 16 GB of RAM is OK. ATLAS and CMS tasks must run to completion before you shut down the PC. |
Send message Joined: 30 Dec 15 Posts: 5 Credit: 3,895,568 RAC: 0 |
oh, so restarts due to OS updates can cause this behaviour? Do you maybe even have a link to that checklist? Thanks, Rene |
Send message Joined: 28 Sep 04 Posts: 709 Credit: 47,450,409 RAC: 28,876 |
"oh, so restarts due to OS updates can cause this behaviour?" The checklist is on this same forum, in one of the sticky threads. |
Send message Joined: 2 May 07 Posts: 2189 Credit: 173,343,512 RAC: 62,427 |
|
Send message Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
I have a couple of these also. They are going on their 5th day, always approaching 100% but never reaching it. One of them shows flashing Num Lock/Scroll Lock LEDs when I connect to the VM using VBox Manager. The other just shows the login screen and is still using 1 core of the client. I remember such tasks eventually ending and giving credit for the days of time consumed. Has something changed? I seriously do not want to abort 10 days of core usage for no credit if these will eventually end with credit. |
Send message Joined: 28 Dec 08 Posts: 334 Credit: 4,799,365 RAC: 2,329 |
I have more than enough memory and 16 cores. I have found that if BOINC runs more than one instance of ATLAS, the task stalls at around 90% and drags on forever, advancing by only 0.001% every few seconds. You need to limit the number of simultaneous LHC tasks. I even tried running Theory and ATLAS together, and ATLAS stalled all the time. If you run ATLAS alone and set the CPU count to 4 in the preferences, ATLAS will run only one task at a time, and a task should take under 8 hours to complete. If you run any other LHC projects, you will have to write a special script to force BOINC to run only 1 LHC task at a time. |
Send message Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
"I have more than enough memory and 16 cores." I have traced this behavior to the ATLAS job not saving its state properly. (There may be other causes, but this is certainly one.) In BOINC's advanced options, under computing preferences on the computing tab, set "switch between tasks every" to 9999 minutes so that ATLAS is never suspended when BOINC decides to swap WU tasks to accommodate the resource share of multiple projects. (Or isolate your other projects from ATLAS.) If you need to shut down BOINC, suspend your ATLAS WUs one at a time so they get saved properly in VBox Manager. |
Send message Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
"I have a couple of these also." Found a solution. Suspend the WU manually in BOINC. Open VBox Manager and find the newly saved-state ATLAS VM. Delete the saved state. Now the WU has to start over and the corrupt execution state is gone. The credit was pitiful, though, and since the WUs started from scratch, it would probably have been just as well to abort them. BUT, nothing new is learned unless you experiment. Task 314464634 (work unit 162960124): sent 22 Apr 2021, 11:19:02 UTC; reported 28 Apr 2021, 19:06:30 UTC; completed and validated; run time 497,012.25 s; CPU time 553,831.60 s; credit 265.24; ATLAS Simulation v2.00 (vbox64_mt_mcore_atlas) windows_x86_64. Task 313918637 (work unit 162691404): sent 21 Apr 2021, 9:24:05 UTC; reported 29 Apr 2021, 2:26:32 UTC; completed and validated; run time 536,236.41 s; CPU time 500,751.30 s; credit 408.92; ATLAS Simulation v2.00 (vbox64_mt_mcore_atlas) windows_x86_64. |
Send message Joined: 14 Jan 10 Posts: 1373 Credit: 9,156,130 RAC: 4,997 |
Maybe you could experiment with my xml-file for ATLAS. It does not save to disk when BOINC stops or the system shuts down, but instead takes regular snapshots of the ATLAS task. Replace the contents of ATLAS_vbox_2.00_job.xml with the example in the thread "ATLAS using VirtualBox with snapshots". In the options part of cc-config.xml you have to add the line <dont_check_file_sizes>1</dont_check_file_sizes> to keep the project from overwriting the adjusted file. You could change the snapshot (checkpoint) interval or increase the 'write to disk' interval in BOINC's preferences. |
Send message Joined: 30 Dec 15 Posts: 5 Credit: 3,895,568 RAC: 0 |
Thanks for all the tips; I've now started by adjusting the setting for switching between projects and will see how that goes. Otherwise I'll work through the other tips :) |
Send message Joined: 1 Jul 06 Posts: 1 Credit: 4,214,657 RAC: 0 |
cc-config.xml? Where is that located? I don't see a file named that anywhere under C:\program files or c:\programdata\ and none of the XML's located under C:\ProgramData\BOINC\projects\ I found seem to have that option line. |
Send message Joined: 15 Jun 08 Posts: 2500 Credit: 248,605,800 RAC: 126,654 |
It must be "cc_config.xml" instead of "cc-config.xml". See the BOINC manual: https://boinc.berkeley.edu/wiki/Client_configuration |
Send message Joined: 30 Dec 15 Posts: 5 Credit: 3,895,568 RAC: 0 |
Well, I have still had some in the last couple of days, but I have slapped more RAM into the device now. Let's see if that stops them from happening when no reboot disturbs them :) Will update again in a week or three. |
Send message Joined: 30 Dec 15 Posts: 5 Credit: 3,895,568 RAC: 0 |
Update: Happens less, but still happens. |
Send message Joined: 18 Apr 22 Posts: 3 Credit: 1,811,673 RAC: 0 |
I have the exact same issue. I keep opening the computer expecting the tasks to be complete. The last 'hours' take days and then the tasks typically fail. Elapsed time keeps ticking but remaining time doesn't change. I have 8 cores and 64 GB of RAM, so that's not the issue. I have changed my preferences as suggested above, but to no avail. All apps are up to date. I have tried re-loading the project as well. Nothing. |
Send message Joined: 15 Jun 08 Posts: 2500 Credit: 248,605,800 RAC: 126,654 |
You can make your computers visible to other volunteers here: https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project In addition, we need some information from a typical stderr.txt. Either point to a reported task (valid or invalid) or post the log from a currently running task (see BOINC's slots dir). |
Send message Joined: 2 May 07 Posts: 2189 Credit: 173,343,512 RAC: 62,427 |
Welcome Philip, you can work through Yeti's Checklist first. |
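
For readers who want to try the two configuration tips from this thread (running only one LHC task at a time, and keeping BOINC from overwriting an edited job file), here is a minimal Python sketch that only generates the XML text. It assumes BOINC's standard `<project_max_concurrent>` option in app_config.xml, which is one way to get the "only 1 task at a time" behaviour and is not something the posters above named; `<dont_check_file_sizes>` in cc_config.xml is the option quoted in the thread. The function names are hypothetical, and nothing is written to disk, so an existing cc_config.xml is not touched.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent: int = 1) -> str:
    """Return app_config.xml text that caps how many tasks of one project run at once."""
    root = ET.Element("app_config")
    # Assumption: a project-wide cap is wanted, rather than a per-application one.
    ET.SubElement(root, "project_max_concurrent").text = str(max_concurrent)
    ET.indent(root)  # pretty-printing; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


def build_cc_config() -> str:
    """Return cc_config.xml text with the file size check disabled, as suggested in the thread."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "dont_check_file_sizes").text = "1"
    ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print both fragments so they can be reviewed before being copied anywhere.
    print(build_app_config(1))
    print()
    print(build_cc_config())
```

As far as I know, app_config.xml belongs in the project's own folder under the BOINC data directory and cc_config.xml in the data directory itself, and the client picks up changes after "Read config files" in BOINC Manager or a client restart. If a cc_config.xml already exists, merge the option into its existing options section instead of replacing the file.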
©2024 CERN