Thread 'Atlas Simulation tasks stuck'

Author	Message
Rene Cleymans Joined: 30 Dec 15 Posts: 5 Credit: 3,895,568 RAC: 0	Message 44799 - Posted: 23 Apr 2021, 19:42:56 UTC
Good evening,
I keep noticing that Atlas Simulation tasks never finish. They reach 100% after, let's say, 8 hours, but 4 days later they are still in the same stage and count as "running" until I manually abort them. Any tips on how to troubleshoot?
Example of a task that sat at 100% for days:
Application: ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas)
Name: ADlNDm6BAsyn9Rq4apoT9bVoABFKDmABFKDmw8qYDmABFKDmVnJmnm
State: Aborted by project
Received: 18/04/2021 04:31:59
Report deadline: 25/04/2021 04:31:59
Resources: 8 CPUs
Estimated computation size: 43,200 GFLOPs
CPU time: ---
Elapsed time: ---
Executable: vboxwrapper_26198ab7_windows_x86_64.exe
Thanks,
Rene

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 44800 - Posted: 24 Apr 2021, 0:33:09 UTC - in response to Message 44799.
It looks like you don't have enough memory. The CMS tasks take about 3 GB each, and ATLAS probably almost as much on VBox. But even the native ATLAS (on Linux) takes 2 GB.

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,587,310 RAC: 30,683	Message 44802 - Posted: 24 Apr 2021, 8:38:20 UTC
VBox projects (Atlas, CMS and Theory) need some tuning before tasks complete successfully. Yeti's Checklist is very useful. For a first experience you can use Theory and Sixtrack. 16 GByte RAM is OK. Atlas and CMS tasks must run to completion before you stop the PC.

Rene Cleymans Joined: 30 Dec 15 Posts: 5 Credit: 3,895,568 RAC: 0	Message 44824 - Posted: 26 Apr 2021, 16:38:53 UTC - in response to Message 44802.
Oh, so restarts due to OS updates can cause this behaviour? Do you maybe even have a link to that checklist?
Thanks,
Rene

Harri Liljeroos Joined: 28 Sep 04 Posts: 804 Credit: 65,710,021 RAC: 24,462	Message 44825 - Posted: 26 Apr 2021, 17:58:16 UTC - in response to Message 44824.
"oh, so restarts due to OS updates can cause this behaviour? Do you maybe even have a link to that checklist? Thanks, Rene"
On this same forum, one of the sticky threads.

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,587,310 RAC: 30,683	Message 44826 - Posted: 26 Apr 2021, 17:58:27 UTC - in response to Message 44824. Last modified: 26 Apr 2021, 17:59:05 UTC
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161&postid=29359#29359

marmot Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0	Message 44829 - Posted: 27 Apr 2021, 11:51:49 UTC
I have a couple of these also. Going on the 5th day and always approaching 100% but never reaching it. One of them shows flashing Num Lock/Scroll Lock LEDs when connecting to the VM using VBox Manager. The other just shows the login screen and is still using 1 core of the client.
I remember the tasks eventually ending and giving credit for the days of time consumed. Has something changed? I seriously do not want to abort 10 days of core usage for no credit if these will eventually end with credit.

greg_be Joined: 28 Dec 08 Posts: 349 Credit: 6,773,164 RAC: 1,281	Message 44837 - Posted: 28 Apr 2021, 6:06:40 UTC
I have more than enough memory and 16 cores. I have found that if BOINC runs more than one instance of ATLAS the task will stall at around 90% and drag on forever. It will only increase in finishing by .001% every few seconds. You need to limit the number of simultaneous tasks by LHC. I even tried running Theory and ATLAS together and ATLAS stalled out all the time.
Now if you run ATLAS alone and set your cpu number to 4 in the preferences, then it will make ATLAS run only one task at a time. It should take under 8 hours to complete a task. If you run any other LHC projects, then you will have to write a special script to force BOINC to run only 1 task from LHC at a time.
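
A note on the "special script" mentioned above: BOINC can read a per-project app_config.xml, and a limit of one running LHC task can be expressed there with project_max_concurrent. The following is only a minimal sketch, assuming Python 3.9+ and that the result is saved as app_config.xml in the LHC@home project folder under the BOINC data directory. The function name build_app_config is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_tasks: int = 1) -> str:
    """Return the text of an app_config.xml that caps running project tasks.

    Hypothetical helper: it only builds the XML text, it does not write
    anything into the BOINC data directory.
    """
    root = ET.Element("app_config")
    # project_max_concurrent limits how many tasks of this project run at once.
    ET.SubElement(root, "project_max_concurrent").text = str(max_tasks)
    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")
```

After saving the file, the client has to re-read its config files (or be restarted) before the limit applies.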

marmot Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0	Message 44840 - Posted: 29 Apr 2021, 4:36:09 UTC - in response to Message 44837.
"I have more than enough memory and 16 cores. I have found that if BOINC runs more than one instance of ATLAS the task will stall at around 90% and drag on forever. It will only increase in finishing by .001% every few seconds. You need to limit the number of simultaneous tasks by LHC. I even tried running Theory and ATLAS together and ATLAS stalled out all the time. Now if you run ATLAS alone and set your cpu number to 4 in the preferences, then it will make ATLAS run only one task at a time. It should take under 8 hours to complete a task. If you run any other LHC projects, then you will have to write a special script to force BOINC to run only 1 task from LHC at a time."
I have traced this behavior to the ATLAS job not saving its state properly. (There may be other causes, but this is certainly one.)
In BOINC advanced options, computing preferences, computing tab, set "switch between tasks every" to 9999 minutes so that ATLAS is never suspended when BOINC decides to swap WU tasks to accommodate the resource share of multiple projects. (Or isolate your other projects from ATLAS.)
If you need to shut down BOINC, then suspend your ATLAS WUs one at a time so they get saved properly in VBox Manager.
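
For machines managed without the GUI, the "switch between tasks every" setting can probably also be set through a local preferences override file. This is a rough sketch under assumptions: that the file is global_prefs_override.xml in the BOINC data directory and that the matching tag is cpu_scheduling_period_minutes. The function name build_prefs_override is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(switch_minutes: int = 9999) -> str:
    """Return override-file text that sets the task switch interval.

    Assumption: cpu_scheduling_period_minutes is the tag behind the
    "switch between tasks every N minutes" preference.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "cpu_scheduling_period_minutes").text = str(switch_minutes)
    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")
```

If an override file already exists, it is safer to merge this one element into it than to replace the whole file, since local preferences set in BOINC Manager may be stored there as well.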

marmot Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0	Message 44841 - Posted: 29 Apr 2021, 4:38:25 UTC - in response to Message 44829. Last modified: 29 Apr 2021, 4:44:21 UTC
"I have a couple of these also. Going on the 5th day and always approaching 100% but never reaching it."
Found a solution. Suspend the WU manually in BOINC. Open VBox Manager and find the newly saved-state ATLAS VM. Delete the saved state. Now the WU has to start over and the corrupt execution state is gone.
The credit was pitiful, though, and since the WUs started from scratch, it would probably have been just as well to abort them. BUT, nothing new is learned unless you experiment.
314464634 162960124 22 Apr 2021, 11:19:02 UTC 28 Apr 2021, 19:06:30 UTC Completed and validated 497,012.25 553,831.60 265.24 ATLAS Simulation v2.00 (vbox64_mt_mcore_atlas) windows_x86_64
313918637 162691404 21 Apr 2021, 9:24:05 UTC 29 Apr 2021, 2:26:32 UTC Completed and validated 536,236.41 500,751.30 408.92 ATLAS Simulation v2.00 (vbox64_mt_mcore_atlas) windows_x86_64
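
To put "pitiful" into numbers, the two task rows above can be turned into credit per elapsed hour. This small sketch assumes the three figures in each row are run time in seconds, CPU time in seconds and granted credit, in that order (the usual BOINC task list layout; the column headers are not shown in the post).

```python
def credit_per_hour(run_time_s: float, credit: float) -> float:
    """Credit granted per hour of elapsed (wall clock) run time."""
    return credit / (run_time_s / 3600.0)


# (run time in seconds, credit) copied from the two rows in the post,
# keyed by task ID; the column meaning is an assumption.
tasks = {
    314464634: (497_012.25, 265.24),
    313918637: (536_236.41, 408.92),
}

for task_id, (run_time_s, credit) in tasks.items():
    days = run_time_s / 86_400
    rate = credit_per_hour(run_time_s, credit)
    print(f"task {task_id}: {days:.1f} days elapsed, {rate:.2f} credits/hour")
```

Both tasks ran for well over five days of wall clock time, so the rate comes out at only a few credits per hour.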

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1553 Credit: 10,081,726 RAC: 1,368	Message 44842 - Posted: 29 Apr 2021, 9:37:41 UTC
Maybe you could experiment with my xml-file for ATLAS. It does not save to disk when BOINC stops or the system shuts down, but instead takes regular snapshots of the ATLAS task.
Replace the contents of the ATLAS_vbox_2.00_job.xml with the example in the thread "ATLAS using VirtualBox with snapshots".
In the options part of cc-config.xml you have to add a line <dont_check_file_sizes>1</dont_check_file_sizes> to avoid the project overwriting the adjusted file.
You could change the snapshot (checkpoint) interval or increase the 'write to disk' interval in BOINC's preferences.

Rene Cleymans Joined: 30 Dec 15 Posts: 5 Credit: 3,895,568 RAC: 0	Message 44843 - Posted: 29 Apr 2021, 15:39:45 UTC
Thanks for all the tips; I've now gone with adjusting the project-swapping settings first and will see how that goes, otherwise I'll work through the other tips :)

tschuldt Joined: 1 Jul 06 Posts: 1 Credit: 4,214,657 RAC: 0	Message 45044 - Posted: 1 Jun 2021, 13:32:46 UTC
cc-config.xml? Where is that located? I don't see a file with that name anywhere under C:\program files or c:\programdata\, and none of the XML files I found under C:\ProgramData\BOINC\projects\ seem to have that option line.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2749 Credit: 302,735,526 RAC: 75,861	Message 45045 - Posted: 1 Jun 2021, 13:52:25 UTC - in response to Message 45044.
It must be "cc_config.xml" instead of "cc-config.xml".
See the BOINC manual: https://boinc.berkeley.edu/wiki/Client_configuration
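
As a rough illustration of the option discussed earlier in the thread, the sketch below adds <dont_check_file_sizes>1</dont_check_file_sizes> to the options part of a cc_config.xml, keeping whatever the file already contains. It assumes Python 3.9+ and that the file lives in the BOINC data directory; the function name add_dont_check_file_sizes is made up for this example, and it works on the XML text only, so reading and writing the actual file is left to the user.

```python
import xml.etree.ElementTree as ET


def add_dont_check_file_sizes(cc_config_xml: str = "") -> str:
    """Return cc_config.xml text with dont_check_file_sizes set to 1.

    Pass the current file contents, or an empty string if the file does
    not exist yet. Existing sections such as log_flags are preserved.
    """
    if cc_config_xml.strip():
        root = ET.fromstring(cc_config_xml)
    else:
        root = ET.Element("cc_config")

    # The flag belongs inside the <options> section; create it if missing.
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1"

    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")
```

Calling it with an empty string yields a new file body; calling it with the text of an existing cc_config.xml only adds or updates the one flag. The client needs to re-read its config files afterwards.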

Rene Cleymans Joined: 30 Dec 15 Posts: 5 Credit: 3,895,568 RAC: 0	Message 45425 - Posted: 7 Oct 2021, 11:36:52 UTC
Well, I have still had some in the last couple of days, but I have slapped more RAM into the device now. Let's see if that stops them from happening, with no reboot disturbing them :) Will update again in a week or three.

Rene Cleymans Joined: 30 Dec 15 Posts: 5 Credit: 3,895,568 RAC: 0	Message 45434 - Posted: 15 Oct 2021, 11:38:30 UTC - in response to Message 45425.
Update: Happens less, but still happens.

Philip Nicholson Joined: 18 Apr 22 Posts: 3 Credit: 1,811,673 RAC: 0	Message 46687 - Posted: 27 Apr 2022, 12:22:05 UTC - in response to Message 45434.
I have the exact same issue. I keep opening the computer expecting the tasks to be complete. The last 'hours' take days and then the tasks typically fail. Elapsed time keeps ticking but remaining time doesn't change. I have 8 cores and 64 GB of RAM, so that's not the issue. I have changed my preferences as suggested above, but to no avail. All apps are up to date. I have tried reloading the project as well. Nothing.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2749 Credit: 302,735,526 RAC: 75,861	Message 46688 - Posted: 27 Apr 2022, 12:44:55 UTC - in response to Message 46687.
You may make your computers visible to other volunteers here:
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
In addition, it is necessary to get some information from a typical stderr.txt. Either
- take it from a reported task (valid or invalid), or
- post that log from a currently running task (see BOINC's slots dir).
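
For the second option, a small helper can collect the logs of running tasks. This is only a sketch, assuming each running task keeps a stderr.txt in its own numbered folder under the slots directory of the BOINC data directory (for example C:\ProgramData\BOINC on Windows). The function names find_slot_logs and tail are invented for illustration.

```python
from pathlib import Path


def find_slot_logs(boinc_data_dir: str) -> list[Path]:
    """Return every stderr.txt found one level below <data dir>/slots."""
    slots = Path(boinc_data_dir) / "slots"
    return sorted(slots.glob("*/stderr.txt"))


def tail(path: Path, lines: int = 40) -> str:
    """Return the last few lines of a log, tolerating odd characters."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return "\n".join(text.splitlines()[-lines:])


# Example use (the path is an assumption, adjust it to your installation):
# for log in find_slot_logs(r"C:\ProgramData\BOINC"):
#     print(log)
#     print(tail(log))
```

The tail of the log is usually enough for a forum post; the full file can follow if someone asks for it.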

maeax Joined: 2 May 07 Posts: 2298 Credit: 179,587,310 RAC: 30,683	Message 46692 - Posted: 27 Apr 2022, 21:59:55 UTC - in response to Message 46687.
Welcome Philip,
you can work through Yeti's Checklist first.