Message boards : ATLAS application : Atlas Simulation tasks stuck

Rene Cleymans
Joined: 30 Dec 15
Posts: 5
Credit: 3,895,568
RAC: 0
Message 44799 - Posted: 23 Apr 2021, 19:42:56 UTC

Good evening,

I keep finding that ATLAS Simulation tasks never finish. They reach 100% after, let's say, 8 hours, but 4 days later they are still at the same stage and count as "running" until I manually abort them.

Any tips on how to troubleshoot?

Example of a task that sat at 100% for days:

Application: ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas)
Name: ADlNDm6BAsyn9Rq4apoT9bVoABFKDmABFKDmw8qYDmABFKDmVnJmnm
State: Aborted by project
Received: 18/04/2021 04:31:59
Report deadline: 25/04/2021 04:31:59
Resources: 8 CPUs
Estimated computation size: 43,200 GFLOPs
CPU time: ---
Elapsed time: ---
Executable: vboxwrapper_26198ab7_windows_x86_64.exe

Thanks,
Rene
ID: 44799

Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 44800 - Posted: 24 Apr 2021, 0:33:09 UTC - in response to Message 44799.  

It looks like you don't have enough memory. CMS tasks take about 3 GB each, and ATLAS tasks probably take almost as much on VBox.
Even native ATLAS (on Linux) takes 2 GB.
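
As a rough sanity check on the memory point, the sketch below estimates how many tasks of one kind fit in RAM at once. The per-task figures are the ballpark numbers quoted in this post (ATLAS on VBox is assumed equal to CMS); the function name, the 4 GB reserve for the host system and the 16 GB example are assumptions for illustration only.

```python
# Ballpark per-task memory in GB, taken from the figures quoted above.
# "atlas_vbox" is assumed equal to CMS ("probably almost as much").
TASK_RAM_GB = {"cms": 3.0, "atlas_vbox": 3.0, "atlas_native": 2.0}


def max_concurrent_tasks(total_ram_gb, task_kind, reserve_gb=4.0):
    """Return how many tasks of one kind fit after reserving RAM for the host.

    reserve_gb is a hypothetical allowance for the OS and other programs.
    """
    usable = total_ram_gb - reserve_gb
    if usable <= 0:
        return 0
    return int(usable // TASK_RAM_GB[task_kind])


if __name__ == "__main__":
    for kind in TASK_RAM_GB:
        print(kind, max_concurrent_tasks(16, kind))
```

Real memory use can differ from these round numbers, so the result is only a starting point for choosing how many tasks to allow.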
ID: 44800

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,084,304
RAC: 104,845
Message 44802 - Posted: 24 Apr 2021, 8:38:20 UTC

The VBox projects (ATLAS, CMS and Theory) need some tuning before tasks complete reliably.
Yeti's Checklist is very useful.
For a first try you can start with Theory and SixTrack. 16 GB of RAM is OK.
ATLAS and CMS tasks must run to completion before you shut down the PC.
ID: 44802

Rene Cleymans
Joined: 30 Dec 15
Posts: 5
Credit: 3,895,568
RAC: 0
Message 44824 - Posted: 26 Apr 2021, 16:38:53 UTC - in response to Message 44802.  

Oh, so restarts due to OS updates can cause this behaviour?
Do you perhaps have a link to that checklist?

Thanks,
Rene
ID: 44824

Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,149,324
RAC: 16,013
Message 44825 - Posted: 26 Apr 2021, 17:58:16 UTC - in response to Message 44824.  

Oh, so restarts due to OS updates can cause this behaviour?
Do you perhaps have a link to that checklist?

Thanks,
Rene

On this same forum, one of the sticky threads.
ID: 44825

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,084,304
RAC: 104,845
Message 44826 - Posted: 26 Apr 2021, 17:58:27 UTC - in response to Message 44824.  
Last modified: 26 Apr 2021, 17:59:05 UTC

ID: 44826

marmot
Joined: 5 Nov 15
Posts: 144
Credit: 6,301,268
RAC: 0
Message 44829 - Posted: 27 Apr 2021, 11:51:49 UTC

I have a couple of these also.

Going on the 5th day and always approaching 100% but never reaching it.
One of them shows flashing Num Lock/Scroll Lock LEDs when I connect to the VM using VBox Manager.
The other just shows the login screen and is still using 1 core of the client.

I remember such tasks eventually ending and giving credit for the days of time consumed.
Has something changed?

I seriously do not want to abort 10 days of core usage for no credit if these will eventually end with credit.
ID: 44829

greg_be
Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 44837 - Posted: 28 Apr 2021, 6:06:40 UTC

I have more than enough memory and 16 cores.
I have found that if BOINC runs more than one instance of ATLAS, the task stalls at around 90% and drags on forever. Progress then increases by only 0.001% every few seconds.
You need to limit the number of simultaneous LHC tasks. I even tried running Theory and ATLAS together, and ATLAS stalled every time.

Now if you run ATLAS alone and set the number of CPUs to 4 in the preferences, BOINC will run only one ATLAS task at a time. It should take under 8 hours to complete a task.
If you run any other LHC projects, you will have to write a special script to force BOINC to run only 1 LHC task at a time.
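
As a possible alternative to a custom script, the BOINC client reads an app_config.xml file from the project's folder, and its <project_max_concurrent> element caps how many tasks of that project run at once. The sketch below only generates the file content; the helper names are hypothetical, and the result would have to be saved as app_config.xml in the LHC@home project folder under the BOINC data directory and then loaded by re-reading the config files or restarting the client.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_tasks=1):
    """Build an app_config.xml tree limiting concurrent tasks for one project."""
    root = ET.Element("app_config")
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(max_tasks)
    return ET.ElementTree(root)


def render(tree):
    """Serialize the tree to a string for inspection or writing to disk."""
    return ET.tostring(tree.getroot(), encoding="unicode")


if __name__ == "__main__":
    # Prints the XML; save it as app_config.xml in the project folder.
    print(render(build_app_config(1)))
```

Whether a limit of one task actually cures the stall described here is not confirmed by the thread; the sketch only shows how such a limit could be expressed.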
ID: 44837

marmot
Joined: 5 Nov 15
Posts: 144
Credit: 6,301,268
RAC: 0
Message 44840 - Posted: 29 Apr 2021, 4:36:09 UTC - in response to Message 44837.  

I have more than enough memory and 16 cores.
I have found that if BOINC runs more than one instance of ATLAS, the task stalls at around 90% and drags on forever. Progress then increases by only 0.001% every few seconds.
You need to limit the number of simultaneous LHC tasks. I even tried running Theory and ATLAS together, and ATLAS stalled every time.

Now if you run ATLAS alone and set the number of CPUs to 4 in the preferences, BOINC will run only one ATLAS task at a time. It should take under 8 hours to complete a task.
If you run any other LHC projects, you will have to write a special script to force BOINC to run only 1 LHC task at a time.


I have traced this behavior to the ATLAS job not saving its state properly. (There may be other causes, but this is certainly one.)

In BOINC's advanced options (computing preferences, computing tab), set "switch between tasks every" to 9999 minutes so that ATLAS is never suspended when BOINC decides to swap tasks to honor the resource shares of multiple projects. (Or isolate your other projects from ATLAS.)

If you need to shut down BOINC, suspend your ATLAS WUs one at a time so they get saved properly in VBox Manager.
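
The "switch between tasks every" preference set locally in the Manager is stored in global_prefs_override.xml in the BOINC data directory, as the <cpu_scheduling_period_minutes> element. A minimal sketch of setting it from a script follows; the function name and the sample input are assumptions, and the code works on XML text only, so reading and writing the actual file is left out.

```python
import xml.etree.ElementTree as ET


def set_switch_interval(override_xml, minutes):
    """Return override XML with the task-switch interval set to `minutes`.

    Existing settings in the file are preserved; the element is created
    if it is missing.
    """
    root = ET.fromstring(override_xml)
    if root.tag != "global_preferences":
        raise ValueError("expected a <global_preferences> root element")
    node = root.find("cpu_scheduling_period_minutes")
    if node is None:
        node = ET.SubElement(root, "cpu_scheduling_period_minutes")
    node.text = str(minutes)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical existing override file with one unrelated setting.
    sample = "<global_preferences><max_ncpus_pct>50</max_ncpus_pct></global_preferences>"
    print(set_switch_interval(sample, 9999))
```

Changing the value in the Manager's preferences dialog has the same effect and is usually simpler; a script is mainly useful when several machines need the same setting.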
ID: 44840

marmot
Joined: 5 Nov 15
Posts: 144
Credit: 6,301,268
RAC: 0
Message 44841 - Posted: 29 Apr 2021, 4:38:25 UTC - in response to Message 44829.  
Last modified: 29 Apr 2021, 4:44:21 UTC

I have a couple of these also.

Going on the 5th day and always approaching 100% but never reaching it.



Found a solution.

Suspend the WU manually in BOINC.

Open VBox Manager and find the ATLAS VM with the newly saved state.
Delete (discard) the saved state.

Now the WU has to start over and the corrupt execution state is gone.
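
The same steps can also be done from the command line: `VBoxManage list vms` lists the registered VMs, and `VBoxManage discardstate` discards a saved state. The sketch below parses the list output and only builds the commands without running them. The "boinc_" name prefix is an assumption about how the wrapper names its VMs, the sample VM names and UUIDs are made up, and the task should be suspended in BOINC first, as described above.

```python
import re

# Each line of `VBoxManage list vms` looks like: "vm name" {uuid}
VM_LINE = re.compile(r'^"(?P<name>.*)"\s+\{(?P<uuid>[0-9a-fA-F-]+)\}\s*$')


def parse_vm_list(output):
    """Parse `VBoxManage list vms` output into (name, uuid) pairs."""
    vms = []
    for line in output.splitlines():
        match = VM_LINE.match(line.strip())
        if match:
            vms.append((match.group("name"), match.group("uuid")))
    return vms


def discard_commands(vms, name_prefix="boinc_"):
    """Build `VBoxManage discardstate` commands for VMs with the given prefix.

    The "boinc_" prefix is an assumption; check the VM names in
    VirtualBox Manager before running anything.
    """
    return [["VBoxManage", "discardstate", uuid]
            for name, uuid in vms if name.startswith(name_prefix)]


if __name__ == "__main__":
    # Made-up sample output for illustration.
    sample = (
        '"boinc_0123456789abcdef" {11111111-2222-3333-4444-555555555555}\n'
        '"my-linux-vm" {aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee}\n'
    )
    for cmd in discard_commands(parse_vm_list(sample)):
        # Printed only; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(" ".join(cmd))
```

Discarding the state throws away all progress of that VM, so the task restarts from scratch, exactly as with the manual route.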

Although the credit was pitiful and, since the WUs started from scratch, it would probably have been just as well to abort them.
BUT, nothing new is learned unless you experiment.

Task	Work unit	Sent	Time reported	Status	Run time (sec)	CPU time (sec)	Credit	Application
314464634	162960124	22 Apr 2021, 11:19:02 UTC	28 Apr 2021, 19:06:30 UTC	Completed and validated	497,012.25	553,831.60	265.24	ATLAS Simulation v2.00 (vbox64_mt_mcore_atlas) windows_x86_64
313918637	162691404	21 Apr 2021, 9:24:05 UTC	29 Apr 2021, 2:26:32 UTC	Completed and validated	536,236.41	500,751.30	408.92	ATLAS Simulation v2.00 (vbox64_mt_mcore_atlas) windows_x86_64
ID: 44841

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 44842 - Posted: 29 Apr 2021, 9:37:41 UTC

Maybe you could experiment with my XML file for ATLAS. It does not save the VM state to disk when BOINC stops or the system shuts down; instead it takes regular snapshots of the ATLAS task.

Replace the contents of ATLAS_vbox_2.00_job.xml with the example in the thread "ATLAS using VirtualBox with snapshots".

In the options part of cc-config.xml you have to add the line
<dont_check_file_sizes>1</dont_check_file_sizes> to prevent the project from overwriting the adjusted file.

You could change the snapshot (checkpoint) interval or increase the 'write to disk' interval in BOINC's preferences.
ID: 44842

Rene Cleymans
Joined: 30 Dec 15
Posts: 5
Credit: 3,895,568
RAC: 0
Message 44843 - Posted: 29 Apr 2021, 15:39:45 UTC

Thanks for all the tips. I've started by adjusting the task-switching setting and will see how that goes; otherwise I'll work through the other tips :)
ID: 44843

tschuldt
Joined: 1 Jul 06
Posts: 1
Credit: 4,214,657
RAC: 0
Message 45044 - Posted: 1 Jun 2021, 13:32:46 UTC

cc-config.xml? Where is that located? I don't see a file with that name anywhere under C:\Program Files or C:\ProgramData, and none of the XML files I found under C:\ProgramData\BOINC\projects\ seem to have that option line.
ID: 45044

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,894,316
RAC: 138,096
Message 45045 - Posted: 1 Jun 2021, 13:52:25 UTC - in response to Message 45044.  

It must be "cc_config.xml" instead of "cc-config.xml".
See the BOINC manual:
https://boinc.berkeley.edu/wiki/Client_configuration
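
For orientation: cc_config.xml sits in the BOINC data directory itself (on Windows typically C:\ProgramData\BOINC), not in the projects subfolder, and it may not exist until you create it. A minimal sketch of adding the option mentioned earlier in this thread to the <options> section follows; the function name is hypothetical, and the code works on XML text so that an existing file's other settings are kept.

```python
import xml.etree.ElementTree as ET


def enable_dont_check_file_sizes(cc_config_xml):
    """Return cc_config.xml text with dont_check_file_sizes set to 1 in <options>."""
    root = ET.fromstring(cc_config_xml)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # A minimal file; a real one may also contain <log_flags> and more options.
    print(enable_dont_check_file_sizes("<cc_config><options/></cc_config>"))
```

After saving the result as cc_config.xml, the client has to pick it up; restarting BOINC is the safe choice.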
ID: 45045

Rene Cleymans
Joined: 30 Dec 15
Posts: 5
Credit: 3,895,568
RAC: 0
Message 45425 - Posted: 7 Oct 2021, 11:36:52 UTC

Well, I have still had some in the last couple of days, but I have now put more RAM into the machine. Let's see if that stops them from happening when no reboot disturbs them :)

Will update again in a week or three.
ID: 45425

Rene Cleymans
Joined: 30 Dec 15
Posts: 5
Credit: 3,895,568
RAC: 0
Message 45434 - Posted: 15 Oct 2021, 11:38:30 UTC - in response to Message 45425.  

Update: Happens less, but still happens.
ID: 45434

Philip Nicholson
Joined: 18 Apr 22
Posts: 3
Credit: 1,811,673
RAC: 0
Message 46687 - Posted: 27 Apr 2022, 12:22:05 UTC - in response to Message 45434.  

I have the exact same issue. I keep opening the computer expecting the tasks to be complete. The last 'hours' take days, and then the tasks typically fail.
Elapsed time keeps ticking but remaining time doesn't change.
I have 8 cores and 64 GB of RAM, so that's not the issue.
I have changed my preferences as suggested above, but to no avail.
All apps are up to date. I have tried reloading the project as well. Nothing.
ID: 46687

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,894,316
RAC: 138,096
Message 46688 - Posted: 27 Apr 2022, 12:44:55 UTC - in response to Message 46687.  

You can make your computers visible to other volunteers here:
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project

In addition, some information from a typical stderr.txt is needed. Either
- point to a reported task (valid or invalid), or
- post the log from a currently running task (see BOINC's slots dir)
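
To collect the log of a running task, a short script can look for stderr.txt in each numbered folder under the slots directory and show the end of each file. This is only a sketch: the function names are hypothetical, and the data directory in the example is the typical Windows location, which may differ on your machine.

```python
from pathlib import Path


def find_stderr_logs(data_dir):
    """Return stderr.txt paths found in <data_dir>/slots/<n>/, sorted by path."""
    slots = Path(data_dir) / "slots"
    if not slots.is_dir():
        return []
    return sorted(p for p in slots.glob("*/stderr.txt") if p.is_file())


def tail(path, lines=40):
    """Return the last `lines` lines of a text file, replacing undecodable bytes."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return "\n".join(text.splitlines()[-lines:])


if __name__ == "__main__":
    # Typical Windows location; adjust to your BOINC data directory.
    for log in find_stderr_logs(r"C:\ProgramData\BOINC"):
        print(f"==== {log} ====")
        print(tail(log))
```

The printed tail can then be pasted into a forum post together with the slot it came from.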
ID: 46688

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,084,304
RAC: 104,845
Message 46692 - Posted: 27 Apr 2022, 21:59:55 UTC - in response to Message 46687.  

Welcome Philip,
you can work through Yeti's Checklist first.
ID: 46692

©2024 CERN