Message boards : ATLAS application : Task will not end despite being 100%

greg_be

Joined: 28 Dec 08
Posts: 125
Credit: 1,137,119
RAC: 1,902
Message 37920 - Posted: 4 Feb 2019, 18:24:46 UTC

I have a task that has been running for 2 days and 9 hours now. It is 100% complete but will not end. I hate to abort it, but it looks like it's time.

Any suggestions?
ID: 37920
computezrmle

Joined: 15 Jun 08
Posts: 1107
Credit: 53,026,907
RAC: 133,477
Message 37921 - Posted: 4 Feb 2019, 19:11:45 UTC - in response to Message 37920.  

It's possible to shut down a VM gracefully.
This will at least save the stderr.txt for further analysis.

1. locate the task's "...\slots\x\shared" folder
2. create a completion trigger file there

This file doesn't need any content. Just create it.
For ATLAS its name must be "atlas_done"; for Theory its name must be "shutdown".
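
The two steps above can be sketched in Python. This is only a minimal sketch: the function name `create_trigger` and the `TRIGGER_NAMES` table are made up for illustration, and the path of the shared folder has to be supplied by you, since it depends on your BOINC data directory and the slot number of the stuck task.

```python
from pathlib import Path

# File names as given in the post above; the dict itself is illustrative.
TRIGGER_NAMES = {"ATLAS": "atlas_done", "Theory": "shutdown"}


def create_trigger(shared_dir, app):
    """Create the empty trigger file for `app` inside `shared_dir`.

    `shared_dir` is the task's "shared" folder below its slot directory;
    it is passed in by the caller and never guessed here.
    """
    shared = Path(shared_dir)
    if not shared.is_dir():
        raise FileNotFoundError(f"not a directory: {shared}")
    trigger = shared / TRIGGER_NAMES[app]
    trigger.touch(exist_ok=True)  # the file needs no content
    return trigger
```

Calling `create_trigger(path_to_shared_folder, "ATLAS")` would then leave an empty "atlas_done" file in that folder.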
ID: 37921
Nikola Petkov

Joined: 6 Oct 17
Posts: 2
Credit: 357,595
RAC: 0
Message 37999 - Posted: 12 Feb 2019, 22:19:22 UTC

I created this file after the ATLAS task was stuck on 100% and 0 seconds for 30+ hours. It almost immediately ended with a computation error. Of course, I got no credit for it. That is 3 days and 17 hours of 8 cores and electricity wasted. This is in addition to all the other ATLAS tasks that take forever and Theory tasks that randomly fail. I'll just use my resources more efficiently at WCG.
ID: 37999
computezrmle

Joined: 15 Jun 08
Posts: 1107
Credit: 53,026,907
RAC: 133,477
Message 38001 - Posted: 13 Feb 2019, 7:06:59 UTC - in response to Message 37999.  

At least 1 of your computers has lots of CPUs but not enough RAM, especially to run ATLAS in the way you configured it.

You may visit this page
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
and change the values "max #cpus" and "max #tasks" to "1" for the venue your computers are attached to.

Then request fresh work and check if the tasks finish successfully within 12-18h (Theory only, ATLAS will be faster).
Then slightly (!) raise the numbers above starting with "max #tasks" until your computers become unstable again.
Use the last stable values.
ID: 38001
Guiri-One[Andalucia]

Joined: 1 Feb 06
Posts: 43
Credit: 9,723
RAC: 0
Message 38003 - Posted: 13 Feb 2019, 14:38:14 UTC - in response to Message 37999.  

Definitely, it is not a very good solution...

I have plenty of tasks failing for similar "reasons". I have 8 GB of RAM and can't complete a single task if I use the laptop for internet surfing (the VM gets stuck, "JOB postponed"...).
ID: 38003
Nikola Petkov

Joined: 6 Oct 17
Posts: 2
Credit: 357,595
RAC: 0
Message 38005 - Posted: 14 Feb 2019, 4:21:54 UTC - in response to Message 38001.  

I have 16 GB and 24 GB of RAM so memory shouldn't be an issue. Theory consistently works on one computer and fails on the other. ATLAS is hit or miss. I will try your suggestions. Thanks!
ID: 38005
Jonathan

Joined: 25 Sep 17
Posts: 39
Credit: 1,318,735
RAC: 8,575
Message 38006 - Posted: 14 Feb 2019, 5:29:27 UTC - in response to Message 38005.  

For your tasks, set the number of CPUs to the number of true cores you have, or fewer. On the Intel computer, try six or fewer. I am not sure the VirtualBox machines can start up with more CPUs than the number of true cores you have, not counting Hyper-Threading or SMT (AMD). You may be able to run a 12-CPU workunit, since it looks like you have dual processors on the Intel machine. Try setting the AMD machine to 3 CPUs. You can make these adjustments using an 'app_config.xml'.
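
For illustration, here is a small Python sketch that generates the text of such a file. The element names (`app_config`, `app`, `max_concurrent`, `app_version`, `avg_ncpus`, `plan_class`) follow the BOINC client configuration format. The function name `build_app_config` is hypothetical, and the app name and plan class you pass in are assumptions that should be checked against the values in your own client_state.xml. It needs Python 3.9 or newer because of `ET.indent`.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, ncpus, max_concurrent=1):
    """Return app_config.xml text limiting one app to `ncpus` CPUs per task.

    `app_name` and `plan_class` are NOT known from this thread; the caller
    must look them up in client_state.xml.
    """
    root = ET.Element("app_config")

    # Limit how many tasks of this app run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # Limit how many CPUs each task of this app version is given.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)

    ET.indent(root)  # one element per line (Python 3.9+)
    return ET.tostring(root, encoding="unicode")
```

The returned text would be saved as app_config.xml in the project's folder inside the BOINC data directory, after which the client has to re-read its config files or be restarted. For the AMD machine mentioned above, the call might look like `build_app_config("ATLAS", your_plan_class, ncpus=3)`, where "ATLAS" as the app name is itself an assumption.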

You can also check out the checklist in the number crunching forum.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161
ID: 38006
greg_be

Joined: 28 Dec 08
Posts: 125
Credit: 1,137,119
RAC: 1,902
Message 38111 - Posted: 6 Mar 2019, 6:18:42 UTC

If you're talking to the other posters, then you need to mention them by name.
My system uses 4 physical cores and 3 virtual threads to run BOINC. Part of thread 8 or whatever is used to manage the GPU and all system tasks.

Memory is 16 gigs. Even with ATLAS I use at most 90% of my memory.

But it is, as someone else pointed out, a problem in the workunits themselves.
Also, the ATLAS crew sent out a command to cancel all work units I still had on my system.
Now my credit is next to 0 or 0.

One of these days they got to get things straightened out. This is nuts!
ID: 38111
Jonathan

Joined: 25 Sep 17
Posts: 39
Credit: 1,318,735
RAC: 8,575
Message 38113 - Posted: 6 Mar 2019, 9:00:10 UTC - in response to Message 38111.  

greg_be, as long as you reply to the correct post in a chain, it shows in the header of the message. Mine just showed
"Message 38006 - Posted: 14 Feb 2019, 5:29:27 UTC - in response to Message 38005. "
ID: 38113
bronco

Joined: 13 Apr 18
Posts: 441
Credit: 7,944,547
RAC: 10,704
Message 38118 - Posted: 6 Mar 2019, 15:15:00 UTC - in response to Message 38111.  

But it is as someone else pointed out a problem in the workunits themselves.
Well, you might have received 2 bad workunits, but without more evidence and history in the form of logs from your system it's difficult to know for sure what caused those 2 to fail to stop.

Also ATLAS crew sent out a command to cancel all work units I had still on my system.
Now my credit is next to 0 or 0.
It's perfectly normal and sensible for a project to cancel expired work units. If you don't know that by now then you have a lot of catching up to do.

One of these days they got to get things straightened out. This is nuts!

Wrong. You got to stop ignoring all the advice and face the fact that you're not doing what you need to do to run ATLAS.
ID: 38118
Gunde

Joined: 9 Jan 15
Posts: 33
Credit: 259,295,195
RAC: 296,669
Message 38196 - Posted: 9 Mar 2019, 22:50:11 UTC
Last modified: 9 Mar 2019, 22:57:08 UTC

I tested 4 tasks on a Windows host. All 4 got suspended for a few hours and resumed later on. All tasks ended after 3.5 hours and were validated (VirtualBox 5.2.26).

Ignore the estimated time, but if a task does not end after 2 days it would be time to abort it.
ID: 38196
greg_be

Joined: 28 Dec 08
Posts: 125
Credit: 1,137,119
RAC: 1,902
Message 38197 - Posted: 9 Mar 2019, 22:55:31 UTC - in response to Message 38113.  
Last modified: 9 Mar 2019, 23:02:18 UTC

greg_be, as long as you reply to the correct post in a chain, it shows in the header of the message. Mine just showed
"Message 38006 - Posted: 14 Feb 2019, 5:29:27 UTC - in response to Message 38005. "

Yeah, I realized too late what happened. That was just a reply, not a quote. Anyway... I'm not watching this post any more. I've got another one going that's generating more detailed and specific answers.
ID: 38197
Ewin

Joined: 30 Jun 15
Posts: 1
Credit: 6,011,947
RAC: 4,924
Message 38871 - Posted: 16 May 2019, 10:54:16 UTC

I'm having massive issues trying to get an ATLAS task to complete without giving a COMPUTATION ERROR.
The tasks are all 6-CPU tasks (I'd previously tried 8, but they never finish or they error out).
The 6-CPU tasks typically run for almost 3 days on my machine.

The computer is dual E5-2683 v3 giving 28 actual cores and RAM is 128GB.
Windows 10 Pro, Hyper-V is disabled.
BOINC Manager 7.12.2 (x64), VirtualBox 6.0.4 r128413 with extension pack

It crunches through Theory Simulation (6 CPU) tasks with no issue.

Please if someone could help me out, I really do want to contribute.

Thank you,
Ewin
ID: 38871
Yeti
Volunteer moderator

Joined: 2 Sep 04
Posts: 401
Credit: 84,485,808
RAC: 78,108
Message 38873 - Posted: 16 May 2019, 13:56:45 UTC - in response to Message 38871.  

I invite you to take a walk through my checklist


Supporting BOINC, a great concept !
ID: 38873

©2019 CERN