Message boards : ATLAS application : Very long tasks in the queue

Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,436,849
RAC: 102,955
Message 29692 - Posted: 28 Mar 2017, 7:46:01 UTC - in response to Message 29691.  

You can find info on the events and their processing time on the console as described in this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4170

If you see progress in the console then the task is good and it's worth letting it run.

Thanks, David. I've been using the console quite a lot lately, and it gives valuable information :-)

That's why I was somewhat confused that in BOINC the estimated time for the tasks is 2+ days (for a 2-core task which normally runs 4-5 hours). Any idea why?
ID: 29692
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,436,849
RAC: 102,955
Message 29694 - Posted: 28 Mar 2017, 9:47:40 UTC - in response to Message 29692.  

That's why I was somewhat confused that in BOINC the estimated time for the tasks is 2+ days (for a 2-core task which normally runs 4-5 hours). Any idea why?

On another machine, another task with taskID=10995522 started 10 minutes ago. It will most probably have 50 events, right?
BOINC shows 4+ days as remaining time.
In reality, this 1-core task will finish within 5-8 hours.

What exactly is it that confuses BOINC to such an extent?
ID: 29694
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 29696 - Posted: 28 Mar 2017, 10:13:44 UTC - in response to Message 29694.  
Last modified: 28 Mar 2017, 10:52:59 UTC

On another machine, another task with taskID=10995522 started 10 minutes ago. It will most probably have 50 events, right?

Jobs with taskID=10995522 have 100 events. (If you had run it on a dual core, the 2 cores would each do about 50 events, but depending on the event processing time the split could also be 49-51, 48-52, etc.)

BOINC shows 4+ days as remaining time.
In reality, this 1-core task will finish within 5-8 hours.

What exactly is it that confuses BOINC to such an extent?

The BOINC server calculates the duration as rsc_fpops_est (based on the history of tasks returned from your machine) / p_fpops (your benchmark).
On 25 March you returned a task with 136,526.20 CPU seconds.
The rsc_fpops_est is adjusted only very slowly after you return faster (possibly smaller) tasks.

E.g. from my machine: <rsc_fpops_est>1814400000000000</rsc_fpops_est> / <p_fpops>3730968000</p_fpops> makes about 5.6 days
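
The division described in this post can be checked with a few lines of Python. This is only a sketch of the plain quotient, using the two values quoted from my machine; the function name is made up for illustration, and the real BOINC client may apply further correction factors that are not covered in this thread.

```python
SECONDS_PER_DAY = 86_400

def raw_estimate_seconds(rsc_fpops_est: float, p_fpops: float) -> float:
    """Plain quotient: estimated floating-point operations / benchmarked speed."""
    return rsc_fpops_est / p_fpops

# Values quoted in this post.
rsc_fpops_est = 1_814_400_000_000_000
p_fpops = 3_730_968_000

seconds = raw_estimate_seconds(rsc_fpops_est, p_fpops)
days = seconds / SECONDS_PER_DAY
print(f"{seconds:,.0f} s = {days:.2f} days")
```

By this calculation the raw quotient is roughly 5.6 days, far above the few hours such a task really needs, which is the mismatch discussed here.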
ID: 29696
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 29699 - Posted: 28 Mar 2017, 19:02:33 UTC

Not from the 1000-events batch with taskID=10959636,
but jobs with taskID=10995517 and 'only' 100 events are also running rather long.
On a dual core VM about 7 events done in 2 hours (incl init-phase). Event average 1490 seconds.
Expected runtime on 2nd generation i7 >21 hours. BOINC calculates 106 hours.
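
The >21 hours figure follows from the numbers in this post. A minimal Python sketch of that arithmetic, assuming the 100 events are split evenly across the 2 cores and the per-event average stays at 1490 seconds (both are simplifications, and the variable names are only illustrative):

```python
EVENTS_TOTAL = 100        # events in a job with taskID=10995517
CORES = 2                 # dual-core VM
SECONDS_PER_EVENT = 1490  # observed average per event

# Time spent purely on event processing, with events shared evenly by the cores.
processing_hours = EVENTS_TOTAL * SECONDS_PER_EVENT / CORES / 3600

# Rough init-phase estimate: 2 hours elapsed minus the time for the first 7 events.
init_hours = 2 - 7 * SECONDS_PER_EVENT / CORES / 3600

total_hours = processing_hours + init_hours
print(f"processing {processing_hours:.1f} h + init {init_hours:.1f} h = {total_hours:.1f} h")
```

Under these assumptions the total comes out a little over 21 hours, still only about a fifth of the 106 hours BOINC calculates.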
ID: 29699
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,436,849
RAC: 102,955
Message 29700 - Posted: 28 Mar 2017, 19:58:52 UTC - in response to Message 29699.  

jobs with taskID=10995517 and 'only' 100 events are also running rather long.
On a dual core VM about 7 events done in 2 hours (incl init-phase). Event average 1490 seconds.
Expected runtime on 2nd generation i7 >21 hours. BOINC calculates 106 hours.

Same here. On the machine where the old quad-core Q9550 processor has a problem with 2-core tasks, I am trying a 3-core task (out of curiosity); it has taskID=10995517. And, to my big surprise, it has been running well for 3:15 hours now; the console shows 10 events processed.
According to the Windows Task Manager, 3 cores are being used, and RAM usage is in line with that.

I'll keep my fingers crossed for this 3-core task, but no 2-core task before has run more than about 15 minutes before failing (suddenly no CPU use, no RAM use).
So it would be very strange if this PC (with the old processor) could run a 3-core task but NOT a 2-core task.
ID: 29700
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,179,424
RAC: 105,469
Message 29701 - Posted: 28 Mar 2017, 20:46:19 UTC
Last modified: 28 Mar 2017, 20:47:13 UTC

3-core tasks run automatically with 4400 MByte RAM.
2-core and 1-core tasks need an app_config.xml with 4400 MByte RAM.
ID: 29701
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,994,026
RAC: 136,383
Message 29702 - Posted: 28 Mar 2017, 20:56:21 UTC - in response to Message 29701.  

3-core tasks run automatically with 4400 MByte RAM.
2-core and 1-core tasks need an app_config.xml with 4400 MByte RAM.

My 1 core WUs run with 3400 MB by default since the last project reset.
ID: 29702
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,436,849
RAC: 102,955
Message 29703 - Posted: 29 Mar 2017, 6:19:54 UTC - in response to Message 29700.  
Last modified: 29 Mar 2017, 6:30:40 UTC

I'll keep my fingers crossed for this 3-core task, but no 2-core task before has run more than about 15 minutes before failing (suddenly no CPU use, no RAM use).
So it would be very strange if this PC (with the old processor) could run a 3-core task but NOT a 2-core task.

Unfortunately, the 3-core task did not work out either. When I saw that only 2 out of 3 cores are utilized, I opened the console and saw the following:


Obviously, showing the image here does not work, so here is the URL of the image:

http://workupload.com/file/APXqjhw

Can anyone tell from the console what the problem is?
ID: 29703
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 29704 - Posted: 29 Mar 2017, 9:37:09 UTC - in response to Message 29703.  


Can anyone tell from the console what the problem is?

I'm not an expert, but obviously something is wrong with the virtual memory mapping.
ID: 29704
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 29705 - Posted: 29 Mar 2017, 10:50:24 UTC
Last modified: 29 Mar 2017, 10:50:53 UTC

Hm, I also have one PC that always crashed with a similar error while crunching MultiCoreWUs, so I switched it back to SingleCore.

Now, as only MultiCoreWUs are available, I have set it to use only 1 core and this seems to work. David has already checked the results and they are fine.

EDIT: Maybe this processor is really too old.


Supporting BOINC, a great concept !
ID: 29705
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,436,849
RAC: 102,955
Message 29706 - Posted: 29 Mar 2017, 11:32:44 UTC - in response to Message 29705.  

Now, as only MultiCoreWUs are available, I have set it to use only 1 core and this seems to work.

EDIT: Maybe this processor is really too old.

Here too, it works fine with a multi-core task limited to 1 core (using the other 3 cores for other projects).

You're probably right; the problem may be that the processor is too old.
ID: 29706
BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 29710 - Posted: 29 Mar 2017, 16:03:01 UTC - in response to Message 29686.  

Has anyone encountered, or had experiences with, any of the WUs with TaskID=11016767 ?
<snip>
I seem to have acquired four of them and the BOINC estimated completion time has hit the roof at 13hr28m.

So I bumped one of them to the front of the queue ... and it bombed after only 10mins run time with a crazy looking output log https://lhcathome.cern.ch/lhcathome/result.php?resultid=129069334.

Either someone was having fun when they coded it or else my machine has a nasty stutter and has severely mangled the content of that file! Anyway, back to the "regular" ones for now and I'll get around to the remaining long ones some time tomorrow.

I also have a failed task with such a weird-looking log file, but with another task ID:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=129219392


My workstation is also listed under your WU with an issue (which is what brought me to this thread)... I have had so many tasks come back with a validate error in the last 24 hours that I have since stopped tasks on LHC.

Any task I have downloaded that is more than 4-ish hours of work seems to run for anything up to an hour, but nowhere near the 1.xx days it is supposed to take... those that have run all come back with a validate error...
ID: 29710
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,436,849
RAC: 102,955
Message 29727 - Posted: 30 Mar 2017, 15:50:48 UTC

I noticed that there are quite a number of task-IDs around.
Is there any system by which the individual tasks can be characterized in any way on basis of the task-ID?
ID: 29727
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,994,026
RAC: 136,383
Message 29728 - Posted: 30 Mar 2017, 16:05:21 UTC - in response to Message 29727.  

I noticed that there are quite a number of task-IDs around.
Is there any system by which the individual tasks can be characterized in any way on basis of the task-ID?

Probably not what you want, but at least an overview.
http://lhcathome.web.cern.ch/projects/atlas
ID: 29728
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29730 - Posted: 30 Mar 2017, 19:42:02 UTC - in response to Message 29728.  

I noticed that there are quite a number of task-IDs around.
Is there any system by which the individual tasks can be characterized in any way on basis of the task-ID?

Probably not what you want, but at least an overview.
http://lhcathome.web.cern.ch/projects/atlas


That one is not yet automatically updated. You can see the up to date version still on the ATLAS@Home front page: http://atlasathome.cern.ch/
ID: 29730
Terrible T
Joined: 1 Nov 05
Posts: 8
Credit: 597,196
RAC: 0
Message 29733 - Posted: 31 Mar 2017, 6:26:41 UTC - in response to Message 29710.  

Also had a lot of validate errors overnight. Almost all with the same message in the log, all at message number 11 (e.g. WU 62921130).

Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2017-03-31 06:37:41,435 INFO Valgrind not engaged
: PyJobTransforms.trfExe.preExecute 2017-03-31 06:37:41,435 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
: PyJobTransforms.trfExe.execute 2017-03-31 06:37:41,435 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
: PyJobTransforms.trfExe.execute 2017-03-31 06:43:42,116 INFO EVNTtoHITS executor returns 33
: PyJobTransforms.trfExe.validate 2017-03-31 06:43:43,039 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (33) (Error code 65)
: Guest Log: PyJobTransforms.trfExe.validate 2017-03-31 06:43:43,066 INFO Scanning logfile log.EVNTtoHITS for errors
: PyJobTransforms.trfValidation.scanLogFile 2017-03-31 06:43:43,138 WARNING Found message number 11 at level ERROR - this and further messages will be supressed from the report
: PyJobTransforms.transform.execute 2017-03-31 06:43:43,139 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (33); Logfile error in log.EVNTtoHITS: "AtlasFieldSvc FATAL Could not book callback for /GLOBAL/BField/Maps"
: PyJobTransforms.transform.execute 2017-03-31 06:43:46,329 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (33); Logfile error in log.EVNTtoHITS: "AtlasFieldSvc FATAL Could not book callback for /GLOBAL/BField/Maps")


Faulty batch of WU's?
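
One way to check whether a set of failed results shares a single cause is to pull the exit code and the quoted logfile error out of each stderr text and count the combinations. The Python sketch below does this for the final WARNING line of the log above; the regular expressions are written against that one line only, and the function name is hypothetical.

```python
import re
from collections import Counter

EXIT_CODE_RE = re.compile(r"exiting early with exit code (\d+)")
LOGFILE_ERROR_RE = re.compile(r'Logfile error in (\S+): "([^"]+)"')

def summarize_failure(stderr_text):
    """Return (exit_code, error_message) from a transform log, or None if absent."""
    code = EXIT_CODE_RE.search(stderr_text)
    error = LOGFILE_ERROR_RE.search(stderr_text)
    if code is None or error is None:
        return None
    return int(code.group(1)), error.group(2)

sample = (
    "PyJobTransforms.transform.execute 2017-03-31 06:43:46,329 WARNING "
    "Transform now exiting early with exit code 65 (Non-zero return code from "
    'EVNTtoHITS (33); Logfile error in log.EVNTtoHITS: "AtlasFieldSvc FATAL '
    'Could not book callback for /GLOBAL/BField/Maps")'
)

# Count identical failures across several logs (here the same sample twice),
# skipping any log in which the pattern was not found.
logs = [sample, sample]
found = [summarize_failure(text) for text in logs]
counts = Counter(result for result in found if result is not None)
for (exit_code, message), n in counts.items():
    print(n, exit_code, message)
```

If most failed results report the same pair, a faulty batch becomes a more plausible explanation than a problem on one machine.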
ID: 29733
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,179,424
RAC: 105,469
Message 29739 - Posted: 1 Apr 2017, 8:39:35 UTC

Volunteers are reporting upload problems on the old ATLAS@Home server:

http://atlasathome.cern.ch/forum_thread.php?id=673&postid=6284#6284
ID: 29739
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 29846 - Posted: 6 Apr 2017, 17:02:35 UTC

Are there new "very long tasks" in the queue ?

In the console I see event numbers 134 / 140 / ...; the task already has 6 hours of runtime and claims to need 27 hours to finish.

Wouldn't something like this deserve a note in this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4178 ???


Supporting BOINC, a great concept !
ID: 29846
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 29847 - Posted: 6 Apr 2017, 17:45:05 UTC

What taskID do you find in stderr.txt?
ID: 29847
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29864 - Posted: 7 Apr 2017, 12:31:45 UTC - in response to Message 29847.  

Coincidentally I also have one longrunner at the moment at 180 events processed per core, but it's from the longrunner task 10959636. Is it the same for you?
ID: 29864


©2024 CERN