Very long tasks in the queue
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1686 Credit: 100,436,849 RAC: 102,955 |
Quote: "You can find info on the events and their processing time on the console as described in this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4170" Thanks, David. I've been using the console quite a lot lately, and it gives valuable information :-) That's why I was somewhat confused that in BOINC the estimated time for these tasks is 2+ days (for a 2-core task that normally runs 4-5 hours). Any idea why? |
Joined: 18 Dec 15 Posts: 1686 Credit: 100,436,849 RAC: 102,955 |
Quote: "That's why I was somewhat confused that in BOINC the estimated time for these tasks is 2+ days (for a 2-core task that normally runs 4-5 hours). Any idea why?" On another machine, another task with taskID=10995522 started 10 minutes ago. It will most probably be 50 events, right? BOINC shows 4+ days as remaining time. In reality, this 1-core task will finish within 5-8 hours. What exactly is it that confuses BOINC to such an extent? |
Joined: 14 Jan 10 Posts: 1268 Credit: 8,421,616 RAC: 2,139 |
Quote: "On another machine, another task with taskID=10995522 started 10 minutes ago. It will most probably be 50 events, right?" Jobs from taskID=10995522 have 100 events. (If you had run it on a dual core, the 2 cores would each do about 50 events, but depending on the event processing time it could also be 49/51, 48/52, etc.) Quote: "BOINC shows 4+ days as remaining time." The BOINC server calculates the duration as rsc_fpops_est (from the history of tasks returned by your machine) / p_fpops (your benchmark). On 25 March you returned a task that used 136,526.20 CPU seconds, and rsc_fpops_est is adjusted only very slowly as faster (possibly smaller) tasks are returned. E.g. from my machine: `<rsc_fpops_est>1814400000000000</rsc_fpops_est>` / `<p_fpops>3730968000</p_fpops>` makes about 486,300 seconds, i.e. roughly 5.6 days. |
Joined: 14 Jan 10 Posts: 1268 Credit: 8,421,616 RAC: 2,139 |
Not from the 1000-event batch with taskID=10959636, but jobs with taskID=10995517 and 'only' 100 events are also running rather long. On a dual-core VM, about 7 events were done in 2 hours (including the init phase), with an average of 1490 seconds per event. Expected runtime on a 2nd-generation i7: more than 21 hours. BOINC calculates 106 hours. |
Joined: 18 Dec 15 Posts: 1686 Credit: 100,436,849 RAC: 102,955 |
Quote: "Jobs with taskID=10995517 and 'only' 100 events are also running rather long." Same here. On the machine whose old quad-core Q9550 processor has a problem with 2-core tasks, I am trying a 3-core task (out of curiosity); it has taskID=10995517. To my big surprise, it has been running well for 3:15 hours now, and the console shows 10 tasks processed. According to the Windows Task Manager, 3 cores are being used, and RAM usage matches. I'll keep my fingers crossed for this 3-core task, but no 2-core task on this machine has ever run more than about 15 minutes before failing (suddenly no CPU use, no RAM use). So it would be very strange if this PC (with the old processor) could run a 3-core task but NOT a 2-core task. |
Joined: 2 May 07 Posts: 2071 Credit: 156,179,424 RAC: 105,469 |
3-core tasks run automatically with 4400 MB of RAM. 2-core and 1-core tasks need an app_config.xml that sets 4400 MB of RAM. |
Joined: 15 Jun 08 Posts: 2386 Credit: 222,994,026 RAC: 136,383 |
Quote: "3-core tasks run automatically with 4400 MB of RAM." My 1-core WUs have run with 3400 MB by default since the last project reset. |
Joined: 18 Dec 15 Posts: 1686 Credit: 100,436,849 RAC: 102,955 |
Quote: "I'll keep my fingers crossed for this 3-core task, but no 2-core task on this machine has ever run more than about 15 minutes before failing (suddenly no CPU use, no RAM use)." Unfortunately, the 3-core task did not work out either. When I saw that only 2 of the 3 cores were being used, I opened the console and saw the following (embedding the image here does not work, so here is its URL): http://workupload.com/file/APXqjhw Can anyone tell from the console what the problem is? |
Joined: 14 Jan 10 Posts: 1268 Credit: 8,421,616 RAC: 2,139 |
I'm not an expert, but something is obviously wrong with the virtual memory mapping. |
Joined: 2 Sep 04 Posts: 453 Credit: 193,369,412 RAC: 10,065 |
Hm, I also have one PC that always crashed with a similar error while crunching multi-core WUs, so I switched back to single-core. Now, as only multi-core WUs are available, I have set it to use only 1 core, and this seems to work. David has already checked the results and they are fine. EDIT: Maybe this processor really is too old. Supporting BOINC, a great concept! |
Joined: 18 Dec 15 Posts: 1686 Credit: 100,436,849 RAC: 102,955 |
Quote: "Now, as only multi-core WUs are available, I have set it to use only 1 core, and this seems to work." Same here: the multi-core app works fine on 1 core (with the other 3 cores used for other projects). You're probably right; the problem may be that the processor is too old. |
Joined: 23 Dec 16 Posts: 26 Credit: 776,007 RAC: 0 |
Quote: "Has anyone encountered, or had experiences with, any of the WUs with TaskID=11016767?" My workstation is also listed under your WU with an issue (which is what brought me to this thread). I have had so many tasks come back with validate errors in the last 24 hours that I have stopped running LHC tasks. Any task I downloaded with more than about 4 hours of work runs for anywhere up to an hour, nowhere near the 1.xx days it is estimated to take, and all of those that have run came back with a validate error. |
Joined: 18 Dec 15 Posts: 1686 Credit: 100,436,849 RAC: 102,955 |
I noticed that there are quite a number of task IDs around. Is there any system by which the individual tasks can be characterized on the basis of their task ID? |
Joined: 15 Jun 08 Posts: 2386 Credit: 222,994,026 RAC: 136,383 |
Quote: "I noticed that there are quite a number of task IDs around." Probably not what you want, but at least an overview: http://lhcathome.web.cern.ch/projects/atlas |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Quote: "I noticed that there are quite a number of task IDs around." That one is not yet updated automatically. You can still see the up-to-date version on the ATLAS@Home front page: http://atlasathome.cern.ch/ |
Joined: 1 Nov 05 Posts: 8 Credit: 597,196 RAC: 0 |
I also had a lot of validate errors overnight, almost all with the same message in the log, always at message #11 (e.g. WU 62921130). Guest log: `PyJobTransforms.trfExe._writeAthenaWrapper 2017-03-31 06:37:41,435 INFO Valgrind not engaged` A faulty batch of WUs? |
Joined: 2 May 07 Posts: 2071 Credit: 156,179,424 RAC: 105,469 |
Volunteers are reporting upload problems with the old ATLAS@Home server: http://atlasathome.cern.ch/forum_thread.php?id=673&postid=6284#6284 |
Joined: 2 Sep 04 Posts: 453 Credit: 193,369,412 RAC: 10,065 |
Are there new "very long tasks" in the queue? In the console I see event numbers 134, 140, ...; the task already has 6 hours of runtime and claims to need 27 hours to finish. Shouldn't something like this be announced in this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4178 ? |
Joined: 14 Jan 10 Posts: 1268 Credit: 8,421,616 RAC: 2,139 |
What taskID do you find in stderr.txt? |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Coincidentally, I also have one long-runner at the moment, at 180 events processed per core, but it's from the long-running task 10959636. Is it the same for you? |
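The thread contrasts two runtime estimates: BOINC's, which the reply about taskID=10995522 describes as rsc_fpops_est / p_fpops, and the one you can extrapolate from the console (events per core times the average event time). The Python sketch below reproduces both using the numbers quoted in the thread. It is only a sketch under stated assumptions: the even split of events across cores is a simplification (the thread notes the real split can be 49/51, 48/52, etc.), and `init_seconds` is a hypothetical parameter for the init phase, whose length the thread does not give.

```python
import math

SECONDS_PER_DAY = 86_400


def boinc_estimate_seconds(rsc_fpops_est: float, p_fpops: float) -> float:
    """Estimated duration = rsc_fpops_est / p_fpops, as described in the thread."""
    return rsc_fpops_est / p_fpops


def event_based_estimate_seconds(total_events: int, cores: int,
                                 avg_event_seconds: float,
                                 init_seconds: float = 0.0) -> float:
    """Wall time if events are split evenly across cores (simplified model).

    The busiest core processes ceil(total_events / cores) events.
    init_seconds is a hypothetical, user-supplied overhead for the init phase.
    """
    events_per_core = math.ceil(total_events / cores)
    return init_seconds + events_per_core * avg_event_seconds


if __name__ == "__main__":
    # Benchmark and fpops values quoted in the thread.
    boinc = boinc_estimate_seconds(1_814_400_000_000_000, 3_730_968_000)
    print(f"BOINC estimate: {boinc / SECONDS_PER_DAY:.2f} days")

    # 100-event job on a dual-core VM with ~1490 s per event (from the thread).
    events = event_based_estimate_seconds(total_events=100, cores=2,
                                          avg_event_seconds=1490)
    print(f"Event-based estimate: {events / 3600:.1f} hours (plus init phase)")
```

With the thread's figures, the fpops formula gives about 5.6 days, while the event-based extrapolation gives about 20.7 hours plus the init phase, which illustrates the gap people were seeing. This models only the formula quoted in the thread, not BOINC's full estimation logic.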
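Several posts mention giving 2-core and 1-core ATLAS tasks 4400 MB of RAM, or restricting the app to 1 core, via app_config.xml. The Python sketch below writes such a file. The app name `ATLAS`, the plan class `vbox64_mt_mcore_atlas`, and the wrapper options `--nthreads` / `--memory_size_mb` are assumptions that this thread does not state. Compare them with the entries in your own client_state.xml before using the output.

```python
import xml.etree.ElementTree as ET

# ASSUMPTIONS (verify against your own client_state.xml):
#   - app name "ATLAS" and plan class "vbox64_mt_mcore_atlas"
#   - the VirtualBox wrapper accepts "--nthreads N --memory_size_mb M" via <cmdline>


def build_app_config(app_name: str, plan_class: str,
                     ncpus: int, memory_mb: int) -> str:
    """Return app_config.xml text for one app_version with a fixed core count and RAM."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # avg_ncpus tells the BOINC client how many cores to reserve for each task.
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # cmdline is passed to the wrapper: thread count and VM memory size in MB.
    ET.SubElement(version, "cmdline").text = (
        f"--nthreads {ncpus} --memory_size_mb {memory_mb}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 2-core task with the 4400 MB suggested in the thread.
    print(build_app_config("ATLAS", "vbox64_mt_mcore_atlas",
                           ncpus=2, memory_mb=4400))
```

Save the printed XML as app_config.xml in the LHC@home folder inside BOINC's projects directory, then use Options > Read config files in the BOINC Manager (or restart the client). For the 1-core setup described in the thread, call `build_app_config` with `ncpus=1`.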
©2024 CERN