
never ending tasks here



Message boards : ATLAS application : never ending tasks here

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 376
Credit: 88,664,173
RAC: 174,222
Message 28955 - Posted: 22 Feb 2017, 6:04:04 UTC

I still get some never-ending tasks. Can the task be modified to fail instead, ideally fast?

HerveUAE
Joined: 18 Dec 16
Posts: 120
Credit: 6,749,027
RAC: 20,218
Message 28962 - Posted: 24 Feb 2017, 6:58:30 UTC

I have received 3 tasks here: https://lhcathome.cern.ch/lhcathome/results.php?userid=444608&offset=0&show_names=0&state=0&appid=14

My setting of max 2 CPUs was correctly taken into account, and the VM size set by the server was 3000 MB.

But this task is never ending: https://lhcathome.cern.ch/lhcathome/result.php?resultid=121226104

Should I abort it?
____________
We are the product of random evolution.

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 376
Credit: 88,664,173
RAC: 174,222
Message 28964 - Posted: 24 Feb 2017, 7:08:35 UTC

Once they go wrong they never end, so yes, you should abort.

Hopefully the team can keep working to understand this, as other projects don't have this problem.

HerveUAE
Joined: 18 Dec 16
Posts: 120
Credit: 6,749,027
RAC: 20,218
Message 28965 - Posted: 24 Feb 2017, 8:32:20 UTC

Yep, I aborted the task when I saw that it had used only 6 hours of CPU time but had not ended after more than 11 hours of running.

Billy
Joined: 14 Nov 07
Posts: 2
Credit: 58,571
RAC: 603
Message 29027 - Posted: 3 Mar 2017, 4:27:26 UTC - in response to Message 28964.

I just started here with a Mac. My first task appears to be never-ending. Over at Atlas@home I am getting good results. I noticed that the application here at LHC@home has a different title than at Atlas, and it seemed to download a VBox even though I have one already.

Chris Skull
Joined: 28 Feb 15
Posts: 6
Credit: 1,101,735
RAC: 4,445
Message 29039 - Posted: 3 Mar 2017, 10:49:01 UTC
Last modified: 3 Mar 2017, 10:54:35 UTC

I got my first 2 units here on LHC.
1. The first crashed after 34 minutes.
2. The second ran for more than 13 hours, reached 100% and now never ends. CPU usage is 1% now.
Both units ran with 8 CPU cores.
CPU usage is below 25% most of the time,
so it's not very efficient to give 8 cores to ATLAS :)

8 cores:
Run time 13 hours 56 min 18 sec
CPU time 23 hours 38 min 20 sec
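
As a rough check of the figures above, here is a minimal Python sketch that turns run time and CPU time into an average per-core utilization. The helper names `to_seconds` and `per_core_utilization` are made up for this illustration and are not part of BOINC.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s duration into seconds."""
    return hours * 3600 + minutes * 60 + seconds


def per_core_utilization(run_s: float, cpu_s: float, ncores: int) -> float:
    """Average fraction of each allocated core that was actually busy.

    CPU time is summed over all cores, so it is divided by
    wall-clock run time multiplied by the number of cores.
    """
    if run_s <= 0 or ncores <= 0:
        raise ValueError("run time and core count must be positive")
    return cpu_s / (run_s * ncores)


# Figures quoted in the post above: 8 cores,
# run time 13 h 56 min 18 s, CPU time 23 h 38 min 20 s.
run_s = to_seconds(13, 56, 18)
cpu_s = to_seconds(23, 38, 20)
print(f"per-core utilization: {per_core_utilization(run_s, cpu_s, 8):.1%}")
```

With these numbers the result comes out at about 21%, which matches the "below 25%" observation.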

peterfilla
Joined: 2 Jan 11
Posts: 22
Credit: 4,292,127
RAC: 7,418
Message 29050 - Posted: 3 Mar 2017, 14:01:08 UTC

Task : 121561738 / 58403859 over 2 days (100.00 %)

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29051 - Posted: 3 Mar 2017, 14:05:59 UTC - in response to Message 29050.
Last modified: 3 Mar 2017, 14:06:20 UTC

Task : 121561738 / 58403859 over 2 days (100.00 %)

This task may be dead.

You could work through this checklist, but note that it was written for single-core WUs, so some multi-core-specific details may not be in the list.

In short, you could post CPU time versus run time (select the WU in your BOINC client and then click Properties)
____________


Supporting BOINC, a great concept !

peterfilla
Joined: 2 Jan 11
Posts: 22
Credit: 4,292,127
RAC: 7,418
Message 29064 - Posted: 5 Mar 2017, 7:49:11 UTC

Perhaps I canceled Task 121561738 (58403859) too early ??!!

Another Task : 121561706 (58403831); Runtime: 2d19h15min; CPU: 5d16h8min -> OK

Checklist . . . (VBox with several projects running for a long time) . . .

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,809
RAC: 2,011
Message 29068 - Posted: 5 Mar 2017, 10:11:00 UTC - in response to Message 29064.
Last modified: 5 Mar 2017, 10:11:16 UTC

Perhaps I canceled Task 121561738 (58403859) too early ??!!

My last single core ATLAS-task had a run time of 12.5 hours.

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29069 - Posted: 5 Mar 2017, 10:13:21 UTC - in response to Message 29064.
Last modified: 5 Mar 2017, 10:13:44 UTC

Perhaps I canceled Task 121561738 (58403859) too early ??!!

Another Task : 121561706 (58403831); Runtime: 2d19h15min; CPU: 5d16h8min -> OK

Checklist . . . (VBox with several projects running for a long time) . . .

As long as run time and CPU time stay this proportional, I would let it run.

Are you running 2-core WUs?
____________


Supporting BOINC, a great concept !
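
The "let it run while CPU time keeps pace with run time" rule of thumb can be sketched in Python as follows. This is only an illustration: the `Sample` type, the function names and the 0.05 threshold are assumptions, and the two samples are meant to be values read by hand from the task's Properties dialog at two different moments.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a task's counters, both in seconds."""
    run_s: float  # elapsed (wall-clock) run time
    cpu_s: float  # accumulated CPU time over all cores


def busy_cores(earlier: Sample, later: Sample) -> float:
    """Average number of cores kept busy between the two readings."""
    d_run = later.run_s - earlier.run_s
    if d_run <= 0:
        raise ValueError("the later sample must have a larger run time")
    return (later.cpu_s - earlier.cpu_s) / d_run


def looks_stalled(earlier: Sample, later: Sample,
                  min_busy_cores: float = 0.05) -> bool:
    """True if almost no CPU time was added in the interval.

    The threshold is an arbitrary assumption, not a project value.
    """
    return busy_cores(earlier, later) < min_busy_cores


# Hypothetical readings taken one hour apart.
healthy = looks_stalled(Sample(36000, 60000), Sample(39600, 67000))
stuck = looks_stalled(Sample(36000, 60000), Sample(39600, 60010))
print(healthy, stuck)
```

A check like this only catches tasks that sit idle. A task that loops while still using a full core would pass it.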

peterfilla
Joined: 2 Jan 11
Posts: 22
Credit: 4,292,127
RAC: 7,418
Message 29074 - Posted: 5 Mar 2017, 17:27:37 UTC

I try to run 8-core tasks, but most of the time 2 cores are used per task, and sometimes 4 to 5 (?).

My other problem is that I can run only 2 tasks at the same time. I would like to run 4 to 5 4-core tasks (as under the old ATLAS project).

Terrible T
Joined: 1 Nov 05
Posts: 8
Credit: 461,098
RAC: 0
Message 29083 - Posted: 6 Mar 2017, 9:29:25 UTC

I also had some tasks running longer than expected (while using the computer).
It happens at around 80% completion, with processor load at nil.
When I either suspend the job or open VirtualBox, the job status in the 'Task' pane changes from 'running' to 'uploading', and the task reports as successful.
(see Task 123141235)

Is the process unable to signal 'file completed' to the VM while the computer is in use, e.g. because of too low a process priority or a similar I/O conflict?

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 139
Credit: 3,159,531
RAC: 6,484
Message 29088 - Posted: 6 Mar 2017, 13:28:28 UTC - in response to Message 29083.

I finally got one of the longrunners myself so I was able to debug it. It had been stuck for the last 2 days using zero CPU. From the log I saw that at some point it failed to allocate memory and after this the process exited but without shutting down the machine properly. I have increased the memory a little, so now the formula is

1.4GB + 1GB * ncores
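
For illustration, the formula can be written as a small Python function. This is only a sketch of the arithmetic, not the project's server code, and the names `BASE_GB`, `PER_CORE_GB` and `vm_memory_gb` are invented here.

```python
BASE_GB = 1.4      # fixed part of the formula quoted above
PER_CORE_GB = 1.0  # additional memory per core


def vm_memory_gb(ncores: int) -> float:
    """VM memory in GB for a task using ncores cores: 1.4 + 1.0 * ncores."""
    if ncores < 1:
        raise ValueError("ncores must be at least 1")
    return BASE_GB + PER_CORE_GB * ncores


for n in (1, 2, 4, 8):
    print(f"{n} core(s): {vm_memory_gb(n):.1f} GB")
```

Under this formula a 2-core task gets 3.4 GB and an 8-core task gets 9.4 GB.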

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29089 - Posted: 6 Mar 2017, 13:35:42 UTC

I have a new longrunner, a single-core WU.

It has now run for nearly 5 days; normal would be up to 1 day.

Unlike your WU, mine is still consuming CPU power (= one full core).

Is there any logfile that I could extract during runtime to see what is really going on?
____________


Supporting BOINC, a great concept !

tullio
Joined: 19 Feb 08
Posts: 449
Credit: 2,076,809
RAC: 332
Message 29091 - Posted: 6 Mar 2017, 14:58:54 UTC - in response to Message 29089.

On the Linux box I have one dual-core task that has been running for 30 hours and is now at 96%. It uses 3000 MB RAM according to VirtualBox Manager, and CPU usage is around 170%. On the Windows 10 PC, dual-core ATLAS tasks outside LHC used 4100 MB RAM according to VBox Manager. This CPU is an AMD A10-6700, which should have 4 cores, but the Windows Task Manager sees only 2 cores and 4 logical processors, so multicore ATLAS tasks run on two cores.
Tullio

PHILIPPE
Joined: 24 Jul 16
Posts: 65
Credit: 128,142
RAC: 434
Message 29092 - Posted: 6 Mar 2017, 16:36:07 UTC - in response to Message 29089.
Last modified: 6 Mar 2017, 16:37:32 UTC

Powering the host down and starting it again a bit later (not a reboot) may let the work unit close normally and grant the expected credit.
It's a way to check whether a very long WU is worth running any further.
If errors inside the VM have gone undetected, the initialization after the restart may decide whether the WU goes on or not.
If the WU goes on, give it a further chance to finish by itself.
If the WU ends, you don't waste any more of your host's time.

It works for me, but I only run one task at a time.

Terrible T
Joined: 1 Nov 05
Posts: 8
Credit: 461,098
RAC: 0
Message 29117 - Posted: 9 Mar 2017, 7:16:51 UTC

Yesterday never-ending (looping?) multicore tasks appeared, which just keep running. I have aborted 1 task, stopped 1 through VBox and updated VBox;
still an endless loop, see the VBox log. Does anybody have an idea?

00:00:50.246422 VMMDev: Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff880310b2d610), OR(0x0), NOT(0xffffffff), flags(0x0)
00:02:16.236076 VMMDev: Guest Log: Copying input files into RunAtlas.
00:02:18.692928 VMMDev: Guest Log: Copied input files into RunAtlas.
00:02:20.860967 VMMDev: Guest Log: copied the webapp to /var/www
00:02:20.950547 VMMDev: Guest Log: This vm does not need to setup http proxy
00:02:21.031455 VMMDev: Guest Log: ATHENA_PROC_NUMBER=11
00:02:21.101961 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
00:54:04.894395 VMMDev: Guest Log: Copying input files into RunAtlas.
00:54:06.537649 VMMDev: Guest Log: Copied input files into RunAtlas.
00:54:06.974127 VMMDev: Guest Log: copied the webapp to /var/www
00:54:07.030660 VMMDev: Guest Log: This vm does not need to setup http proxy
00:54:07.079244 VMMDev: Guest Log: ATHENA_PROC_NUMBER=11
00:54:07.166297 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
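
The repeated "Starting ATLAS job" line for the same PandaID is what suggests a loop in the log above. A minimal Python sketch that spots this pattern in a saved VBox log is shown below. The function names are hypothetical, and the regular expression only assumes the line format visible in this excerpt.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... Guest Log: Starting ATLAS job. (PandaID=3260989220)
START_RE = re.compile(r"Starting ATLAS job\. \(PandaID=(\d+)\)")


def job_start_counts(log_text: str) -> Counter:
    """Count how many times each PandaID was started in the log."""
    return Counter(START_RE.findall(log_text))


def restarted_jobs(log_text: str) -> list[str]:
    """PandaIDs that were started more than once, i.e. possible loops."""
    return [pid for pid, n in job_start_counts(log_text).items() if n > 1]


sample = """\
00:02:21.101961 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
00:54:04.894395 VMMDev: Guest Log: Copying input files into RunAtlas.
00:54:07.166297 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
"""
print(restarted_jobs(sample))
```

For a real log, the text could be read from the VM's log file and passed to `restarted_jobs` in the same way.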

maeax
Joined: 2 May 07
Posts: 232
Credit: 11,993,210
RAC: 14,363
Message 29118 - Posted: 9 Mar 2017, 7:27:47 UTC

The ATLAS team is investigating this problem.

David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 139
Credit: 3,159,531
RAC: 6,484
Message 29121 - Posted: 9 Mar 2017, 8:56:10 UTC

See the thread here

Short summary: the problem has been fixed but will take a few hours to propagate. If you keep the jobs running they will exit and you will get the credit.
