Message boards : ATLAS application : never ending tasks here
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,731,442
RAC: 233,853
Message 28955 - Posted: 22 Feb 2017, 6:04:04 UTC

I still get some never-ending tasks. Can the task be modified to fail instead, ideally fast?
ID: 28955 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 28962 - Posted: 24 Feb 2017, 6:58:30 UTC

I have received 3 tasks here: https://lhcathome.cern.ch/lhcathome/results.php?userid=444608&offset=0&show_names=0&state=0&appid=14

My setting of max 2 CPUs was correctly taken into account, and the VM size set by the server was 3000 MB.

But this task is never ending: https://lhcathome.cern.ch/lhcathome/result.php?resultid=121226104

Should I abort it?
We are the product of random evolution.
ID: 28962 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,731,442
RAC: 233,853
Message 28964 - Posted: 24 Feb 2017, 7:08:35 UTC

Once they go wrong they never end, so yes, you should abort.

Hopefully they can keep working to understand this, as other projects don't have this problem.
ID: 28964 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 28965 - Posted: 24 Feb 2017, 8:32:20 UTC

Yep, I aborted the task when I saw that it had used only 6 hours of CPU time but had still not ended after more than 11 hours of running.
ID: 28965 · Report as offensive     Reply Quote
Billy

Joined: 14 Nov 07
Posts: 3
Credit: 433,980
RAC: 1,210
Message 29027 - Posted: 3 Mar 2017, 4:27:26 UTC - in response to Message 28964.  

I just started here with a Mac. My first task appears to be never-ending. Over at Atlas@home I am getting good results. I noticed that the application here at LHC@home has a different title from the one at Atlas, and it seemed to download a VBox even though I already have one.
ID: 29027 · Report as offensive     Reply Quote
Chris Skull

Joined: 28 Feb 15
Posts: 6
Credit: 1,261,955
RAC: 0
Message 29039 - Posted: 3 Mar 2017, 10:49:01 UTC
Last modified: 3 Mar 2017, 10:54:35 UTC

I got my first 2 units here on LHC.
Unit 1 crashed after 34 minutes.
Unit 2 ran for more than 13 hours up to 100% and now never ends. CPU usage is 1% now.
Both units ran with 8 CPU cores.
CPU usage was below 25% most of the time,
so it's not very efficient to give 8 cores to ATLAS :)

8 cores:
Run time 13 hours 56 min 18 sec
CPU time 23 hours 38 min 20 sec
ID: 29039 · Report as offensive     Reply Quote
peterfilla

Joined: 2 Jan 11
Posts: 23
Credit: 5,986,899
RAC: 0
Message 29050 - Posted: 3 Mar 2017, 14:01:08 UTC

Task : 121561738 / 58403859 over 2 days (100.00 %)
ID: 29050 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 29051 - Posted: 3 Mar 2017, 14:05:59 UTC - in response to Message 29050.  
Last modified: 3 Mar 2017, 14:06:20 UTC

Task : 121561738 / 58403859 over 2 days (100.00 %)

This task may be dead.

You could work through this checklist, but note that it was written for single-core WUs, so some multi-core-specific details may not be on the list.

In short, you could post CPU time versus run time (select the WU in your BOINC client and then click Properties).


Supporting BOINC, a great concept !
ID: 29051 · Report as offensive     Reply Quote
peterfilla

Joined: 2 Jan 11
Posts: 23
Credit: 5,986,899
RAC: 0
Message 29064 - Posted: 5 Mar 2017, 7:49:11 UTC

Perhaps I canceled Task 121561738 (58403859) too early ??!!

Another Task : 121561706 (58403831); Runtime: 2d19h15min; CPU: 5d16h8min -> OK

Checklist . . . (VBox with several projects running for a long time) . . .
ID: 29064 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 29068 - Posted: 5 Mar 2017, 10:11:00 UTC - in response to Message 29064.  
Last modified: 5 Mar 2017, 10:11:16 UTC

Perhaps I canceled Task 121561738 (58403859) too early ??!!

My last single core ATLAS-task had a run time of 12.5 hours.
ID: 29068 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 29069 - Posted: 5 Mar 2017, 10:13:21 UTC - in response to Message 29064.  
Last modified: 5 Mar 2017, 10:13:44 UTC

Perhaps I canceled Task 121561738 (58403859) too early ??!!

Another Task : 121561706 (58403831); Runtime: 2d19h15min; CPU: 5d16h8min -> OK

Checklist . . . (VBox with several projects running for a long time) . . .

As long as run time and CPU time stay proportional like this, I would let it run.

Are you running 2-Core-WUs ?
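
The comparison of CPU time and run time suggested here can be worked out with a few lines of Python. This is only a sketch: the function names `parse_duration_hours` and `average_busy_cores` are made up for the example, and the compact `2d19h15min` format is assumed from the figures quoted above.

```python
import re


def parse_duration_hours(text: str) -> float:
    """Parse a compact duration such as '2d19h15min' into hours."""
    match = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)min)?", text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return days * 24 + hours + minutes / 60


def average_busy_cores(cpu_time: str, run_time: str) -> float:
    """Return CPU time divided by run time (average number of busy cores)."""
    run_hours = parse_duration_hours(run_time)
    if run_hours == 0:
        raise ValueError("run time must be greater than zero")
    return parse_duration_hours(cpu_time) / run_hours


# Figures quoted above for task 121561706.
ratio = average_busy_cores("5d16h8min", "2d19h15min")
print(f"CPU time / run time = {ratio:.2f}")
```

For the quoted task the ratio comes out at roughly 2, meaning about two cores were busy on average over the whole run. A ratio that keeps falling while run time grows would be the sign of a task that has stopped doing work.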


Supporting BOINC, a great concept !
ID: 29069 · Report as offensive     Reply Quote
peterfilla

Joined: 2 Jan 11
Posts: 23
Credit: 5,986,899
RAC: 0
Message 29074 - Posted: 5 Mar 2017, 17:27:37 UTC

I try to run 8-core tasks, but most of the time only 2 cores are used per task, and sometimes 4 to 5 (?).

My other problem is that I can run only 2 tasks at the same time. I would like to run 4 to 5 4-core tasks (as under the old ATLAS project).
ID: 29074 · Report as offensive     Reply Quote
Terrible T

Joined: 1 Nov 05
Posts: 8
Credit: 597,196
RAC: 0
Message 29083 - Posted: 6 Mar 2017, 9:29:25 UTC

I have also had some tasks running longer than expected (while I was using the computer).
This happens at around 80% completion, with processor load at nil.
When I either suspend the job or open VM VirtualBox, the job status in the 'Task' pane changes from 'running' to 'uploading', and the task reports as successful.
(see Task 123141235)

Is the process unable to signal 'file completed' to the VM while the computer is in use, e.g. because of too low a process priority or a similar I/O conflict?
ID: 29083 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29088 - Posted: 6 Mar 2017, 13:28:28 UTC - in response to Message 29083.  

I finally got one of the longrunners myself so I was able to debug it. It had been stuck for the last 2 days using zero CPU. From the log I saw that at some point it failed to allocate memory and after this the process exited but without shutting down the machine properly. I have increased the memory a little, so now the formula is

1.4GB + 1GB * ncores
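
As a rough illustration of the formula above, here is a minimal Python sketch. The function name `atlas_vm_memory_gb` is made up for this example, and it assumes `ncores` is the number of CPU cores assigned to the task.

```python
def atlas_vm_memory_gb(ncores: int) -> float:
    """Memory per the formula quoted above: 1.4 GB + 1 GB * ncores."""
    if ncores < 1:
        raise ValueError("ncores must be at least 1")
    return 1.4 + 1.0 * ncores


# Print the resulting memory size for a few core counts.
for n in (1, 2, 4, 8):
    print(f"{n} core(s): {atlas_vm_memory_gb(n):.1f} GB")
```

With this formula a single-core task gets 2.4 GB and an 8-core task gets 9.4 GB.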
ID: 29088 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 29089 - Posted: 6 Mar 2017, 13:35:42 UTC

I have a new longrunner, a single-core WU.

It has now run for nearly 5 days; normal would be something up to 1 day.

Unlike your WU, mine is still consuming CPU power (= one full core).

Is there any logfile that I could extract during runtime to see what is really going on?


Supporting BOINC, a great concept !
ID: 29089 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 29091 - Posted: 6 Mar 2017, 14:58:54 UTC - in response to Message 29089.  

On the Linux box I have one dual-core task that has been running for 30 hours and is now at 96%. It uses 3000 MB RAM according to VirtualBox Manager, and CPU usage is around 170%. On the Windows 10 PC, dual-core Atlas tasks outside LHC used 4100 MB RAM according to VBox Manager. That PC has an AMD A10-6700 CPU, which should have 4 cores, but the Windows Task Manager sees only 2 cores and 4 logical processors, so multicore Atlas tasks run on two cores.
Tullio
ID: 29091 · Report as offensive     Reply Quote
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 29092 - Posted: 6 Mar 2017, 16:36:07 UTC - in response to Message 29089.  
Last modified: 6 Mar 2017, 16:37:32 UTC

Powering down the host and restarting it a bit later (not a reboot) may let the work unit close normally and grant the expected credit.
It's a way to check whether a very long WU is worth running any further.
If errors inside the VM went undetected, the restart re-initializes the task and may settle whether it goes on or not.
If the WU goes on, give it a further chance to finish by itself.
If the WU ends, then you haven't wasted your host's time.

It works for me, as I only run one task at a time.
ID: 29092 · Report as offensive     Reply Quote
Terrible T

Joined: 1 Nov 05
Posts: 8
Credit: 597,196
RAC: 0
Message 29117 - Posted: 9 Mar 2017, 7:16:51 UTC

Yesterday never-ending (looping?) multicore tasks appeared, which just keep running. I have aborted 1 task, stopped 1 through VBox, and updated VBox;
it is still an endless loop, see the VBox log. Does anybody have an idea?

00:00:50.246422 VMMDev: Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff880310b2d610), OR(0x0), NOT(0xffffffff), flags(0x0)
00:02:16.236076 VMMDev: Guest Log: Copying input files into RunAtlas.
00:02:18.692928 VMMDev: Guest Log: Copied input files into RunAtlas.
00:02:20.860967 VMMDev: Guest Log: copied the webapp to /var/www
00:02:20.950547 VMMDev: Guest Log: This vm does not need to setup http proxy
00:02:21.031455 VMMDev: Guest Log: ATHENA_PROC_NUMBER=11
00:02:21.101961 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
00:54:04.894395 VMMDev: Guest Log: Copying input files into RunAtlas.
00:54:06.537649 VMMDev: Guest Log: Copied input files into RunAtlas.
00:54:06.974127 VMMDev: Guest Log: copied the webapp to /var/www
00:54:07.030660 VMMDev: Guest Log: This vm does not need to setup http proxy
00:54:07.079244 VMMDev: Guest Log: ATHENA_PROC_NUMBER=11
00:54:07.166297 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
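
A log like this one can be checked for repeated job starts with a short Python sketch. The helper name `count_job_starts` is hypothetical, and treating a repeated start of the same PandaID as a sign of a restart loop is an assumption based on the excerpt above, not a confirmed diagnosis.

```python
import re
from collections import Counter

# Matches the "Starting ATLAS job" line as it appears in the log excerpt above.
START_RE = re.compile(r"Starting ATLAS job\. \(PandaID=(\d+)\)")


def count_job_starts(log_text: str) -> Counter:
    """Count how many times each PandaID is started in a VBox guest log."""
    return Counter(START_RE.findall(log_text))


sample = """\
00:02:21.101961 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
00:54:07.166297 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
"""

starts = count_job_starts(sample)
# Keep only the jobs that were started more than once.
restarted = {panda_id: n for panda_id, n in starts.items() if n > 1}
print(restarted)
```

In the excerpt, the same PandaID is started at about 2 minutes and again at about 54 minutes of VM uptime, which is what the sketch flags.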
ID: 29117 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,100,795
RAC: 103,685
Message 29118 - Posted: 9 Mar 2017, 7:27:47 UTC

The ATLAS team is investigating this problem.
ID: 29118 · Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29121 - Posted: 9 Mar 2017, 8:56:10 UTC

See the thread here

Short summary: the problem has been fixed but will take a few hours to propagate. If you keep the jobs running they will exit and you will get the credit.
ID: 29121 · Report as offensive     Reply Quote


©2024 CERN