Thread 'never ending tasks here'

Author	Message
Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 922 Credit: 779,617,330 RAC: 133,896	Message 28955 - Posted: 22 Feb 2017, 6:04:04 UTC I still get some never-ending tasks. Can the task be modified to fail, ideally fast? ID: 28955 · Reply Quote

HerveUAE Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 28962 - Posted: 24 Feb 2017, 6:58:30 UTC I have received 3 tasks here: https://lhcathome.cern.ch/lhcathome/results.php?userid=444608&offset=0&show_names=0&state=0&appid=14 My setting at max 2 CPUs was correctly taken into account, and the VM size set by the server was 3000 Mbytes. But this task is never ending: https://lhcathome.cern.ch/lhcathome/result.php?resultid=121226104 Should I abort it? We are the product of random evolution. ID: 28962 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 922 Credit: 779,617,330 RAC: 133,896	Message 28964 - Posted: 24 Feb 2017, 7:08:35 UTC Once they go wrong they never end, so yes, you should abort. Hopefully they can continue to understand this, as other projects don't have this problem. ID: 28964 · Reply Quote

HerveUAE Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 28965 - Posted: 24 Feb 2017, 8:32:20 UTC Yep, I aborted the task when I saw that it had used only 6 hours of CPU time but had not ended after more than 11 hours of running. ID: 28965 · Reply Quote

Billy Send message Joined: 14 Nov 07 Posts: 3 Credit: 472,959 RAC: 0	Message 29027 - Posted: 3 Mar 2017, 4:27:26 UTC - in response to Message 28964. I just started here with a Mac. My first task appears to be never-ending. Over at Atlas@home I am getting good results. I noticed that the application here at LHC@home has a different title than the one at Atlas@home, and it seemed to download a VBox even though I already have one. ID: 29027 · Reply Quote

Chris Skull Send message Joined: 28 Feb 15 Posts: 6 Credit: 1,261,955 RAC: 0	Message 29039 - Posted: 3 Mar 2017, 10:49:01 UTC Last modified: 3 Mar 2017, 10:54:35 UTC I got my first 2 units here on LHC. The first crashed after 34 minutes. The second ran for more than 13 hours up to 100% and now never ends; its CPU usage is 1% now. Both units run with 8 CPU cores. CPU usage is below 25% most of the time, so it's not very efficient to spend 8 cores on ATLAS :) 8 cores: Run time 13 hours 56 min 18 sec, CPU time 23 hours 38 min 20 sec ID: 29039 · Reply Quote

peterfilla Send message Joined: 2 Jan 11 Posts: 23 Credit: 5,986,899 RAC: 0	Message 29050 - Posted: 3 Mar 2017, 14:01:08 UTC Task : 121561738 / 58403859 over 2 days (100.00 %) ID: 29050 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,607,540 RAC: 21,854	Message 29051 - Posted: 3 Mar 2017, 14:05:59 UTC - in response to Message 29050. Last modified: 3 Mar 2017, 14:06:20 UTC "Task : 121561738 / 58403859 over 2 days (100.00 %)" This task may be dead. You could work through this checklist, but note that it was written for single-core WUs, so some multi-core-specific details may not be in the list. In short, you could post CPU time versus run time (mark the WU in your BOINC client and then click Properties). Supporting BOINC, a great concept ! ID: 29051 · Reply Quote

peterfilla Send message Joined: 2 Jan 11 Posts: 23 Credit: 5,986,899 RAC: 0	Message 29064 - Posted: 5 Mar 2017, 7:49:11 UTC Perhaps I canceled Task 121561738 (58403859) too early ??!! Another Task : 121561706 (58403831); Runtime: 2d19h15min; CPU: 5d16h8min -> OK Checklist . . . (VBox with several projects running for a long time) . . . ID: 29064 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1540 Credit: 10,049,638 RAC: 1,414	Message 29068 - Posted: 5 Mar 2017, 10:11:00 UTC - in response to Message 29064. Last modified: 5 Mar 2017, 10:11:16 UTC "Perhaps I canceled Task 121561738 (58403859) too early ??!!" My last single core ATLAS-task had a run time of 12.5 hours. ID: 29068 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,607,540 RAC: 21,854	Message 29069 - Posted: 5 Mar 2017, 10:13:21 UTC - in response to Message 29064. Last modified: 5 Mar 2017, 10:13:44 UTC "Perhaps I canceled Task 121561738 (58403859) too early ??!! Another Task : 121561706 (58403831); Runtime: 2d19h15min; CPU: 5d16h8min -> OK Checklist . . . (VBox with several projects running for a long time) . . ." As long as run time and CPU time are this proportional, I would let it run. Are you running 2-core WUs? Supporting BOINC, a great concept ! ID: 29069 · Reply Quote
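
The CPU-time versus run-time comparison discussed above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb from the posts, not a BOINC or ATLAS tool: the function name `average_busy_cores` is hypothetical, and the figures are the ones quoted in this thread (Messages 29064 and 29039). A result well above zero suggests the task is still computing; a value that drifts toward zero as run time grows suggests a stuck task.

```python
from datetime import timedelta


def average_busy_cores(cpu_time: timedelta, run_time: timedelta) -> float:
    """Return CPU time divided by run time.

    The result approximates how many cores were kept busy on average
    over the life of the task (hypothetical helper, for illustration).
    """
    if run_time <= timedelta(0):
        raise ValueError("run_time must be positive")
    # Dividing one timedelta by another yields a plain float.
    return cpu_time / run_time


# Figures quoted in Message 29064: Runtime 2d19h15min, CPU 5d16h8min.
peterfilla_ratio = average_busy_cores(
    cpu_time=timedelta(days=5, hours=16, minutes=8),
    run_time=timedelta(days=2, hours=19, minutes=15),
)

# Figures quoted in Message 29039 (8-core task):
# Run time 13 h 56 min 18 s, CPU time 23 h 38 min 20 s.
chris_ratio = average_busy_cores(
    cpu_time=timedelta(hours=23, minutes=38, seconds=20),
    run_time=timedelta(hours=13, minutes=56, seconds=18),
)

if __name__ == "__main__":
    print(f"Message 29064: about {peterfilla_ratio:.2f} cores busy on average")
    print(f"Message 29039: about {chris_ratio:.2f} cores busy on average, "
          f"{chris_ratio / 8:.0%} of the 8 assigned cores")
```

For the task in Message 29064 the ratio comes out at roughly two cores, which matches the "let it run" advice; for the 8-core task in Message 29039 it is under two cores, consistent with the reported usage below 25%.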

peterfilla Send message Joined: 2 Jan 11 Posts: 23 Credit: 5,986,899 RAC: 0	Message 29074 - Posted: 5 Mar 2017, 17:27:37 UTC I try to run 8-core tasks, but most of the time only 2 cores are used per task, and sometimes 4 to 5 (?). My other problem is that I can run only 2 tasks at the same time. I would like to run 4 to 5 4-core tasks (as under the old ATLAS project). ID: 29074 · Reply Quote

Terrible T Send message Joined: 1 Nov 05 Posts: 8 Credit: 597,196 RAC: 0	Message 29083 - Posted: 6 Mar 2017, 9:29:25 UTC I also had some tasks running longer than expected while I was using the computer. It happens at around 80% completion, with processor load nil. When I either suspend the job or open the VM in VirtualBox, the job status in the 'Task' pane changes from 'running' to 'uploading', and the task reports as successful (see Task 123141235). Is the process not able to signal 'file completed' to the VM while the computer is in use, e.g. because of too low a process priority or a similar I/O conflict? ID: 29083 · Reply Quote

David Cameron Project administrator Project developer Project scientist Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 29088 - Posted: 6 Mar 2017, 13:28:28 UTC - in response to Message 29083. I finally got one of the longrunners myself so I was able to debug it. It had been stuck for the last 2 days using zero CPU. From the log I saw that at some point it failed to allocate memory and after this the process exited but without shutting down the machine properly. I have increased the memory a little, so now the formula is 1.4GB + 1GB * ncores ID: 29088 · Reply Quote
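
As a small worked example of the memory formula quoted in the post above (1.4GB + 1GB * ncores), here is a minimal Python sketch. The function name `atlas_vm_memory_gb` is hypothetical and this is not the project's server code; it only evaluates the formula as stated.

```python
def atlas_vm_memory_gb(ncores: int) -> float:
    """Evaluate the formula from the post: 1.4 GB + 1 GB per core.

    Hypothetical helper for illustration only.
    """
    if ncores < 1:
        raise ValueError("ncores must be at least 1")
    return 1.4 + 1.0 * ncores


if __name__ == "__main__":
    # Show the VM size the formula gives for a few core counts.
    for n in (1, 2, 4, 8):
        print(f"{n} core(s): {atlas_vm_memory_gb(n):.1f} GB")
```

A volunteer could use this to check whether a host has enough free RAM for the number of cores it offers to a task.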

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,607,540 RAC: 21,854	Message 29089 - Posted: 6 Mar 2017, 13:35:42 UTC I have a new longrunner, a single-core WU. It has now run for nearly 5 days; normal would be something up to 1 day. Unlike your WU, mine is still consuming CPU power (= one full core). Is there any logfile that I could extract during runtime to see what is really going on? Supporting BOINC, a great concept ! ID: 29089 · Reply Quote

tullio Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 29091 - Posted: 6 Mar 2017, 14:58:54 UTC - in response to Message 29089. On the Linux box I have one dual-core task that has been running for 30 hours and is now at 96%. It uses 3000 MB RAM according to VirtualBox Manager, and CPU usage is around 170%. On the Windows 10 PC, dual-core Atlas tasks outside LHC used 4100 MB RAM according to VirtualBox Manager. This CPU is an AMD A10-6700, which should have 4 cores, but the Windows Task Manager sees only 2 cores and 4 logical processors, so multicore Atlas tasks run on two cores. Tullio ID: 29091 · Reply Quote

PHILIPPE Send message Joined: 24 Jul 16 Posts: 88 Credit: 239,917 RAC: 0	Message 29092 - Posted: 6 Mar 2017, 16:36:07 UTC - in response to Message 29089. Last modified: 6 Mar 2017, 16:37:32 UTC Powering the host down and restarting it a bit later (not a reboot) may let the work unit close normally and grant the expected credits. It's a way to check whether a very long WU is worth running any further. If errors inside the VM were not detected, the initialization after the restart may decide whether the task goes on or not. If the WU goes on, give it a further chance to finish by itself. If the WU ends, then you don't waste your host's time. It works for me, as I run only one task at a time. ID: 29092 · Reply Quote

Terrible T Send message Joined: 1 Nov 05 Posts: 8 Credit: 597,196 RAC: 0	Message 29117 - Posted: 9 Mar 2017, 7:16:51 UTC Yesterday never-ending (looping?) multicore tasks appeared, which just keep running. I aborted 1 task, stopped 1 through VBox, and updated VBox, but the loop is still endless; see the VBox log. Does anybody have an idea?
00:00:50.246422 VMMDev: Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff880310b2d610), OR(0x0), NOT(0xffffffff), flags(0x0)
00:02:16.236076 VMMDev: Guest Log: Copying input files into RunAtlas.
00:02:18.692928 VMMDev: Guest Log: Copied input files into RunAtlas.
00:02:20.860967 VMMDev: Guest Log: copied the webapp to /var/www
00:02:20.950547 VMMDev: Guest Log: This vm does not need to setup http proxy
00:02:21.031455 VMMDev: Guest Log: ATHENA_PROC_NUMBER=11
00:02:21.101961 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
00:54:04.894395 VMMDev: Guest Log: Copying input files into RunAtlas.
00:54:06.537649 VMMDev: Guest Log: Copied input files into RunAtlas.
00:54:06.974127 VMMDev: Guest Log: copied the webapp to /var/www
00:54:07.030660 VMMDev: Guest Log: This vm does not need to setup http proxy
00:54:07.079244 VMMDev: Guest Log: ATHENA_PROC_NUMBER=11
00:54:07.166297 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
ID: 29117 · Reply Quote
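
The log excerpt above shows the same PandaID being started twice. A minimal Python sketch for spotting that pattern in a VBox log follows. It assumes the "Starting ATLAS job. (PandaID=...)" line format exactly as it appears in the post; the function names are hypothetical, and reading a repeated start as a restart loop is an interpretation, not something the log itself confirms.

```python
import re
from collections import defaultdict

# Matches the timestamp and PandaID of each job-start line, in the
# format seen in the excerpt posted in this thread.
JOB_START = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d+) VMMDev: Guest Log: "
    r"Starting ATLAS job\. \(PandaID=(\d+)\)"
)

# The two job-start lines and one neighbour, copied from the post.
SAMPLE_LOG = """\
00:02:21.031455 VMMDev: Guest Log: ATHENA_PROC_NUMBER=11
00:02:21.101961 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
00:54:07.079244 VMMDev: Guest Log: ATHENA_PROC_NUMBER=11
00:54:07.166297 VMMDev: Guest Log: Starting ATLAS job. (PandaID=3260989220)
"""


def find_job_starts(log_text: str) -> dict[str, list[str]]:
    """Map each PandaID to the timestamps at which it was started."""
    starts: dict[str, list[str]] = defaultdict(list)
    for match in JOB_START.finditer(log_text):
        timestamp, panda_id = match.groups()
        starts[panda_id].append(timestamp)
    return dict(starts)


def restarted_jobs(log_text: str) -> dict[str, list[str]]:
    """Keep only the PandaIDs that were started more than once."""
    return {
        panda_id: stamps
        for panda_id, stamps in find_job_starts(log_text).items()
        if len(stamps) > 1
    }


if __name__ == "__main__":
    for panda_id, stamps in restarted_jobs(SAMPLE_LOG).items():
        print(f"PandaID {panda_id} started {len(stamps)} times: {stamps}")
```

Because the pattern is searched across the whole text rather than line by line, it also works on a log that was pasted as one long line.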

maeax Send message Joined: 2 May 07 Posts: 2289 Credit: 178,856,613 RAC: 2,046	Message 29118 - Posted: 9 Mar 2017, 7:27:47 UTC The ATLAS team is investigating this problem. ID: 29118 · Reply Quote

David Cameron Project administrator Project developer Project scientist Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 29121 - Posted: 9 Mar 2017, 8:56:10 UTC See the thread here. Short summary: the problem has been fixed, but the fix will take a few hours to propagate. If you keep the jobs running they will exit and you will get the credit. ID: 29121 · Reply Quote