Message boards : ATLAS application : Very long tasks in the queue

HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29470 - Posted: 21 Mar 2017, 3:51:22 UTC

I found a WU that failed once with EXIT_DISK_LIMIT_EXCEEDED and then executed OK on one of my machines: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60711170
On the first machine it ran with 6 cores and 7600 MB memory and failed with a peak disk usage of 6,579.56 MB. Then on my machine it ran with 2 cores and 4400 MB memory and finished OK with a peak disk usage of 3,394.09 MB.
What could explain a different disk usage for the same WU?
We are the product of random evolution.
ID: 29470
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 29471 - Posted: 21 Mar 2017, 7:03:44 UTC - in response to Message 29466.  

I have 4 that failed with EXIT_DISK_LIMIT_EXCEEDED 5-7GB

It's a bit irritating, though, that the log shows a HITS file was produced and the task still failed.

The disk space one ATLAS task may occupy in its working slot directory is limited to 6,000,000,000 bytes.

Suspending a running task with LAIM off, or shutting down the BOINC client, creates a snapshot in the slot directory from which the VM is restored when the task resumes.
I ran a test to see how big that snapshot can get: 1,937,670,144 bytes.
Together with the VM's vdi file and all other files, the slot contained 5,545,902,080 bytes.
So I think the project has to increase the rsc_disk_bound.
For my 4 running tasks I increased that value manually to 10,000,000,000 bytes to avoid the disk limit being exceeded and crashing the tasks.
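A quick way to keep an eye on this is to sum the size of a slot directory and compare it against the 6,000,000,000-byte bound. A minimal Python sketch; the slot path is only a placeholder and should be adjusted to your own BOINC data directory:

# Minimal sketch: sum the disk usage of one BOINC slot directory and
# compare it with the default ATLAS rsc_disk_bound of 6,000,000,000 bytes.
# The slot path below is a placeholder - adjust it to your installation.
import os

SLOT = "/var/lib/boinc-client/slots/0"   # placeholder path
DISK_BOUND = 6_000_000_000               # bytes, default rsc_disk_bound

total = 0
for root, dirs, files in os.walk(SLOT):
    for name in files:
        try:
            total += os.path.getsize(os.path.join(root, name))
        except OSError:
            pass  # a file may disappear while the VM is running

print(f"slot usage: {total:,} bytes ({total / DISK_BOUND:.0%} of the bound)")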
ID: 29471
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 29472 - Posted: 21 Mar 2017, 7:36:19 UTC - in response to Message 29457.  

My resend task started as a single core VM and I'll try to restart it with 4 cores.

How do you do that? I mean restarting a single core task with a different number of cores?

It's only useful when the task has not been running for long, because work already done will be lost.
- Suspend the task in BOINC with "Leave applications in memory" (LAIM) set to off.
- The VM will save its state. Discard the saved state with VirtualBox Manager.
- Change the settings with VirtualBox Manager. In my example, CPUs to 4 and memory to 4400 MB or more.
- Start the VM with VirtualBox Manager and let it run for about 10 minutes.
- Shut down the VM while saving its state.
- Resume the task with BOINC Manager.
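The same steps can also be scripted with VBoxManage instead of the GUI. A rough sketch in Python, assuming the task is already suspended in BOINC (LAIM off) and the VM is in the saved state; the VM name is a placeholder, take the real one from "VBoxManage list vms":

# Rough sketch of the steps above using VBoxManage.
import subprocess, time

VM_NAME = "boinc_example"   # placeholder, see "VBoxManage list vms"

def vbox(*args):
    subprocess.run(["VBoxManage", *args], check=True)

vbox("discardstate", VM_NAME)                    # discard the saved state
vbox("modifyvm", VM_NAME, "--cpus", "4",
     "--memory", "4400")                         # 4 cores, 4400 MB RAM
vbox("startvm", VM_NAME, "--type", "headless")   # let it run for a while
time.sleep(10 * 60)
vbox("controlvm", VM_NAME, "savestate")          # shut down, saving the state
# then resume the task with BOINC Manager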
ID: 29472
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29475 - Posted: 21 Mar 2017, 9:05:12 UTC - in response to Message 29471.  

So I think the project has to increase the rsc_disk_bound.


Thanks, we'll bear that in mind when we have the dedicated longrunners app. I don't want to increase the limit in general for ATLAS because it can limit how many tasks people can run, and 6 GB seems OK for the "shortrunners".
ID: 29475
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,339,479
RAC: 101,863
Message 29491 - Posted: 21 Mar 2017, 14:25:05 UTC - in response to Message 29276.  

To know if you are running one of these tasks and that it's not a regular "longrunner" you can check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID: taskID=10959636)" then you got one. The regular tasks have taskID=10947180.

I now have tasks with taskID=10995520.

This number is not shown in what you are saying above. What type of task is this now?
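A minimal way to run the check described in the quote above is to scan each slot's stderr.txt for the "Starting ATLAS job" line and print the taskID. A small Python sketch; the slots path is a placeholder for your BOINC data directory:

# Sketch: print the taskID found in each slot's stderr.txt.
import glob, re

for path in glob.glob("/var/lib/boinc-client/slots/*/stderr.txt"):
    with open(path, errors="ignore") as f:
        for line in f:
            if "Starting ATLAS job" in line:
                m = re.search(r"taskID=(\d+)", line)
                if m:
                    print(path, "->", m.group(1))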
ID: 29491
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 29492 - Posted: 21 Mar 2017, 14:32:40 UTC - in response to Message 29491.  

To know if you are running one of these tasks and that it's not a regular "longrunner" you can check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID: taskID=10959636)" then you got one. The regular tasks have taskID=10947180.

I now have tasks with taskID=10995520.

This number is not shown in what you are saying above. What type of task is this now?

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4170&postid=29477#29477


Supporting BOINC, a great concept !
ID: 29492
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 798
Credit: 644,684,135
RAC: 235,456
Message 29497 - Posted: 21 Mar 2017, 16:58:46 UTC

Some more ran successfully today, so it seems they're OK as long as I don't close BOINC ;)
ID: 29497
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29504 - Posted: 21 Mar 2017, 19:15:27 UTC
Last modified: 21 Mar 2017, 19:15:56 UTC

It's only useful when the task has not been running for long, because work already done will be lost.
- Suspend the task in BOINC with "Leave applications in memory" (LAIM) set to off.
- The VM will save its state. Discard the saved state with VirtualBox Manager.
- Change the settings with VirtualBox Manager. In my example, CPUs to 4 and memory to 4400 MB or more.
- Start the VM with VirtualBox Manager and let it run for about 10 minutes.
- Shut down the VM while saving its state.
- Resume the task with BOINC Manager.
Many thanks Crystal.

So I think the project has to increase the rsc_disk_bound.
I have had the same problem of the disk limit being exceeded on one longrunner. How can I change that parameter locally?
We are the product of random evolution.
ID: 29504
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 29509 - Posted: 21 Mar 2017, 20:53:05 UTC - in response to Message 29504.  

So I think the project has to increase the rsc_disk_bound.
I have had the same problem of the disk limit being exceeded on one longrunner. How can I change that parameter locally?

You probably refer to "For my 4 running tasks I increased that value manually to 10,000,000,000 bytes to avoid the disk limit being exceeded and crashing the tasks.".
This is a bit tricky. First you have to shut down BOINC.

1st obstacle: When you have several VM tasks running, they all have to save their states to disk within 60 seconds, otherwise a VM may not be saved properly and will end up in the stopped state.
That VM will not resume properly. In the best case it will start from scratch.
To avoid this obstacle: before stopping BOINC, suspend the tasks (LAIM off) one after another, so each VM gets time to save its state to disk.

2nd obstacle: When saving the VM, the working slot directory could already exceed the disk bound :(

3rd obstacle: You have to replace (with a basic editor, e.g. Notepad) the <rsc_disk_bound>6000000000.000000</rsc_disk_bound> of the workunits in client_state.xml with a higher value.
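A sketch of that 3rd step in Python, to be run only while BOINC is completely shut down. The path is a placeholder, and note that the substitution raises the bound for every workunit that still has the 6,000,000,000-byte value:

# Sketch: raise <rsc_disk_bound> in client_state.xml (BOINC must be stopped).
import re, shutil

PATH = "/var/lib/boinc-client/client_state.xml"   # placeholder location
NEW_BOUND = "10000000000.000000"

shutil.copy(PATH, PATH + ".bak")                  # keep a backup first
with open(PATH) as f:
    text = f.read()

text = re.sub(r"<rsc_disk_bound>6000000000\.000000</rsc_disk_bound>",
              f"<rsc_disk_bound>{NEW_BOUND}</rsc_disk_bound>", text)

with open(PATH, "w") as f:
    f.write(text)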
ID: 29509
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29510 - Posted: 21 Mar 2017, 21:08:50 UTC

3rd obstacle: You have to replace (with a basic editor, e.g. Notepad) the <rsc_disk_bound>6000000000.000000</rsc_disk_bound> of the workunits in client_state.xml with a higher value.
OK, I did that. Let's see how it goes with the one longrunner I still have :).
We are the product of random evolution.
ID: 29510
Darrell
Joined: 8 Jul 08
Posts: 20
Credit: 25,935,687
RAC: 17,897
Message 29542 - Posted: 22 Mar 2017, 23:39:54 UTC - in response to Message 29298.  

@ Yeti:

I think NOT DONE! I am running my first 16 core Atlas with an estimated runtime of 1hr 20min, and it is already at 13hr 47min.
ID: 29542
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 29546 - Posted: 23 Mar 2017, 6:38:00 UTC - in response to Message 29542.  

Darrell wrote:
@ Yeti:

I think NOT DONE!


What is not done ?

Darrell wrote:
I am running my first 16 core Atlas with an estimated runtime of 1hr 20min, and it is already at 13hr 47min.

1) Did you set 16 cores with app_config.xml?
2) 16-core will be very inefficient
3) I would guess that the WU is already dead, but you could check it with this: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4170&postid=29473#29473


Supporting BOINC, a great concept !
ID: 29546
Dave Peachey
Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 29683 - Posted: 27 Mar 2017, 19:55:45 UTC
Last modified: 27 Mar 2017, 20:06:24 UTC

Morning/afternoon/evening all,

Has anyone encountered, or had experience with, any of the WUs with TaskID=11016767? Looking at the reworked version of the old Atlas@Home home page (http://lhcathome.web.cern.ch/projects/atlas - which is a very useful page!), these are designated as
Task mc16_13TeV DP2500_3000.simul (11016767) with 0/643 in progress - although I expect that's a tad out of date.

I seem to have acquired four of them and the BOINC estimated completion time has hit the roof at 13hr28m. My normal run time (running in 4-core multi-thread mode) is on the order of 1hr20m to 1hr30m for the tasks with ID 10995522/25/28, which BOINC has now come to recognise along with the regular long-running 10995515/17 tasks (clocking in at between 4hr and 4hr30m). However, as these are new (at least to my BOINC installation), it could just be that BOINC doesn't know what to make of them (yet).

Does anyone know if these are a variation on the fabled 1000-event WUs (TaskID=10959636) or just seriously long runners of a different sort?

I suspect that if I knew how to correctly interpret the above task string I'd be able to answer my own question. I'd venture that "13TeV" is the energy in tera-electron volts, but I don't understand the "DP2500_3000" part. Perhaps this is something on which David C could enlighten us in his "Information on ATLAS tasks" sticky thread?

Cheers
Dave
ID: 29683
Dave Peachey
Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 29684 - Posted: 27 Mar 2017, 21:47:59 UTC - in response to Message 29683.  

Has anyone encountered, or had experience with, any of the WUs with TaskID=11016767?
<snip>
I seem to have acquired four of them and the BOINC estimated completion time has hit the roof at 13hr28m.

So I bumped one of them to the front of the queue ... and it bombed after only 10mins run time with a crazy looking output log https://lhcathome.cern.ch/lhcathome/result.php?resultid=129069334.

Either someone was having fun when they coded it or else my machine has a nasty stutter and has severely mangled the content of that file! Anyway, back to the "regular" ones for now and I'll get around to the remaining long ones some time tomorrow.
ID: 29684
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,339,479
RAC: 101,863
Message 29685 - Posted: 28 Mar 2017, 5:47:00 UTC

In contrast to what David told us some time ago (Task ID for the "longrunners" = 10959636), they now seem to have Task ID 10995522.

At least this is shown with 4 tasks which one of my boxes received last night.
Each task (2-core) has run for about 4 hours so far, with 46 events finished as shown on the VM console, and a remaining runtime of 2 days 7 hours as predicted by the BOINC Manager.

I personally am not too fond of such long tasks, and would be glad if I could opt out of them in the settings on the homepage.
Question for David: will this feature be introduced one day?
ID: 29685
gyllic
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 29686 - Posted: 28 Mar 2017, 6:09:30 UTC - in response to Message 29684.  

Has anyone encountered, or had experiences with, any of the WUs with TaskID=11016767 ?
<snip>
I seem to have acquired four of them and the BOINC estimated completion time has hit the roof at 13hr28m.

So I bumped one of them to the front of the queue ... and it bombed after only 10mins run time with a crazy looking output log https://lhcathome.cern.ch/lhcathome/result.php?resultid=129069334.

Either someone was having fun when they coded it or else my machine has a nasty stutter and has severely mangled the content of that file! Anyway, back to the "regular" ones for now and I'll get around to the remaining long ones some time tomorrow.

I have a failed task, too, with a similarly weird-looking log file, but with another task ID:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=129219392
ID: 29686
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,339,479
RAC: 101,863
Message 29687 - Posted: 28 Mar 2017, 6:11:28 UTC - in response to Message 29685.  

...At least this is shown with 4 tasks which one of my boxes received last night.
Each task (2-core) has run for about 4 hours so far, with 46 events finished as shown on the VM console, and a remaining runtime of 2 days 7 hours as predicted by the BOINC Manager.

2 of these tasks just finished (after 4+ hours) and got validated properly.

So obviously some of the data about their length that they deliver to the BOINC Manager is wrong.
ID: 29687
Dave Peachey
Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 29688 - Posted: 28 Mar 2017, 6:46:48 UTC - in response to Message 29685.  
Last modified: 28 Mar 2017, 6:51:25 UTC

In contrast to what David told us some time ago (Task ID for the "longrunners" = 10959636), they now seem to have Task ID 10995522.

Erich,

I would call the 10995522 tasks and their run times "normal" for me, given they usually only run for 4 hrs or so of total CPU time, which equates to around 1hr20m of elapsed time when running as a 4-core task and is what I'm used to seeing on my machine. That said, I too am now seeing BOINC Manager getting confused about their anticipated length (as are you), which raises another question about what might be going on here.

As of this morning, however, all of my anticipated "long runners" with TaskID=11016767 have errored out overnight (with a "Validate error" message) in times ranging from 10 mins to 17 mins. Indeed, I've also had several 10995522, 25 or 28 tasks do the same thing and/or not produce a HITS file.

It hasn't been a complete failure - I have had some of these run to completion and validation - but my error rate has increased alarmingly to eight of the last twenty WUs I've tried to process, so I wonder whether there is an external factor involved here (given my machine has previously been rock solid on these tasks).

I wonder if David C can shed any light on this?

Dave
ID: 29688
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,339,479
RAC: 101,863
Message 29689 - Posted: 28 Mar 2017, 7:04:35 UTC - in response to Message 29688.  

As of this morning, however, all of my anticipated "long runners" with TaskID=11016767 have errored out overnight (with a "Validate error" message) in times ranging from 10 mins to 17 mins.

Here too, some tasks errored out last night. Their Task IDs vary:
taskID=10995530; taskID=11016767

Obviously, the earlier information we received from David:
10947180 = "normal runs"
10959636 = "long runs"
is obsolete.
ID: 29689
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29691 - Posted: 28 Mar 2017, 7:39:11 UTC

All current tasks except the "longrunners" task 10959636 (of which there are around 40 WUs left) process 100 events. However, the time to process each event can vary per task - I have seen between 100 and 1200 seconds per event.
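As a rough back-of-the-envelope check, 100 events at 100-1200 seconds each already spans a very wide range of run times. A small sketch, assuming events are spread evenly over the VM's cores (optimistic, since setup and merging steps are not counted):

# Rough run-time spread for a 100-event task at 100-1200 s per event.
EVENTS = 100
for sec_per_event in (100, 1200):
    for cores in (1, 4):
        hours = EVENTS * sec_per_event / cores / 3600
        print(f"{sec_per_event:>4} s/event on {cores} core(s): ~{hours:.1f} h")
# roughly 0.7 h to 33 h, which matches the spread reported in this thread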

You can find info on the events and their processing time on the console as described in this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4170

If you see progress in the console then the task is good and it's worth letting it run.
ID: 29691