Very long tasks in the queue
Author | Message |
---|---|
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
I found a WU that failed once with EXIT_DISK_LIMIT_EXCEEDED and then executed OK on one of my machines: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60711170 On the first machine it ran with 6-core, 7600 MB memory, and failed with Peak disk usage 6,579.56 MB. Then on my machine it ran with 2-core, 4400 MB memory and finished OK with Peak disk usage 3,394.09 MB. What could explain a different disk usage for the same WU? We are the product of random evolution. |
Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,193 RAC: 2,531 |
I have 4 that failed with EXIT_DISK_LIMIT_EXCEEDED (5-7 GB). The limit on the disk space that one ATLAS task may occupy in its working slot directory is set to 6,000,000,000 bytes. Suspending a running task with LAIM off or shutting down the BOINC client will create a snapshot in the slot directory to restore from after the task resumes. I did a test to see how big that snapshot could be: 1,937,670,144 bytes. Together with the VM's vdi file and all other files, the slot contains 5,545,902,080 bytes. So I think the project has to increase the rsc_disk_bound. For my 4 running tasks I increased that size manually to 10,000,000,000 to avoid the disk limit exceeding making the task crash. |
Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,193 RAC: 2,531 |
My resend task started as a single-core VM and I'll try to restart it with 4 cores. It's only useful when the task has not been running that long, because the work done will be lost. - Suspend the task in BOINC with "Leave applications in memory" (LAIM) set to off. - The VM will save its state. Discard the saved state with VirtualBox Manager. - Change the settings with VirtualBox Manager. In my example, CPUs to 4 and memory to 4400 MB or more. - Start the VM with VirtualBox Manager and let it run for about 10 minutes. - Shut down the VM while saving the state. - Resume the task with BOINC Manager. |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
So I think the project has to increase the rsc_disk_bound. Thanks, we'll bear that in mind when we have the dedicated longrunners app. I don't want to increase the limit in general for ATLAS because it can limit how many tasks people can run, and 6GB seems ok for the "shortrunners". |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,951,293 RAC: 82,039 |
To know if you are running one of these tasks and that it's not a regular "longrunner" you can check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID: taskID=10959636)" then you got one. The regular tasks have taskID=10947180. I now have tasks with taskID=10995520. This number is not shown in what you are saying above. What type of task is this now? |
Joined: 2 Sep 04 Posts: 468 Credit: 214,945,248 RAC: 46,614 |
To know if you are running one of these tasks and that it's not a regular "longrunner" you can check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID: taskID=10959636)" then you got one. The regular tasks have taskID=10947180. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4170&postid=29477#29477 Supporting BOINC, a great concept ! |
Joined: 27 Sep 08 Posts: 880 Credit: 746,983,875 RAC: 325,500 |
Some more ran successfully today, so they seem OK as long as I don't close BOINC ;) |
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
It's only useful when the task has not been running that long, because the work done will be lost. Many thanks Crystal. So I think the project has to increase the rsc_disk_bound. I have had the same problem of exceeded memory on one longrunner. How can I change that parameter locally? We are the product of random evolution. |
Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,193 RAC: 2,531 |
So I think the project has to increase the rsc_disk_bound. I have had the same problem of exceeded memory on one longrunner. How can I change that parameter locally? You probably refer to "For my 4 running tasks I increased that size manually to 10,000,000,000 to avoid the disk limit exceeding making the task crash.". This is a bit tricky. First you have to shut down BOINC. 1st obstacle: When you have several VM tasks running, they all have to save their states to disk within 60 seconds; otherwise a VM may not be saved properly and will end up in the stopped state. That VM will not resume properly. In the best case it will start from scratch. To avoid this obstacle: Before stopping BOINC, suspend the tasks (LAIM off) one after another, so each VM gets the time to save its state to disk. 2nd obstacle: When saving the VM, the working slot directory could already exceed the disk bound :( 3rd obstacle: You have to replace (with a basic editor - e.g. notepad) in client_state.xml the <rsc_disk_bound>6000000000.000000</rsc_disk_bound> from the workunits to a higher value. |
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
3rd obstacle: You have to replace (with a basic editor - e.g. notepad) in client_state.xml the <rsc_disk_bound>6000000000.000000</rsc_disk_bound> from the workunits to a higher value. OK, I did that. Let's see how it goes with the one longrunner I still have :). We are the product of random evolution. |
Joined: 8 Jul 08 Posts: 23 Credit: 38,826,180 RAC: 33,739 |
@ Yeti: I think NOT DONE! I am running my first 16 core Atlas with an estimated runtime of 1hr 20min, and it is already at 13hr 47min. |
Joined: 2 Sep 04 Posts: 468 Credit: 214,945,248 RAC: 46,614 |
Darrell wrote: @ Yeti: What is not done? Darrell wrote: I am running my first 16 core Atlas with an estimated runtime of 1hr 20min, and it is already at 13hr 47min. 1) Did you set 16 cores with app_config.xml? 2) 16 cores will be very inefficient. 3) I would guess that the WU is already dead, but you could check it with this: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4170&postid=29473#29473 Supporting BOINC, a great concept ! |
Joined: 9 May 09 Posts: 17 Credit: 772,975 RAC: 0 |
Morning/afternoon/evening all, Has anyone encountered, or had experiences with, any of the WUs with TaskID=11016767 ? Looking at the reworked version of the old Atlas@Home home page (http://lhcathome.web.cern.ch/projects/atlas - which is a very useful page!), these are designated as Task mc16_13TeV DP2500_3000.simul (11016767) with 0/643 in progress - although I expect that's a tad out of date. I seem to have acquired four of them and the BOINC estimated completion time has hit the roof at 13hr28m. My normal run time (running in 4-core multi-thread mode) is on the order of 1hr20m to 1hr30m for the tasks with ID 10995522/25/28, which BOINC has now come to recognise along with the regular long-running 10995515/17 tasks (clocking in at between 4hr and 4hr30m). However, as these are new (at least to my BOINC installation), it could just be that BOINC doesn't know what to make of them (yet). Does anyone know if these are a variation on the fabled 1000-event WUs (TaskID=10959636) or just seriously long runners of a different sort? I suspect that if I knew how to correctly interpret the above task string I'd be able to answer my own question. I'd venture that "13TeV" is the energy in tera-electronvolts but I don't understand the "DP2500_3000" part. Perhaps this is something on which David C could enlighten us in his "Information on ATLAS tasks" sticky thread? Cheers Dave |
Joined: 9 May 09 Posts: 17 Credit: 772,975 RAC: 0 |
Has anyone encountered, or had experiences with, any of the WUs with TaskID=11016767 ? So I bumped one of them to the front of the queue ... and it bombed after only 10mins run time with a crazy looking output log https://lhcathome.cern.ch/lhcathome/result.php?resultid=129069334. Either someone was having fun when they coded it or else my machine has a nasty stutter and has severely mangled the content of that file! Anyway, back to the "regular" ones for now and I'll get around to the remaining long ones some time tomorrow. |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,951,293 RAC: 82,039 |
In contrast to what David told us some time ago (Task ID for the "longrunners" = 10959636), they now seem to have Task ID 10995522. At least this is shown with 4 tasks which one of my boxes received last night. Each task (2-core) has run for about 4 hours so far, 46 tasks finished as seen on the VM console, and remaining runtime 2 days 7 hours, as predicted by the BOINC Manager. I personally am not too fond of such long tasks, and would be glad if I could opt out of them in the settings on the homepage. Question for David: will this feature be introduced one day? |
Joined: 9 Dec 14 Posts: 202 Credit: 2,539,793 RAC: 175 |
Has anyone encountered, or had experiences with, any of the WUs with TaskID=11016767 ? I have a failed task too, with a similarly weird-looking log file but a different task ID: https://lhcathome.cern.ch/lhcathome/result.php?resultid=129219392 |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,951,293 RAC: 82,039 |
...At least this is shown with 4 tasks which one of my boxes received last night. 2 of these tasks just finished (after 4+ hours) and got validated properly. So, obviously some data regarding their length which they deliver to the BOINC Manager is wrong. |
Joined: 9 May 09 Posts: 17 Credit: 772,975 RAC: 0 |
In contrast to what David told us some time ago (Task ID for the "longrunners" = 10959636), they now seem to have Task ID 10995522. Erich, I would call the 10995522 tasks and their run times "normal" for me given they usually only run for 4hrs or so of total CPU time - which equates to around 1hr20m of elapsed time when running as a 4-core task and which is what I'm used to seeing on my machine. Although I too am now seeing BOINC Manager getting confused as to their anticipated length (as are you), which raises another question about what might be going on here. As of this morning, however, all of my anticipated "long runners" with TaskID=11016767 have errored out overnight (with a "Validate error" message) in times ranging from 10mins to 17mins. Indeed, I've also had several 10995522, 25 or 28 tasks do the same thing and/or not produce a HITS file. It hasn't been a complete failure - I have had some of these run to completion and validation - but my error rate has increased alarmingly to eight of the last twenty WUs I've tried to process, so I wonder whether there is an external factor involved here (given my machine has previously been rock solid on these tasks). I wonder if David C can shed any light on this? Dave |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,951,293 RAC: 82,039 |
As of this morning, however, all of my anticipated "long runners" with TaskID=11016767 have errored out overnight (with a "Validate error" message) in times ranging from 10mins to 17mins. Here too, some tasks errored out last night. Their Task IDs vary: taskID=10995530; taskID=11016767. Obviously, the former information we received from David (10947180 = "normal runs", 10959636 = "long runs") is obsolete. |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
All current tasks except the "longrunners" task 10959636 (of which there are around 40 WU left) process 100 events. However, the time to process each event can vary per task - I have seen between 100 and 1200 seconds per event. You can find info on the events and their processing time on the console as described in this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4170 If you see progress in the console then the task is good and it's worth letting it run. |
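
The figures in the last post (100 events per task, between 100 and 1200 seconds per event) allow a rough runtime estimate. The following is only a sketch: the function name is hypothetical, and it assumes the event processing divides evenly across cores, which is an idealisation that real multi-core ATLAS tasks will not fully reach.

```python
def estimate_wall_hours(events: int, seconds_per_event: float, cores: int = 1) -> float:
    """Rough wall-clock hours for a task.

    Assumption (not from the thread): perfect scaling across cores,
    so total CPU time is simply divided by the core count.
    """
    if events < 0 or seconds_per_event < 0 or cores < 1:
        raise ValueError("events and seconds_per_event must be >= 0, cores >= 1")
    total_cpu_seconds = events * seconds_per_event
    return total_cpu_seconds / cores / 3600.0


# Range quoted in the thread: 100 events at 100 to 1200 seconds each.
for seconds_per_event in (100, 1200):
    for cores in (1, 4):
        hours = estimate_wall_hours(100, seconds_per_event, cores)
        print(f"{seconds_per_event:>5} s/event on {cores} core(s): about {hours:.1f} h")
```

Under these assumptions the same 100-event task can take anywhere from under an hour to more than a day, which is consistent with BOINC's own estimate being unreliable for task types it has not seen before.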
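
The manual edit of `rsc_disk_bound` described earlier in the thread could also be scripted. The sketch below is an illustration, not a project-supplied tool: the function name is hypothetical, it works on the file's text only, and it raises every `<rsc_disk_bound>` entry below the new value rather than only those of ATLAS workunits. As in the manual procedure, the BOINC client is assumed to be shut down first (after suspending the VM tasks one by one), and keeping a backup copy of client_state.xml is advisable.

```python
import re

# Matches <rsc_disk_bound>NUMBER</rsc_disk_bound>, e.g. 6000000000.000000
_BOUND = re.compile(
    r"(<rsc_disk_bound>)\s*([0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)\s*(</rsc_disk_bound>)"
)


def raise_disk_bound(xml_text: str, new_bound: float = 10_000_000_000) -> str:
    """Return xml_text with every rsc_disk_bound below new_bound raised to it.

    Entries that are already at or above new_bound are left untouched.
    """
    def _replace(match: re.Match) -> str:
        current = float(match.group(2))
        if current >= new_bound:
            return match.group(0)
        return f"{match.group(1)}{new_bound:.6f}{match.group(3)}"

    return _BOUND.sub(_replace, xml_text)


# Example on an in-memory fragment shaped like the entry quoted in the thread.
sample = (
    "<workunit>\n"
    "    <rsc_disk_bound>6000000000.000000</rsc_disk_bound>\n"
    "</workunit>\n"
)
print(raise_disk_bound(sample))
```

To apply it, one would read client_state.xml from the BOINC data directory (its location depends on the installation), pass the text through `raise_disk_bound`, and write the result back before restarting the client.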