Message boards : ATLAS application : Very long tasks in the queue

HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29470 - Posted: 21 Mar 2017, 3:51:22 UTC

I found a WU that failed once with EXIT_DISK_LIMIT_EXCEEDED and then executed OK on one of my machines: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60711170
On the first machine it ran with 6 cores and 7600 MB memory and failed with a peak disk usage of 6,579.56 MB. Then on my machine it ran with 2 cores and 4400 MB memory and finished OK with a peak disk usage of 3,394.09 MB.
What could explain a different disk usage for the same WU?
We are the product of random evolution.
ID: 29470
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 29471 - Posted: 21 Mar 2017, 7:03:44 UTC - in response to Message 29466.  

I have 4 that failed with EXIT_DISK_LIMIT_EXCEEDED 5-7GB

It's a bit irritating, though, that the log shows a HITS file was produced and the task still failed.

The disk space one ATLAS task may occupy in its working slot directory is limited to 6,000,000,000 bytes.

Suspending a running task with LAIM off, or shutting down the BOINC client, creates a snapshot in the slot directory from which the VM is restored when the task resumes.
I ran a test to see how big that snapshot can get: 1,937,670,144 bytes.
Together with the VM's vdi file and all other files, the slot contained 5,545,902,080 bytes.
So I think the project has to increase the rsc_disk_bound.
For my 4 running tasks I increased that value manually to 10,000,000,000 bytes to avoid the disk limit being exceeded and crashing the tasks.
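A quick way to keep an eye on this is to sum the size of a slot directory and compare it against the 6,000,000,000-byte bound. A minimal Python sketch; the slot path is only a placeholder and should be adjusted to your own BOINC data directory:

# Minimal sketch: sum the disk usage of one BOINC slot directory and
# compare it with the default ATLAS rsc_disk_bound of 6,000,000,000 bytes.
# The slot path below is a placeholder - adjust it to your installation.
import os

SLOT = "/var/lib/boinc-client/slots/0"   # placeholder path
DISK_BOUND = 6_000_000_000               # bytes, default rsc_disk_bound

total = 0
for root, dirs, files in os.walk(SLOT):
    for name in files:
        try:
            total += os.path.getsize(os.path.join(root, name))
        except OSError:
            pass  # a file may disappear while the VM is running

print(f"slot usage: {total:,} bytes ({total / DISK_BOUND:.0%} of the bound)")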
ID: 29471
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 29472 - Posted: 21 Mar 2017, 7:36:19 UTC - in response to Message 29457.  

My resend task started as a single core VM and I'll try to restart it with 4 cores.

How do you do that? I mean restarting a single core task with a different number of cores?

It's only useful when the task has not been running for long, because work already done will be lost.
- Suspend the task in BOINC with "Leave applications in memory" (LAIM) set to off.
- The VM will save its state. Discard the saved state with VirtualBox Manager.
- Change the settings with VirtualBox Manager. In my example, CPUs to 4 and memory to 4400 MB or more.
- Start the VM with VirtualBox Manager and let it run for about 10 minutes.
- Shut down the VM while saving its state.
- Resume the task with BOINC Manager.
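The same steps can also be scripted with VBoxManage instead of the GUI. A rough sketch in Python, assuming the task is already suspended in BOINC (LAIM off) and the VM is in the saved state; the VM name is a placeholder, take the real one from "VBoxManage list vms":

# Rough sketch of the steps above using VBoxManage.
import subprocess, time

VM_NAME = "boinc_example"   # placeholder, see "VBoxManage list vms"

def vbox(*args):
    subprocess.run(["VBoxManage", *args], check=True)

vbox("discardstate", VM_NAME)                    # discard the saved state
vbox("modifyvm", VM_NAME, "--cpus", "4",
     "--memory", "4400")                         # 4 cores, 4400 MB RAM
vbox("startvm", VM_NAME, "--type", "headless")   # let it run for a while
time.sleep(10 * 60)
vbox("controlvm", VM_NAME, "savestate")          # shut down, saving the state
# then resume the task with BOINC Manager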
ID: 29472
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29475 - Posted: 21 Mar 2017, 9:05:12 UTC - in response to Message 29471.  

So I think the project has to increase the rsc_disk_bound.


Thanks, we'll bear that in mind when we have the dedicated longrunners app. I don't want to increase the limit in general for ATLAS because it can limit how many tasks people can run, and 6 GB seems OK for the "shortrunners".
ID: 29475
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,339,479
RAC: 101,863
Message 29491 - Posted: 21 Mar 2017, 14:25:05 UTC - in response to Message 29276.  

To know if you are running one of these tasks and that it's not a regular "longrunner" you can check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID: taskID=10959636)" then you got one. The regular tasks have taskID=10947180.

I now have tasks with taskID=10995520.

This number is not shown in what you are saying above. What type of task is this now?
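A minimal way to run the check described in the quote above is to scan each slot's stderr.txt for the "Starting ATLAS job" line and print the taskID. A small Python sketch; the slots path is a placeholder for your BOINC data directory:

# Sketch: print the taskID found in each slot's stderr.txt.
import glob, re

for path in glob.glob("/var/lib/boinc-client/slots/*/stderr.txt"):
    with open(path, errors="ignore") as f:
        for line in f:
            if "Starting ATLAS job" in line:
                m = re.search(r"taskID=(\d+)", line)
                if m:
                    print(path, "->", m.group(1))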
ID: 29491
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 29492 - Posted: 21 Mar 2017, 14:32:40 UTC - in response to Message 29491.  

To know if you are running one of these tasks and that it's not a regular "longrunner" you can check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID: taskID=10959636)" then you got one. The regular tasks have taskID=10947180.

I now have tasks with taskID=10995520.

This number is not shown in what you are saying above. What type of task is this now?

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4170&postid=29477#29477


Supporting BOINC, a great concept !
ID: 29492
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 798
Credit: 644,684,135
RAC: 235,456
Message 29497 - Posted: 21 Mar 2017, 16:58:46 UTC

Some more ran successfully today, so it seems they're OK as long as I don't close BOINC ;)
ID: 29497
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29504 - Posted: 21 Mar 2017, 19:15:27 UTC
Last modified: 21 Mar 2017, 19:15:56 UTC

It's only useful when the task has not been running for long, because work already done will be lost.
- Suspend the task in BOINC with "Leave applications in memory" (LAIM) set to off.
- The VM will save its state. Discard the saved state with VirtualBox Manager.
- Change the settings with VirtualBox Manager. In my example, CPUs to 4 and memory to 4400 MB or more.
- Start the VM with VirtualBox Manager and let it run for about 10 minutes.
- Shut down the VM while saving its state.
- Resume the task with BOINC Manager.
Many thanks Crystal.

So I think the project has to increase the rsc_disk_bound.
I have had the same problem of the disk limit being exceeded on one longrunner. How can I change that parameter locally?
We are the product of random evolution.
ID: 29504
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 29509 - Posted: 21 Mar 2017, 20:53:05 UTC - in response to Message 29504.  

So I think the project has to increase the rsc_disk_bound.
I have had the same problem of the disk limit being exceeded on one longrunner. How can I change that parameter locally?

You probably refer to "For my 4 running tasks I increased that value manually to 10,000,000,000 bytes to avoid the disk limit being exceeded and crashing the tasks.".
This is a bit tricky. First you have to shut down BOINC.

1st obstacle: When you have several VM tasks running, they all have to save their states to disk within 60 seconds, otherwise a VM may not be saved properly and will end up in the stopped state.
That VM will not resume properly. In the best case it will start from scratch.
To avoid this obstacle: before stopping BOINC, suspend the tasks (LAIM off) one after another, so each VM gets time to save its state to disk.

2nd obstacle: When saving the VM, the working slot directory could already exceed the disk bound :(

3rd obstacle: You have to replace (with a basic editor, e.g. Notepad) the <rsc_disk_bound>6000000000.000000</rsc_disk_bound> of the workunits in client_state.xml with a higher value.
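A sketch of that 3rd step in Python, to be run only while BOINC is completely shut down. The path is a placeholder, and note that the substitution raises the bound for every workunit that still has the 6,000,000,000-byte value:

# Sketch: raise <rsc_disk_bound> in client_state.xml (BOINC must be stopped).
import re, shutil

PATH = "/var/lib/boinc-client/client_state.xml"   # placeholder location
NEW_BOUND = "10000000000.000000"

shutil.copy(PATH, PATH + ".bak")                  # keep a backup first
with open(PATH) as f:
    text = f.read()

text = re.sub(r"<rsc_disk_bound>6000000000\.000000</rsc_disk_bound>",
              f"<rsc_disk_bound>{NEW_BOUND}</rsc_disk_bound>", text)

with open(PATH, "w") as f:
    f.write(text)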
ID: 29509
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29510 - Posted: 21 Mar 2017, 21:08:50 UTC

3rd obstacle: You have to replace (with a basic editor, e.g. Notepad) the <rsc_disk_bound>6000000000.000000</rsc_disk_bound> of the workunits in client_state.xml with a higher value.
OK, I did that. Let's see how it goes with the one longrunner I still have :).
We are the product of random evolution.
ID: 29510
Darrell
Joined: 8 Jul 08
Posts: 20
Credit: 25,935,687
RAC: 17,897
Message 29542 - Posted: 22 Mar 2017, 23:39:54 UTC - in response to Message 29298.  

@ Yeti:

I think NOT DONE! I am running my first 16 core Atlas with an estimated runtime of 1hr 20min, and it is already at 13hr 47min.
ID: 29542
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 29546 - Posted: 23 Mar 2017, 6:38:00 UTC - in response to Message 29542.  

Darrell wrote:
@ Yeti:

I think NOT DONE!


What is not done ?

Darrell wrote:
I am running my first 16 core Atlas with an estimated runtime of 1hr 20min, and it is already at 13hr 47min.

1) Did you set 16 cores with app_config.xml?
2) 16-core will be very inefficient
3) I would guess that the WU is already dead, but you could check it with this: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4170&postid=29473#29473


Supporting BOINC, a great concept !
ID: 29546
Dave Peachey
Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 29683 - Posted: 27 Mar 2017, 19:55:45 UTC
Last modified: 27 Mar 2017, 20:06:24 UTC

Morning/afternoon/evening all,

Has anyone encountered, or had experience with, any of the WUs with TaskID=11016767? Looking at the reworked version of the old Atlas@Home home page (http://lhcathome.web.cern.ch/projects/atlas - which is a very useful page!), these are designated as
Task mc16_13TeV DP2500_3000.simul (11016767) with 0/643 in progress - although I expect that's a tad out of date.

I seem to have acquired four of them and the BOINC estimated completion time has hit the roof at 13hr28m. My normal run time (running in 4-core multi-thread mode) is on the order of 1hr20m to 1hr30m for the tasks with ID 10995522/25/28, which BOINC has now come to recognise along with the regular long-running 10995515/17 tasks (clocking in at between 4hr and 4hr30m). However, as these are new (at least to my BOINC installation), it could just be that BOINC doesn't know what to make of them (yet).

Does anyone know if these are a variation on the fabled 1000-event WUs (TaskID=10959636) or just seriously long runners of a different sort?

I suspect that if I knew how to correctly interpret the above task string I'd be able to answer my own question. I'd venture that "13TeV" is the energy in tera-electron volts, but I don't understand the "DP2500_3000" part. Perhaps this is something on which David C could enlighten us in his "Information on ATLAS tasks" sticky thread?

Cheers
Dave
ID: 29683
Dave Peachey
Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 29684 - Posted: 27 Mar 2017, 21:47:59 UTC - in response to Message 29683.  

Has anyone encountered, or had experience with, any of the WUs with TaskID=11016767?
<snip>
I seem to have acquired four of them and the BOINC estimated completion time has hit the roof at 13hr28m.

So I bumped one of them to the front of the queue ... and it bombed after only 10mins run time with a crazy looking output log https://lhcathome.cern.ch/lhcathome/result.php?resultid=129069334.

Either someone was having fun when they coded it or else my machine has a nasty stutter and has severely mangled the content of that file! Anyway, back to the "regular" ones for now and I'll get around to the remaining long ones some time tomorrow.
ID: 29684
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,339,479
RAC: 101,863
Message 29685 - Posted: 28 Mar 2017, 5:47:00 UTC

In contrast to what David told us some time ago (Task ID for the "longrunners" = 10959636), they now seem to have Task ID 10995522.

At least this is shown with 4 tasks which one of my boxes received last night.
Each task (2-core) has run for about 4 hours so far, with 46 events finished as shown on the VM console, and a remaining runtime of 2 days 7 hours as predicted by the BOINC Manager.

I personally am not too fond of such long tasks, and would be glad if I could opt out of them in the settings on the homepage.
Question for David: will this feature be introduced one day?
ID: 29685
gyllic
Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 29686 - Posted: 28 Mar 2017, 6:09:30 UTC - in response to Message 29684.  

Has anyone encountered, or had experiences with, any of the WUs with TaskID=11016767 ?
<snip>
I seem to have acquired four of them and the BOINC estimated completion time has hit the roof at 13hr28m.

So I bumped one of them to the front of the queue ... and it bombed after only 10mins run time with a crazy looking output log https://lhcathome.cern.ch/lhcathome/result.php?resultid=129069334.

Either someone was having fun when they coded it or else my machine has a nasty stutter and has severely mangled the content of that file! Anyway, back to the "regular" ones for now and I'll get around to the remaining long ones some time tomorrow.

I have a failed task, too, with a similarly weird-looking log file, but with another task ID:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=129219392
ID: 29686
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,339,479
RAC: 101,863
Message 29687 - Posted: 28 Mar 2017, 6:11:28 UTC - in response to Message 29685.  

...At least this is shown with 4 tasks which one of my boxes received last night.
Each task (2-core) has run for about 4 hours so far, with 46 events finished as shown on the VM console, and a remaining runtime of 2 days 7 hours as predicted by the BOINC Manager.

2 of these tasks just finished (after 4+ hours) and got validated properly.

So obviously some of the data about their length that they deliver to the BOINC Manager is wrong.
ID: 29687
Dave Peachey
Joined: 9 May 09
Posts: 17
Credit: 772,975
RAC: 0
Message 29688 - Posted: 28 Mar 2017, 6:46:48 UTC - in response to Message 29685.  
Last modified: 28 Mar 2017, 6:51:25 UTC

In contrast to what David told us some time ago (Task ID for the "longrunners" = 10959636), they now seem to have Task ID 10995522.

Erich,

I would call the 10995522 tasks and their run times "normal" for me, given they usually only run for 4 hrs or so of total CPU time, which equates to around 1hr20m of elapsed time when running as a 4-core task and is what I'm used to seeing on my machine. That said, I too am now seeing BOINC Manager getting confused about their anticipated length (as are you), which raises another question about what might be going on here.

As of this morning, however, all of my anticipated "long runners" with TaskID=11016767 have errored out overnight (with a "Validate error" message) in times ranging from 10 mins to 17 mins. Indeed, I've also had several 10995522, 25 or 28 tasks do the same thing and/or not produce a HITS file.

It hasn't been a complete failure - I have had some of these run to completion and validation - but my error rate has increased alarmingly to eight of the last twenty WUs I've tried to process, so I wonder whether there is an external factor involved here (given my machine has previously been rock solid on these tasks).

I wonder if David C can shed any light on this?

Dave
ID: 29688
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,339,479
RAC: 101,863
Message 29689 - Posted: 28 Mar 2017, 7:04:35 UTC - in response to Message 29688.  

As of this morning, however, all of my anticipated "long runners" with TaskID=11016767 have errored out overnight (with a "Validate error" message) in times ranging from 10 mins to 17 mins.

Here too, some tasks errored out last night. Their Task IDs vary:
taskID=10995530; taskID=11016767

Obviously, the earlier information we received from David:
10947180 = "normal runs"
10959636 = "long runs"
is obsolete.
ID: 29689
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29691 - Posted: 28 Mar 2017, 7:39:11 UTC

All current tasks except the "longrunners" task 10959636 (of which there are around 40 WUs left) process 100 events. However, the time to process each event can vary per task - I have seen between 100 and 1200 seconds per event.
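As a rough back-of-the-envelope check, 100 events at 100-1200 seconds each already spans a very wide range of run times. A small sketch, assuming events are spread evenly over the VM's cores (optimistic, since setup and merging steps are not counted):

# Rough run-time spread for a 100-event task at 100-1200 s per event.
EVENTS = 100
for sec_per_event in (100, 1200):
    for cores in (1, 4):
        hours = EVENTS * sec_per_event / cores / 3600
        print(f"{sec_per_event:>4} s/event on {cores} core(s): ~{hours:.1f} h")
# roughly 0.7 h to 33 h, which matches the spread reported in this thread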

You can find info on the events and their processing time on the console as described in this thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4170

If you see progress in the console then the task is good and it's worth letting it run.
ID: 29691