
Very long tasks in the queue



Message boards : ATLAS application : Very long tasks in the queue

David Cameron
Project administrator
Project developer
Project scientist
Send message
Joined: 13 May 14
Posts: 139
Credit: 3,159,531
RAC: 6,484
Message 29276 - Posted: 14 Mar 2017, 21:47:58 UTC

The current ATLAS tasks process 100 events, but as an experiment we have sent some tasks with 1000 events. We would like to see if it's possible to run tasks like these on ATLAS@Home because this is the same number of events each task processes on the ATLAS grid. It would make things a lot easier if the same tasks could run on ATLAS@Home as on the rest of the ATLAS grid.

These tasks will run 10 times longer than the other tasks and will generate an output file 10 times as large (500MB), so this may be an issue for those of you with low upload bandwidth. The advantage is that the initial download of 200MB is the same. Obviously using more cores will be better for these tasks, so they finish in a reasonable time.

To know whether you are running one of these tasks, rather than a regular task that is simply running long, check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID=10959636)" then you got one. The regular tasks have taskID=10947180.
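For anyone who wants to script that check, a minimal sketch (the taskID values come from this post; the helper name and the fabricated demo file are mine, not part of the project):

```shell
# Hypothetical helper: decide from a stderr.txt whether a slot is running
# one of the 1000-event tasks (taskID=10959636) or a regular one.
is_long_task() {
  grep -q "taskID=10959636" "$1"
}

# Demo against a fabricated stderr line rather than a live slot directory:
tmp=$(mktemp)
echo "Starting ATLAS job. (PandaID=xxx taskID=10959636)" > "$tmp"
if is_long_task "$tmp"; then kind="long (1000 events)"; else kind="regular (100 events)"; fi
echo "This slot is running a $kind task"
rm -f "$tmp"
```

Against a real installation you would point `is_long_task` at each `slots/*/stderr.txt` instead of the temp file.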

Please let us know your opinion in general about the length and data in/out requirements of ATLAS tasks. They are usually much shorter than the other vbox LHC projects - is this a good thing or would you prefer more consistency among the projects?

Profile HerveUAE
Avatar
Send message
Joined: 18 Dec 16
Posts: 120
Credit: 6,749,027
RAC: 20,218
Message 29281 - Posted: 15 Mar 2017, 1:59:09 UTC

We would like to see if it's possible to run tasks like these on ATLAS@Home

David, will these tasks be run both at ATLAS@Home and LHC@Home, or only at ATLAS@Home?
____________
We are the product of random evolution.

David Cameron
Project administrator
Project developer
Project scientist
Send message
Joined: 13 May 14
Posts: 139
Credit: 3,159,531
RAC: 6,484
Message 29284 - Posted: 15 Mar 2017, 8:28:56 UTC - in response to Message 29281.

We would like to see if it's possible to run tasks like these on ATLAS@Home

David, will these tasks be run both at ATLAS@Home and LHC@Home, or only at ATLAS@Home?


Only on LHC@Home. I know it's a bit confusing during this transition phase...

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 29286 - Posted: 15 Mar 2017, 9:03:13 UTC

These tasks will run 10 times longer ...

This makes feedback/discussion more difficult, as it takes very long to get a result. It makes sense only if the WUs run nearly 100% reliably.

... output file 10 times as large (500MB) ...

Does this mean an additional 450 MB of RAM during runtime?
A couple of users are already fighting to fulfil the RAM requirements for the current WUs.

The advantage is that the initial download of 200MB is the same.

+1

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?

Profile Yeti
Volunteer moderator
Avatar
Send message
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29297 - Posted: 15 Mar 2017, 11:39:06 UTC - in response to Message 29286.

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?

great idea

suggestions:

* make a second subproject, Atlas1000 or AtlasLongRunners or something similar
* hand these WUs only to PCs that have completed a minimum number of successful WUs
____________


Supporting BOINC, a great concept !

Profile Yeti
Volunteer moderator
Avatar
Send message
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29298 - Posted: 15 Mar 2017, 11:45:10 UTC - in response to Message 29276.

These tasks will run 10 times longer than the other tasks and will generate an output file 10 times as large (500MB), so this may be an issue for those of you with low upload bandwidth. The advantage is that the initial download of 200MB is the same. Obviously using more cores will be better for these tasks, so they finish in a reasonable time.

To know whether you are running one of these tasks, rather than a regular task that is simply running long, check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID=10959636)" then you got one. The regular tasks have taskID=10947180.

Did you increase the needed FLOPS or the deadline, or anything else, so that the BOINC client can recognize that the 10x runtime is normal behaviour?
____________


Supporting BOINC, a great concept !

David Cameron
Project administrator
Project developer
Project scientist
Send message
Joined: 13 May 14
Posts: 139
Credit: 3,159,531
RAC: 6,484
Message 29301 - Posted: 15 Mar 2017, 13:53:54 UTC

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?


Yes that's a good idea. I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers with long tasks and also the native Linux version, and keep the normal tasks for newcomers or those who crunch many projects.

The long tasks I put in the queue are just a proof of concept to see if it's possible at all.

Does this mean additional 450 MB RAM during runtime?


No, the output events are stored in files as they are produced, so the whole result is not in memory.

Did you increase the needed FLOPS or Deadline or anything that the BOINC-Client can recognize that the 10x runtime is normal behaviour ?


No, but it's a good point. The deadline is 2 weeks which should still be long enough. I guess the FLOPS are used to provide the estimated time, so these tasks will definitely run over and will stay at 99.999% completed for a long time. In future the FLOPS should be automatically set by the ATLAS systems generating the tasks.
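To illustrate why the estimate runs over (the figures below are invented for the example, not project values): BOINC derives its initial time estimate from the task's claimed FLOPs and the host benchmark, so a 10x workload shipped with an unchanged FLOPS figure gets a 1x prediction.

```shell
# Illustrative arithmetic only: both FLOPS figures are made up.
rsc_fpops_est=600000000000000   # claimed work: 6e14 FLOPs (unchanged from the 100-event tasks)
host_flops=5000000000           # host benchmark: 5 GFLOPS
est_seconds=$((rsc_fpops_est / host_flops))
echo "Client estimate: ${est_seconds}s; a 10x task actually needs ~$((est_seconds * 10))s"
```

With the estimate exhausted after the first tenth of the work, the client pins the progress display near 100% for the remaining nine tenths, which matches the "99.999% completed for a long time" behaviour described above.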

Profile rbpeake
Send message
Joined: 17 Sep 04
Posts: 55
Credit: 15,620,725
RAC: 3
Message 29303 - Posted: 15 Mar 2017, 15:18:55 UTC - in response to Message 29301.

I think it's a great idea!
____________
Regards,
Bob P.

Profile HerveUAE
Avatar
Send message
Joined: 18 Dec 16
Posts: 120
Credit: 6,749,027
RAC: 20,218
Message 29306 - Posted: 15 Mar 2017, 16:50:09 UTC

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?

+1
____________
We are the product of random evolution.

m
Send message
Joined: 6 Sep 08
Posts: 72
Credit: 3,626,193
RAC: 2,361
Message 29313 - Posted: 15 Mar 2017, 20:53:19 UTC - in response to Message 29301.

....I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers with long tasks and also the native Linux version, and keep the normal tasks for newcomers or those who crunch many projects....

Does this mean that you have got (or will get) the native app to run on Ubuntu, or will we need another distribution? If so, which one would be best?

Toby Broom
Volunteer moderator
Send message
Joined: 27 Sep 08
Posts: 376
Credit: 88,664,173
RAC: 174,222
Message 29314 - Posted: 15 Mar 2017, 21:04:19 UTC

How would I know whether it's hung or a long task? I normally abort them if they are at 1 day and 99%.

Profile rbpeake
Send message
Joined: 17 Sep 04
Posts: 55
Credit: 15,620,725
RAC: 3
Message 29315 - Posted: 15 Mar 2017, 21:20:23 UTC - in response to Message 29314.

How would I know whether it's hung or a long task? I normally abort them if they are at 1 day and 99%.

From the initial post:
To know whether you are running one of these tasks, rather than a regular task that is simply running long, check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID=10959636)" then you got one. The regular tasks have taskID=10947180.

Profile Yeti
Volunteer moderator
Avatar
Send message
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29318 - Posted: 16 Mar 2017, 6:53:56 UTC

I have got two of these longrunners.

I have checked the WUs in my local queue(s) with this little DOS-Command:

findstr 10959636 \\PHuW10\x$\BOINC_Data\_00\projects\lhcathome.cern.ch_lhcathome\boinc_job_script.*

replace the first part of the path (\\PHuW10\x$\BOINC_Data\_00) with the path to your own BOINC data directory

If you get some output you seem to have one or more of these longrunners.
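A rough Linux equivalent of that findstr check, as a sketch (the helper name is mine, and you would pass your own BOINC project directory, e.g. something like /var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome):

```shell
# Hypothetical helper mirroring the findstr one-liner above: list queued
# boinc_job_script files that mention the long-task ID 10959636.
check_queue() {
  grep -l "10959636" "$1"/boinc_job_script.* 2>/dev/null
}

# Demo against a throwaway directory instead of a live BOINC install:
demo=$(mktemp -d)
echo "--taskID 10959636" > "$demo/boinc_job_script.1"
matches=$(check_queue "$demo" | wc -l)
echo "$matches long-runner job script(s) found"
rm -rf "$demo"
```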
____________


Supporting BOINC, a great concept !

David Cameron
Project administrator
Project developer
Project scientist
Send message
Joined: 13 May 14
Posts: 139
Credit: 3,159,531
RAC: 6,484
Message 29319 - Posted: 16 Mar 2017, 8:16:27 UTC - in response to Message 29313.

....I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers with long tasks and also the native Linux version, and keep the normal tasks for newcomers or those who crunch many projects....

Does this mean that you have/will get the native app to run on Ubuntu or will we need another distribution, if so which one will be best?


As you saw it was not straightforward to run on Ubuntu out of the box. I am not sure at this moment how much work it would take to make it work. The best supported distributions are flavours of RHEL6 - this is what currently runs inside the ATLAS VM and most sites in the ATLAS grid. I was able to run the native app on CentOS7 no problem, so any recent version of RHEL/Fedora would probably work ok.

Once the migration to LHC@Home is complete I hope to go back to working on the native app.

Crystal Pellet
Volunteer moderator
Volunteer tester
Send message
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,809
RAC: 2,011
Message 29322 - Posted: 16 Mar 2017, 10:07:34 UTC

I got a long runner too: 2017-03-16 10:54:17 (CET) (7684): Guest Log: Starting ATLAS job. (PandaID=3283615871 taskID=10959636)

Running on 4 cores with 4400MB RAM. Just in time I decided not to switch back to single core ;)
The 'normal' task with that configuration needed 3h41m wall clock (CPU: 11 hours 38 min 38 sec), so this one should take about 37 hours elapsed on my machine.

Profile Yeti
Volunteer moderator
Avatar
Send message
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29324 - Posted: 16 Mar 2017, 10:10:42 UTC

My Longrunners show a 12 - 24 hour runtime, that's good

Thx David for fixing this
____________


Supporting BOINC, a great concept !

tullio
Send message
Joined: 19 Feb 08
Posts: 449
Credit: 2,076,809
RAC: 332
Message 29329 - Posted: 16 Mar 2017, 14:36:09 UTC
Last modified: 16 Mar 2017, 14:36:58 UTC

I got one with 75 hours estimated time on my 2 core Opteron 1210 running Linux. VirtualBox says 3600 MB. Most Atlas tasks validate on my Linux box and are invalid on the Windows 10 PC.
Tullio

PHILIPPE
Send message
Joined: 24 Jul 16
Posts: 65
Credit: 128,142
RAC: 434
Message 29331 - Posted: 16 Mar 2017, 16:48:53 UTC - in response to Message 29329.

Hi , Tullio ,

2017-03-16 10:30:54 (6184): Setting Memory Size for VM. (3600MB)
2017-03-16 10:30:54 (6184): Setting CPU Count for VM. (2)

Crystal Pellet and HerveUAE said that for 2-core WUs the VM needs nearly 4400 MB of RAM to run successfully.
You have only allocated 3600 MB. Try increasing the allocated RAM with an app_config.xml; you have enough RAM to do it.
It may fix the issue.
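A sketch of such an app_config.xml, placed in the lhcathome.cern.ch_lhcathome project directory (the app name, plan-class name, and the --memory_size_mb vboxwrapper option here are my assumptions; check the entries in your own client_state.xml before copying this):

```xml
<!-- Sketch only: verify app_name and plan_class against your client_state.xml -->
<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2</avg_ncpus>
    <cmdline>--memory_size_mb 4400</cmdline>
  </app_version>
</app_config>
```

After saving the file, re-read the config files from the BOINC Manager (or restart the client) for it to take effect.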

tullio
Send message
Joined: 19 Feb 08
Posts: 449
Credit: 2,076,809
RAC: 332
Message 29332 - Posted: 16 Mar 2017, 16:51:58 UTC - in response to Message 29331.

Ok, but they validate on the Linux box even if the elapsed time is greater than the CPU time. They don't validate on the Windows 10 PC, which has much more RAM.
Tullio

Crystal Pellet
Volunteer moderator
Volunteer tester
Send message
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,809
RAC: 2,011
Message 29333 - Posted: 16 Mar 2017, 17:04:48 UTC - in response to Message 29332.

Ok, but they validate on the Linux box even if the elapsed time is greater than the CPU time. They don't validate on the Windows 10 PC, which has much more RAM.
Tullio

I think there is something wrong with the validator for the Linux tasks.
None of your valid tasks on your Linux box shows the HITS*.root result file of about 60MB for upload.
IMO those tasks can't be valid.

