Message boards : ATLAS application : Very long tasks in the queue

David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 154
Credit: 3,472,372
RAC: 879
Message 29276 - Posted: 14 Mar 2017, 21:47:58 UTC

The current ATLAS tasks process 100 events, but as an experiment we have sent some tasks with 1000 events. We would like to see if it's possible to run tasks like these on ATLAS@Home because this is the same number of events each task processes on the ATLAS grid. It would make things a lot easier if the same tasks could run on ATLAS@Home as on the rest of the ATLAS grid.

These tasks will run 10 times longer than the other tasks and will generate an output file 10 times as large (500MB), so this may be an issue for those of you with low upload bandwidth. The advantage is that the initial download of 200MB is the same. Obviously using more cores will be better for these tasks, so they finish in a reasonable time.
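To give a feel for what the larger output means on a slow connection, here is a minimal sketch of the upload time. The 50 MB figure for a regular task is only implied by "10 times as large (500MB)", and the uplink speeds are hypothetical examples, not measurements.

```python
def upload_minutes(size_mb: float, uplink_mbit_s: float) -> float:
    """Minutes needed to upload size_mb megabytes at uplink_mbit_s Mbit/s."""
    size_mbit = size_mb * 8  # 1 byte = 8 bits
    return size_mbit / uplink_mbit_s / 60


REGULAR_OUTPUT_MB = 50   # implied by "10 times as large (500MB)"
LONG_OUTPUT_MB = 500     # from the post

for uplink in (1, 5, 20):  # hypothetical uplink speeds in Mbit/s
    regular = upload_minutes(REGULAR_OUTPUT_MB, uplink)
    long_task = upload_minutes(LONG_OUTPUT_MB, uplink)
    print(f"{uplink:>2} Mbit/s: regular {regular:6.1f} min, long {long_task:6.1f} min")
```

This ignores protocol overhead and any other traffic on the line, so real uploads will take somewhat longer.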

To know if you are running one of these tasks and that it's not a regular "longrunner" you can check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID: taskID=10959636)" then you got one. The regular tasks have taskID=10947180.
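The check described above could be scripted roughly as follows. This is only a sketch: the two taskID values come from the post, while the function name and the "unknown"/"other" labels are made up for illustration.

```python
import re

LONG_TASK_ID = "10959636"     # 1000-event test tasks (from the post)
REGULAR_TASK_ID = "10947180"  # regular 100-event tasks (from the post)

# Captures the last "taskID=<digits>" on a "Starting ATLAS job." line.
START_LINE = re.compile(r"Starting ATLAS job\..*taskID=(\d+)")


def classify(stderr_text: str) -> str:
    """Return 'long', 'regular', 'other' or 'unknown' for a stderr.txt dump."""
    match = START_LINE.search(stderr_text)
    if match is None:
        return "unknown"  # the job has not logged its start line (yet)
    task_id = match.group(1)
    if task_id == LONG_TASK_ID:
        return "long"
    if task_id == REGULAR_TASK_ID:
        return "regular"
    return "other"
```

To use it, read the stderr.txt of a running slot into a string and pass it to classify(); the slot path depends on your BOINC data directory.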

Please let us know your opinion in general about the length and data in/out requirements of ATLAS tasks. They are usually much shorter than the other vbox LHC projects - is this a good thing or would you prefer more consistency among the projects?
ID: 29276
HerveUAE
Joined: 18 Dec 16
Posts: 120
Credit: 7,224,211
RAC: 3,941
Message 29281 - Posted: 15 Mar 2017, 1:59:09 UTC

We would like to see if it's possible to run tasks like these on ATLAS@Home

David, will these tasks be run both at ATLAS@Home and LHC@Home, or only at ATLAS@Home?
We are the product of random evolution.
ID: 29281
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 154
Credit: 3,472,372
RAC: 879
Message 29284 - Posted: 15 Mar 2017, 8:28:56 UTC - in response to Message 29281.  

We would like to see if it's possible to run tasks like these on ATLAS@Home

David, will these tasks be run both at ATLAS@Home and LHC@Home, or only at ATLAS@Home?


Only on LHC@Home. I know it's a bit confusing during this transition phase...
ID: 29284
computezrmle
Joined: 15 Jun 08
Posts: 426
Credit: 4,434,292
RAC: 10,149
Message 29286 - Posted: 15 Mar 2017, 9:03:13 UTC

These tasks will run 10 times longer ...

This makes feedback and discussion more difficult, as it takes a very long time to get a result. It makes sense if the WUs run nearly 100% reliably.

... output file 10 times as large (500MB) ...

Does this mean an additional 450 MB of RAM during runtime?
A couple of users are already fighting to fulfil the RAM requirements for the current WUs.

The advantage is that the initial download of 200MB is the same.

+1

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?
ID: 29286
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 315
Credit: 43,501,608
RAC: 30,641
Message 29297 - Posted: 15 Mar 2017, 11:39:06 UTC - in response to Message 29286.  

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?

great idea

suggestions:

* make a second subproject Atlas1000 or AtlasLongRunners or something similar
* hand these WUs only to PCs that have completed a minimum number of successful WUs


Supporting BOINC, a great concept !
ID: 29297
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 315
Credit: 43,501,608
RAC: 30,641
Message 29298 - Posted: 15 Mar 2017, 11:45:10 UTC - in response to Message 29276.  

These tasks will run 10 times longer than the other tasks and will generate an output file 10 times as large (500MB), so this may be an issue for those of you with low upload bandwidth. The advantage is that the initial download of 200MB is the same. Obviously using more cores will be better for these tasks, so they finish in a reasonable time.

To know if you are running one of these tasks and that it's not a regular "longrunner" you can check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID: taskID=10959636)" then you got one. The regular tasks have taskID=10947180.

Did you increase the needed FLOPS or the deadline, or anything else, so that the BOINC client can recognize that the 10x runtime is normal behaviour?


Supporting BOINC, a great concept !
ID: 29298
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 154
Credit: 3,472,372
RAC: 879
Message 29301 - Posted: 15 Mar 2017, 13:53:54 UTC

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?


Yes that's a good idea. I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers with long tasks and also the native Linux version, and keep the normal tasks for newcomers or those who crunch many projects.

The long tasks I put in the queue are just a proof of concept to see if it's possible at all.

Does this mean an additional 450 MB of RAM during runtime?


No, the output events are stored in files as they are produced, so the whole result is not held in memory.

Did you increase the needed FLOPS or the deadline, or anything else, so that the BOINC client can recognize that the 10x runtime is normal behaviour?


No, but it's a good point. The deadline is 2 weeks which should still be long enough. I guess the FLOPS are used to provide the estimated time, so these tasks will definitely run over and will stay at 99.999% completed for a long time. In future the FLOPS should be automatically set by the ATLAS systems generating the tasks.
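A rough sketch of why the estimate comes out too short: as far as I understand it, the client derives the estimated runtime from the task's FLOPS estimate divided by the host's speed. All numbers below are hypothetical and chosen only to make the arithmetic easy to follow; only the factor of 10 and the 2-week deadline come from this thread.

```python
SECONDS_PER_HOUR = 3600


def estimated_hours(fpops_est: float, device_flops: float) -> float:
    """Runtime estimate in hours: floating point operations / host speed."""
    return fpops_est / device_flops / SECONDS_PER_HOUR


DEVICE_FLOPS = 4e9            # hypothetical host speed (FLOPS)
FPOPS_EST_REGULAR = 5.76e13   # hypothetical estimate for a 100-event task
DEADLINE_HOURS = 14 * 24      # the 2-week deadline mentioned above

shown = estimated_hours(FPOPS_EST_REGULAR, DEVICE_FLOPS)        # what the client displays
actual = estimated_hours(10 * FPOPS_EST_REGULAR, DEVICE_FLOPS)  # 1000 events, 10x the work

print(f"estimate shown by the client: {shown:.1f} h")
print(f"expected real runtime:        {actual:.1f} h")
print(f"fits within the deadline:     {actual < DEADLINE_HOURS}")
```

With an unchanged FLOPS estimate the displayed estimate stays at the regular-task value, so the task overruns it by roughly a factor of 10 while still fitting comfortably inside the deadline on this hypothetical host.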
ID: 29301
rbpeake
Joined: 17 Sep 04
Posts: 56
Credit: 15,810,379
RAC: 11,775
Message 29303 - Posted: 15 Mar 2017, 15:18:55 UTC - in response to Message 29301.  

I think it's a great idea!
Regards,
Bob P.
ID: 29303
HerveUAE
Joined: 18 Dec 16
Posts: 120
Credit: 7,224,211
RAC: 3,941
Message 29306 - Posted: 15 Mar 2017, 16:50:09 UTC

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?

+1
We are the product of random evolution.
ID: 29306
m
Joined: 6 Sep 08
Posts: 80
Credit: 3,862,029
RAC: 3,005
Message 29313 - Posted: 15 Mar 2017, 20:53:19 UTC - in response to Message 29301.  

....I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers with long tasks and also the native Linux version, and keep the normal tasks for newcomers or those who crunch many projects....

Does this mean that you have/will get the native app to run on Ubuntu or will we need another distribution, if so which one will be best?
ID: 29313
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 404
Credit: 101,367,716
RAC: 97,146
Message 29314 - Posted: 15 Mar 2017, 21:04:19 UTC

How would I know whether a task has hung or is just a long task? I normally abort tasks that have been running for a day and are stuck at 99%.
ID: 29314
rbpeake
Joined: 17 Sep 04
Posts: 56
Credit: 15,810,379
RAC: 11,775
Message 29315 - Posted: 15 Mar 2017, 21:20:23 UTC - in response to Message 29314.  

How would I know whether a task has hung or is just a long task? I normally abort tasks that have been running for a day and are stuck at 99%.

From the initial post:
To know if you are running one of these tasks and that it's not a regular "longrunner" you can check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID: taskID=10959636)" then you got one. The regular tasks have taskID=10947180.
ID: 29315
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 315
Credit: 43,501,608
RAC: 30,641
Message 29318 - Posted: 16 Mar 2017, 6:53:56 UTC

I have got two of these longrunners.

I have checked the WUs in my local queue(s) with this little DOS-Command:

findstr 10959636 \\PHuW10\x$\BOINC_Data\_00\projects\lhcathome.cern.ch_lhcathome\boinc_job_script.*

Replace the path part of the command (shown in bold in the original post) with the path to your own BOINC data directory.

If you get some output you seem to have one or more of these longrunners.
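For those not on Windows, the same check can be sketched in Python. The function name is made up for illustration; the file pattern boinc_job_script.* and the task ID are taken from the command above, and the project directory is whatever your BOINC data directory uses for lhcathome.cern.ch_lhcathome.

```python
from pathlib import Path


def find_long_runners(project_dir, task_id="10959636"):
    """Return the boinc_job_script.* files in project_dir that mention task_id."""
    hits = []
    for script in sorted(Path(project_dir).glob("boinc_job_script.*")):
        # errors="ignore" keeps odd bytes in a script from aborting the scan
        if task_id in script.read_text(errors="ignore"):
            hits.append(script)
    return hits
```

Call find_long_runners() with the path to the project directory and print the result; an empty list means none of the queued job scripts refers to the long-task ID.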


Supporting BOINC, a great concept !
ID: 29318
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 154
Credit: 3,472,372
RAC: 879
Message 29319 - Posted: 16 Mar 2017, 8:16:27 UTC - in response to Message 29313.  

....I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers with long tasks and also the native Linux version, and keep the normal tasks for newcomers or those who crunch many projects....

Does this mean that you have/will get the native app to run on Ubuntu or will we need another distribution, if so which one will be best?


As you saw it was not straightforward to run on Ubuntu out of the box. I am not sure at this moment how much work it would take to make it work. The best supported distributions are flavours of RHEL6 - this is what currently runs inside the ATLAS VM and most sites in the ATLAS grid. I was able to run the native app on CentOS7 no problem, so any recent version of RHEL/Fedora would probably work ok.

Once the migration to LHC is complete I hope to go back to working on the native app.
ID: 29319
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 413
Credit: 3,106,364
RAC: 688
Message 29322 - Posted: 16 Mar 2017, 10:07:34 UTC

I got a long runner too: 2017-03-16 10:54:17 (CET) (7684): Guest Log: Starting ATLAS job. (PandaID=3283615871 taskID=10959636)

Running on 4 cores with 4400MB RAM. Just in time, I decided not to switch back to single core ;)
The 'normal' task with that configuration needed 3h41m wall clock (CPU: 11 hours 38 min 38 sec), so this one should take about 37 hours elapsed on my machine.
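The estimate above is a simple linear scaling, which can be written out as a small sketch. It assumes the runtime grows in proportion to the number of events (100 versus 1000, from the first post), which may not hold exactly because of fixed start-up overhead.

```python
from datetime import timedelta

EVENTS_REGULAR = 100   # events per regular task (from the first post)
EVENTS_LONG = 1000     # events per long task (from the first post)

regular_wall = timedelta(hours=3, minutes=41)              # wall clock, 4 cores
regular_cpu = timedelta(hours=11, minutes=38, seconds=38)  # CPU time

scale = EVENTS_LONG / EVENTS_REGULAR
long_wall = regular_wall * scale
long_cpu = regular_cpu * scale


def hours(td: timedelta) -> float:
    """Convert a timedelta to fractional hours."""
    return td.total_seconds() / 3600


print(f"estimated wall clock: {hours(long_wall):.1f} h")
print(f"estimated CPU time:   {hours(long_cpu):.1f} h")
print(f"average cores busy:   {regular_cpu / regular_wall:.2f} of 4")
```

The wall clock figure comes out at roughly 36.8 hours, in line with the "about 37 hours" quoted above.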
ID: 29322
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 315
Credit: 43,501,608
RAC: 30,641
Message 29324 - Posted: 16 Mar 2017, 10:10:42 UTC

My Longrunners show a 12 - 24 hour runtime, that's good

Thx David for fixing this


Supporting BOINC, a great concept !
ID: 29324
tullio
Joined: 19 Feb 08
Posts: 472
Credit: 2,126,871
RAC: 303
Message 29329 - Posted: 16 Mar 2017, 14:36:09 UTC
Last modified: 16 Mar 2017, 14:36:58 UTC

I got one with 75 hours estimated time on my 2 core Opteron 1210 running Linux. VirtualBox says 3600 MB. Most Atlas tasks validate on my Linux box and are invalid on the Windows 10 PC.
Tullio
ID: 29329
PHILIPPE
Joined: 24 Jul 16
Posts: 77
Credit: 161,617
RAC: 373
Message 29331 - Posted: 16 Mar 2017, 16:48:53 UTC - in response to Message 29329.  

Hi Tullio,
2017-03-16 10:30:54 (6184): Setting Memory Size for VM. (3600MB)
2017-03-16 10:30:54 (6184): Setting CPU Count for VM. (2)

Crystal Pellet and HerveUAE said that for 2-core WUs the VM needs nearly 4400 MB of RAM to run successfully.
You have only allocated 3600 MB. Try increasing the allocated RAM with an app_config.xml; you have enough RAM to do it.
It may fix the issue.
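As a sketch of what such an app_config.xml might contain, the snippet below generates one. The element names follow the general BOINC app_config.xml layout and --memory_size_mb is a vboxwrapper option, but the app name and plan class used here are placeholders (assumptions): check client_state.xml for the values your client actually uses before relying on this.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, ncpus: int, memory_mb: int) -> str:
    """Build app_config.xml text with one <app_version> that sets the VM memory."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Passed to vboxwrapper on its command line to override the VM memory size
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    return ET.tostring(root, encoding="unicode")


# Placeholder app name and plan class; 2 cores and 4400 MB as discussed above.
print(build_app_config("ATLAS", "PLAN_CLASS_FROM_CLIENT_STATE", 2, 4400))
```

The output is a single line of XML; save it as app_config.xml in the project directory and have the client re-read its config files for it to take effect.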
ID: 29331
tullio
Joined: 19 Feb 08
Posts: 472
Credit: 2,126,871
RAC: 303
Message 29332 - Posted: 16 Mar 2017, 16:51:58 UTC - in response to Message 29331.  

Ok, but they validate on the Linux box even if the elapsed time is greater than the CPU time. They don't validate on the Windows 10 PC, which has much more RAM.
Tullio
ID: 29332
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 413
Credit: 3,106,364
RAC: 688
Message 29333 - Posted: 16 Mar 2017, 17:04:48 UTC - in response to Message 29332.  

Ok, but they validate on the Linux box even if the elapsed time is greater than the CPU time. They don't validate on the Windows 10 PC, which has much more RAM.
Tullio

I think there is something wrong with the validator for the Linux tasks.
None of the valid tasks on your Linux box shows the HITS*.root result file of about 60MB for upload.
IMO those tasks can't be valid.
ID: 29333


©2018 CERN