Message boards :
ATLAS application :
Very long tasks in the queue
Joined: 13 May 14 Posts: 379 Credit: 15,284,486 RAC: 7,095
The current ATLAS tasks process 100 events, but as an experiment we have sent some tasks with 1000 events. We would like to see if it's possible to run tasks like these on ATLAS@Home, because this is the same number of events each task processes on the ATLAS grid. It would make things a lot easier if the same tasks could run on ATLAS@Home as on the rest of the ATLAS grid.

These tasks will run 10 times longer than the other tasks and will generate an output file 10 times as large (500MB), so this may be an issue for those of you with low upload bandwidth. The advantage is that the initial download of 200MB is the same. Obviously using more cores will be better for these tasks, so they finish in a reasonable time.

To know whether you are running one of these tasks and not a regular "longrunner", check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID=10959636)" then you have one. The regular tasks have taskID=10947180.

Please let us know your opinion in general about the length and data in/out requirements of ATLAS tasks. They are usually much shorter than those of the other vbox LHC projects - is this a good thing, or would you prefer more consistency among the projects?
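The stderr.txt check described above can be scripted. A minimal sketch, assuming a typical Linux BOINC data directory (`/var/lib/boinc-client` is an assumption - adjust the path for your installation):

```shell
# Search all running slots' stderr.txt for the long-task ID (10959636).
# BOINC_DATA is an assumed default path; override it for your setup.
BOINC_DATA="${BOINC_DATA:-/var/lib/boinc-client}"
if grep -H "taskID=10959636" "$BOINC_DATA"/slots/*/stderr.txt 2>/dev/null; then
    echo "long-runner task found"
else
    echo "no long-runner task running"
fi
```

If a long task is running, the matching log line is printed together with the slot it lives in.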
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0
> We would like to see if it's possible to run tasks like these on ATLAS@Home

David, will these tasks be run both at ATLAS@Home and LHC@Home, or only at ATLAS@Home?

We are the product of random evolution.
Joined: 13 May 14 Posts: 379 Credit: 15,284,486 RAC: 7,095
> We would like to see if it's possible to run tasks like these on ATLAS@Home

Only on LHC@Home. I know it's a bit confusing during this transition phase...
Joined: 15 Jun 08 Posts: 2141 Credit: 175,471,418 RAC: 103,848
> These tasks will run 10 times longer ...

This makes feedback/discussion more difficult, as it takes very long to get a result. It makes sense if the WUs run nearly 100% reliably.

> ... output file 10 times as large (500MB) ...

Does this mean an additional 450 MB of RAM during runtime? A couple of users are already fighting to fulfil the RAM requirements for the current WUs.

> The advantage is that the initial download of 200MB is the same.

+1

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high-potential users opt in?
Joined: 2 Sep 04 Posts: 444 Credit: 165,153,792 RAC: 149,911
> What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?

Great idea. Suggestions:

* make a second subproject, Atlas1000 or AtlasLongRunners or something similar
* hand these WUs only to PCs that have completed a minimum number of successful WUs

Supporting BOINC, a great concept!
Joined: 2 Sep 04 Posts: 444 Credit: 165,153,792 RAC: 149,911
> These tasks will run 10 times longer than the other tasks and will generate an output file 10 times as large (500MB), so this may be an issue for those of you with low upload bandwidth. The advantage is that the initial download of 200MB is the same. Obviously using more cores will be better for these tasks, so they finish in a reasonable time.

Did you increase the needed FLOPS, the deadline, or anything else so that the BOINC client can recognize that the 10x runtime is normal behaviour?

Supporting BOINC, a great concept!
Joined: 13 May 14 Posts: 379 Credit: 15,284,486 RAC: 7,095
> What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?

Yes, that's a good idea. I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers, with long tasks and also the native Linux version, and keeping the normal tasks for newcomers or those who crunch many projects. The long tasks I put in the queue are just a proof of concept to see if it's possible at all.

> Does this mean additional 450 MB RAM during runtime?

No, the output events are stored in files as they are produced, so the whole result is not in memory.

> Did you increase the needed FLOPS or Deadline or anything that the BOINC-Client can recognize that the 10x runtime is normal behaviour?

No, but it's a good point. The deadline is 2 weeks, which should still be long enough. I guess the FLOPS are used to provide the estimated time, so these tasks will definitely run over and will stay at 99.999% completed for a long time. In future the FLOPS should be automatically set by the ATLAS systems generating the tasks.
Joined: 17 Sep 04 Posts: 88 Credit: 27,800,590 RAC: 10,313
I think it's a great idea! Regards, Bob P. |
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0
> What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?

+1

We are the product of random evolution.
Joined: 6 Sep 08 Posts: 112 Credit: 9,254,071 RAC: 2,043
> ....I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers with long tasks and also the native Linux version, and keep the normal tasks for newcomers or those who crunch many projects....

Does this mean that you have got (or will get) the native app to run on Ubuntu, or will we need another distribution? If so, which one would be best?
Joined: 27 Sep 08 Posts: 744 Credit: 557,096,977 RAC: 309,994
How would I know whether it hung or is a long task? I normally abort them if they are at 1 day and 99%.
Joined: 17 Sep 04 Posts: 88 Credit: 27,800,590 RAC: 10,313
> How would I know if it hung or long task, I normally abort them if they are 1 day and 99%

From the initial post:

> To know whether you are running one of these tasks and not a regular "longrunner", check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID=10959636)" then you have one. The regular tasks have taskID=10947180.
Joined: 2 Sep 04 Posts: 444 Credit: 165,153,792 RAC: 149,911
I have got two of these longrunners. I have checked the WUs in my local queue(s) with this little DOS command:

```
findstr 10959636 \\PHuW10\x$\BOINC_Data\_00\projects\lhcathome.cern.ch_lhcathome\boinc_job_script.*
```

Replace the path with the path to your own BOINC data directory. If you get some output, you have one or more of these longrunners.

Supporting BOINC, a great concept!
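For Linux hosts, a rough equivalent of the findstr check is a sketch like this (the data directory path is an assumption - adjust it to your installation; the project directory name matches the LHC@Home project URL):

```shell
# List queued job scripts that mention the long-runner task ID.
# /var/lib/boinc-client is an assumed default; change it as needed.
PROJ_DIR="${PROJ_DIR:-/var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome}"
if grep -l 10959636 "$PROJ_DIR"/boinc_job_script.* 2>/dev/null; then
    echo "one or more longrunners queued"
else
    echo "no longrunners in the local queue"
fi
```

`grep -l` prints only the names of the matching job-script files, one per queued long task.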
Joined: 13 May 14 Posts: 379 Credit: 15,284,486 RAC: 7,095
> ....I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers with long tasks and also the native Linux version, and keep the normal tasks for newcomers or those who crunch many projects....

As you saw, it was not straightforward to run on Ubuntu out of the box, and I am not sure at this moment how much work it would take to make it work. The best supported distributions are flavours of RHEL6 - this is what currently runs inside the ATLAS VM and at most sites on the ATLAS grid. I was able to run the native app on CentOS7 with no problem, so any recent version of RHEL/Fedora would probably work ok. Once the migration to LHC is complete I hope to go back to working on the native app.
Joined: 14 Jan 10 Posts: 1155 Credit: 7,103,928 RAC: 1,140
I got a long runner too:

> 2017-03-16 10:54:17 (CET) (7684): Guest Log: Starting ATLAS job. (PandaID=3283615871 taskID=10959636)

Running on 4 cores with 4400MB RAM. Just in time I decided not to switch back to single core ;)

The 'normal' task with that configuration needed 3h41m wall clock (CPU: 11 hours 38 min 38 sec), so this one should take about 37 hours elapsed on my machine.
Joined: 19 Feb 08 Posts: 707 Credit: 4,335,771 RAC: 15
I got one with 75 hours estimated time on my 2-core Opteron 1210 running Linux. VirtualBox says 3600 MB. Most ATLAS tasks validate on my Linux box and are invalid on the Windows 10 PC. Tullio
Joined: 24 Jul 16 Posts: 88 Credit: 239,917 RAC: 0
Hi Tullio,

> 2017-03-16 10:30:54 (6184): Setting Memory Size for VM. (3600MB)

crystal pellet and HerveUAE said that for 2-core WUs the VM needs nearly 4400 MB of RAM to run successfully, and you have only allocated 3600 MB. Try increasing the allocated RAM with an app_config.xml - you have enough RAM to do it. It may fix the issue.
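A minimal app_config.xml sketch for raising the VM memory. The `app_name` and `plan_class` values shown here are assumptions - check the `<app>` and `<app_version>` entries in your client_state.xml for the exact names on your host - and the file goes in the lhcathome.cern.ch_lhcathome project directory:

```xml
<!-- Hypothetical example: request 2 CPUs and 4400 MB for the ATLAS VM. -->
<!-- app_name and plan_class must match your client_state.xml entries. -->
<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2</avg_ncpus>
    <cmdline>--nthreads 2 --memory_size_mb 4400</cmdline>
  </app_version>
</app_config>
```

After saving the file, use "Options / Read config files" in the BOINC Manager (or restart the client) so the new limits take effect for newly started tasks.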
Joined: 19 Feb 08 Posts: 707 Credit: 4,335,771 RAC: 15
Ok, but they validate on the Linux box even if the elapsed time is greater than the CPU time. They don't validate on the Windows 10 PC, which has much more RAM. Tullio
Joined: 14 Jan 10 Posts: 1155 Credit: 7,103,928 RAC: 1,140
> Ok, but they validate on the Linux box even if the elapsed time is greater than the CPU time. They don't validate on the Windows 10 PC, which has much more RAM.

I think there is something wrong with the validator for the Linux tasks. None of your valid tasks on your Linux box displays the HITS*.root result file of about 60MB for upload. IMO those tasks can't be valid.
©2023 CERN