
Very long tasks in the queue



Message boards : ATLAS application : Very long tasks in the queue

David Cameron
Project administrator
Project developer
Project scientist
Send message
Joined: 13 May 14
Posts: 139
Credit: 3,159,531
RAC: 6,484
Message 29276 - Posted: 14 Mar 2017, 21:47:58 UTC

The current ATLAS tasks process 100 events, but as an experiment we have sent some tasks with 1000 events. We would like to see if it's possible to run tasks like these on ATLAS@Home because this is the same number of events each task processes on the ATLAS grid. It would make things a lot easier if the same tasks could run on ATLAS@Home as on the rest of the ATLAS grid.

These tasks will run 10 times longer than the other tasks and will generate an output file 10 times as large (500MB), so this may be an issue for those of you with low upload bandwidth. The advantage is that the initial download of 200MB is the same. Obviously using more cores will be better for these tasks, so they finish in a reasonable time.

To know whether you are running one of these tasks, rather than a regular task that is simply running long, check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID=10959636)" then you got one. The regular tasks have taskID=10947180.
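For anyone who wants to script that check, a minimal sketch (the taskID values come from this post; the helper name and the fabricated demo file are mine, not part of the project):

```shell
# Hypothetical helper: decide from a stderr.txt whether a slot is running
# one of the 1000-event tasks (taskID=10959636) or a regular one.
is_long_task() {
  grep -q "taskID=10959636" "$1"
}

# Demo against a fabricated stderr line rather than a live slot directory:
tmp=$(mktemp)
echo "Starting ATLAS job. (PandaID=xxx taskID=10959636)" > "$tmp"
if is_long_task "$tmp"; then kind="long (1000 events)"; else kind="regular (100 events)"; fi
echo "This slot is running a $kind task"
rm -f "$tmp"
```

Against a real installation you would point `is_long_task` at each `slots/*/stderr.txt` instead of the temp file.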

Please let us know your opinion in general about the length and data in/out requirements of ATLAS tasks. They are usually much shorter than the other vbox LHC projects - is this a good thing or would you prefer more consistency among the projects?

Profile HerveUAE
Avatar
Send message
Joined: 18 Dec 16
Posts: 120
Credit: 6,749,027
RAC: 20,218
Message 29281 - Posted: 15 Mar 2017, 1:59:09 UTC

We would like to see if it's possible to run tasks like these on ATLAS@Home

David, will these tasks be run both at ATLAS@Home and LHC@Home, or only at ATLAS@Home?
____________
We are the product of random evolution.

David Cameron
Project administrator
Project developer
Project scientist
Send message
Joined: 13 May 14
Posts: 139
Credit: 3,159,531
RAC: 6,484
Message 29284 - Posted: 15 Mar 2017, 8:28:56 UTC - in response to Message 29281.

We would like to see if it's possible to run tasks like these on ATLAS@Home

David, will these tasks be run both at ATLAS@Home and LHC@Home, or only at ATLAS@Home?


Only on LHC@Home. I know it's a bit confusing during this transition phase...

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 29286 - Posted: 15 Mar 2017, 9:03:13 UTC

These tasks will run 10 times longer ...

This makes feedback/discussion more difficult, as it takes very long to get a result. It makes sense only if the WUs run nearly 100% reliably.

... output file 10 times as large (500MB) ...

Does this mean an additional 450 MB of RAM during runtime?
A couple of users are already fighting to fulfil the RAM requirements for the current WUs.

The advantage is that the initial download of 200MB is the same.

+1

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?

Profile Yeti
Volunteer moderator
Avatar
Send message
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29297 - Posted: 15 Mar 2017, 11:39:06 UTC - in response to Message 29286.

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?

great idea

suggestions:

* make a second subproject, Atlas1000 or AtlasLongRunners or something similar
* hand these WUs only to PCs that have completed a minimum number of successful WUs
____________


Supporting BOINC, a great concept !

Profile Yeti
Volunteer moderator
Avatar
Send message
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29298 - Posted: 15 Mar 2017, 11:45:10 UTC - in response to Message 29276.

These tasks will run 10 times longer than the other tasks and will generate an output file 10 times as large (500MB), so this may be an issue for those of you with low upload bandwidth. The advantage is that the initial download of 200MB is the same. Obviously using more cores will be better for these tasks, so they finish in a reasonable time.

To know whether you are running one of these tasks, rather than a regular task that is simply running long, check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID=10959636)" then you got one. The regular tasks have taskID=10947180.

Did you increase the needed FLOPS or the deadline, or anything else, so that the BOINC client can recognize that the 10x runtime is normal behaviour?
____________


Supporting BOINC, a great concept !

David Cameron
Project administrator
Project developer
Project scientist
Send message
Joined: 13 May 14
Posts: 139
Credit: 3,159,531
RAC: 6,484
Message 29301 - Posted: 15 Mar 2017, 13:53:54 UTC

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?


Yes that's a good idea. I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers with long tasks and also the native Linux version, and keep the normal tasks for newcomers or those who crunch many projects.

The long tasks I put in the queue are just a proof of concept to see if it's possible at all.

Does this mean additional 450 MB RAM during runtime?


No, the output events are stored in files as they are produced, so the whole result is not in memory.

Did you increase the needed FLOPS or Deadline or anything that the BOINC-Client can recognize that the 10x runtime is normal behaviour ?


No, but it's a good point. The deadline is 2 weeks which should still be long enough. I guess the FLOPS are used to provide the estimated time, so these tasks will definitely run over and will stay at 99.999% completed for a long time. In future the FLOPS should be automatically set by the ATLAS systems generating the tasks.
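To illustrate why the estimate runs over (the figures below are invented for the example, not project values): BOINC derives its initial time estimate from the task's claimed FLOPs and the host benchmark, so a 10x workload shipped with an unchanged FLOPS figure gets a 1x prediction.

```shell
# Illustrative arithmetic only: both FLOPS figures are made up.
rsc_fpops_est=600000000000000   # claimed work: 6e14 FLOPs (unchanged from the 100-event tasks)
host_flops=5000000000           # host benchmark: 5 GFLOPS
est_seconds=$((rsc_fpops_est / host_flops))
echo "Client estimate: ${est_seconds}s; a 10x task actually needs ~$((est_seconds * 10))s"
```

With the estimate exhausted after the first tenth of the work, the client pins the progress display near 100% for the remaining nine tenths, which matches the "99.999% completed for a long time" behaviour described above.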

Profile rbpeake
Send message
Joined: 17 Sep 04
Posts: 55
Credit: 15,620,725
RAC: 3
Message 29303 - Posted: 15 Mar 2017, 15:18:55 UTC - in response to Message 29301.

I think it's a great idea!
____________
Regards,
Bob P.

Profile HerveUAE
Avatar
Send message
Joined: 18 Dec 16
Posts: 120
Credit: 6,749,027
RAC: 20,218
Message 29306 - Posted: 15 Mar 2017, 16:50:09 UTC

What about the idea to separate those WUs (own subproject, own plan class, ...) and let high potential users opt in?

+1
____________
We are the product of random evolution.

m
Send message
Joined: 6 Sep 08
Posts: 72
Credit: 3,626,193
RAC: 2,361
Message 29313 - Posted: 15 Mar 2017, 20:53:19 UTC - in response to Message 29301.

....I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers with long tasks and also the native Linux version, and keep the normal tasks for newcomers or those who crunch many projects....

Does this mean that you have got (or will get) the native app to run on Ubuntu, or will we need another distribution? If so, which one would be best?

Toby Broom
Volunteer moderator
Send message
Joined: 27 Sep 08
Posts: 376
Credit: 88,664,173
RAC: 174,222
Message 29314 - Posted: 15 Mar 2017, 21:04:19 UTC

How would I know whether it's hung or a long task? I normally abort them if they are at 1 day and 99%.

Profile rbpeake
Send message
Joined: 17 Sep 04
Posts: 55
Credit: 15,620,725
RAC: 3
Message 29315 - Posted: 15 Mar 2017, 21:20:23 UTC - in response to Message 29314.

How would I know whether it's hung or a long task? I normally abort them if they are at 1 day and 99%.

From the initial post:
To know whether you are running one of these tasks, rather than a regular task that is simply running long, check the stderr.txt in the slots directory - if it shows "Starting ATLAS job. (PandaID=xxx taskID=10959636)" then you got one. The regular tasks have taskID=10947180.

Profile Yeti
Volunteer moderator
Avatar
Send message
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29318 - Posted: 16 Mar 2017, 6:53:56 UTC

I have got two of these longrunners.

I have checked the WUs in my local queue(s) with this little DOS-Command:

findstr 10959636 \\PHuW10\x$\BOINC_Data\_00\projects\lhcathome.cern.ch_lhcathome\boinc_job_script.*

replace the first part of the path (\\PHuW10\x$\BOINC_Data\_00) with the path to your own BOINC data directory

If you get some output you seem to have one or more of these longrunners.
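A rough Linux equivalent of that findstr check, as a sketch (the helper name is mine, and you would pass your own BOINC project directory, e.g. something like /var/lib/boinc-client/projects/lhcathome.cern.ch_lhcathome):

```shell
# Hypothetical helper mirroring the findstr one-liner above: list queued
# boinc_job_script files that mention the long-task ID 10959636.
check_queue() {
  grep -l "10959636" "$1"/boinc_job_script.* 2>/dev/null
}

# Demo against a throwaway directory instead of a live BOINC install:
demo=$(mktemp -d)
echo "--taskID 10959636" > "$demo/boinc_job_script.1"
matches=$(check_queue "$demo" | wc -l)
echo "$matches long-runner job script(s) found"
rm -rf "$demo"
```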
____________


Supporting BOINC, a great concept !

David Cameron
Project administrator
Project developer
Project scientist
Send message
Joined: 13 May 14
Posts: 139
Credit: 3,159,531
RAC: 6,484
Message 29319 - Posted: 16 Mar 2017, 8:16:27 UTC - in response to Message 29313.

....I was thinking of making something like an "ATLAS pro" app for serious ATLAS crunchers with long tasks and also the native Linux version, and keep the normal tasks for newcomers or those who crunch many projects....

Does this mean that you have/will get the native app to run on Ubuntu or will we need another distribution, if so which one will be best?


As you saw it was not straightforward to run on Ubuntu out of the box. I am not sure at this moment how much work it would take to make it work. The best supported distributions are flavours of RHEL6 - this is what currently runs inside the ATLAS VM and most sites in the ATLAS grid. I was able to run the native app on CentOS7 no problem, so any recent version of RHEL/Fedora would probably work ok.

Once the migration to LHC@Home is complete I hope to go back to working on the native app.

Crystal Pellet
Volunteer moderator
Volunteer tester
Send message
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,809
RAC: 2,011
Message 29322 - Posted: 16 Mar 2017, 10:07:34 UTC

I got a long runner too: 2017-03-16 10:54:17 (CET) (7684): Guest Log: Starting ATLAS job. (PandaID=3283615871 taskID=10959636)

Running on 4 cores with 4400MB RAM. Just in time I decided not to switch back to single core ;)
The 'normal' task with that configuration needed 3h41m wall clock (CPU: 11 hours 38 min 38 sec), so this one should take about 37 hours elapsed on my machine.

Profile Yeti
Volunteer moderator
Avatar
Send message
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29324 - Posted: 16 Mar 2017, 10:10:42 UTC

My Longrunners show a 12 - 24 hour runtime, that's good

Thx David for fixing this
____________


Supporting BOINC, a great concept !

tullio
Send message
Joined: 19 Feb 08
Posts: 449
Credit: 2,076,809
RAC: 332
Message 29329 - Posted: 16 Mar 2017, 14:36:09 UTC
Last modified: 16 Mar 2017, 14:36:58 UTC

I got one with 75 hours estimated time on my 2 core Opteron 1210 running Linux. VirtualBox says 3600 MB. Most Atlas tasks validate on my Linux box and are invalid on the Windows 10 PC.
Tullio

PHILIPPE
Send message
Joined: 24 Jul 16
Posts: 65
Credit: 128,142
RAC: 434
Message 29331 - Posted: 16 Mar 2017, 16:48:53 UTC - in response to Message 29329.

Hi , Tullio ,

2017-03-16 10:30:54 (6184): Setting Memory Size for VM. (3600MB)
2017-03-16 10:30:54 (6184): Setting CPU Count for VM. (2)

Crystal Pellet and HerveUAE said that for 2-core WUs the VM needs nearly 4400 MB of RAM to run successfully.
You have only allocated 3600 MB. Try increasing the allocated RAM with an app_config.xml; you have enough RAM to do it.
It may fix the issue.
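A sketch of such an app_config.xml, placed in the lhcathome.cern.ch_lhcathome project directory (the app name, plan-class name, and the --memory_size_mb vboxwrapper option here are my assumptions; check the entries in your own client_state.xml before copying this):

```xml
<!-- Sketch only: verify app_name and plan_class against your client_state.xml -->
<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2</avg_ncpus>
    <cmdline>--memory_size_mb 4400</cmdline>
  </app_version>
</app_config>
```

After saving the file, re-read the config files from the BOINC Manager (or restart the client) for it to take effect.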

tullio
Send message
Joined: 19 Feb 08
Posts: 449
Credit: 2,076,809
RAC: 332
Message 29332 - Posted: 16 Mar 2017, 16:51:58 UTC - in response to Message 29331.

Ok, but they validate on the Linux box even if the elapsed time is greater than the CPU time. They don't validate on the Windows 10 PC, which has much more RAM.
Tullio

Crystal Pellet
Volunteer moderator
Volunteer tester
Send message
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,809
RAC: 2,011
Message 29333 - Posted: 16 Mar 2017, 17:04:48 UTC - in response to Message 29332.

Ok, but they validate on the Linux box even if the elapsed time is greater than the CPU time. They don't validate on the Windows 10 PC, which has much more RAM.
Tullio

I think there is something wrong with the validator for the Linux tasks.
None of your valid tasks on your Linux box shows the HITS*.root result file of about 60MB for upload.
IMO those tasks can't be valid.

