Deadly long ATLAS tasks
Author | Message |
---|---|
Send message Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0 |
Hi, I have received a couple of ATLAS tasks that ran far longer than ATLAS tasks usually do, i.e. 2 days instead of a few hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=188619213 The same seems to be happening with an ATLAS task from lhcathome-dev. In general, native ATLAS tasks run fine on my Ubuntu machine. I did not have a look at the logs, but maybe you can find something interesting... Cheers, A. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,940,891 RAC: 22,334 |
I had very long ATLAS tasks several days ago, they finally failed :-( See my posting here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4620&postid=35085#35085 |
Send message Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0 |
Hello, I got another couple of these tasks with ATLAS version 2.54 (native_mt), e.g.: https://lhcathome.cern.ch/lhcathome/result.php?resultid=191180856 The expected duration was a few hours, but it has now been running for a couple of days. The second one comes from lhcathome-dev, with ATLAS version 0.50 (native_mt). Could someone tell me whether such long ATLAS jobs are normal or expected? If yes, then there is something wrong with the expected running time; if not, do you have a fix? These long tasks take all the available CPU slots (not many, as I only contribute my work desktop computer, mostly to monitor things from the volunteer's point of view) for entire days, preventing other tasks from being processed. Are these tasks also the reason behind the drop in the GigaFLOPs reported by the server status page? 1-2 weeks ago we were at >80, but now we are at ~50-60... Thanks, Cheers, A. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
I have Linux-native tasks that take 40 hours on a single CPU and 36 hours on two CPUs. Yes, they are heavy. You get more than 1k Cobblestones. https://lhcathome.cern.ch/lhcathome/img/progresschart.png Edit: Of course, only 200 collisions! |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
A few days ago I had a 1-core longrunner with the following values: WallTime=144152.41s KernelTime=298.04s UserTime=143632.60s CPUUsage=99% Monitoring tip: Open a console window, cd to your BOINC client's base directory and run the following one-liner: watch -n10 "find ./slots/ \( -name \"log.EVNTtoHITS\" -o -name \"AthenaMP.log\" \) |sort |xargs -I {} -n1 sh -c \"egrep 'INFO.*Event nr. ' {} |tail -n1; echo -e '\n'\"" |
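For readers who prefer a script to the shell one-liner, the same idea can be sketched in Python. This is only an illustration: the log file names and the `INFO.*Event nr. ` pattern are taken from the one-liner above, while the function name `last_event_lines` and the exact layout of the log lines are assumptions.

```python
import re
from pathlib import Path

# Log file names taken from the one-liner above.
LOG_NAMES = {"log.EVNTtoHITS", "AthenaMP.log"}
# Same pattern the one-liner passes to egrep.
EVENT_LINE = re.compile(r"INFO.*Event nr. ")


def last_event_lines(slots_dir):
    """Return {log path: last 'Event nr.' line} for every matching log file."""
    root = Path(slots_dir)
    results = {}
    if not root.is_dir():
        return results
    for path in sorted(root.rglob("*")):
        if path.name not in LOG_NAMES or not path.is_file():
            continue
        last = None
        with path.open(errors="replace") as handle:
            for line in handle:
                if EVENT_LINE.search(line):
                    last = line.rstrip("\n")
        if last is not None:
            results[str(path)] = last
    return results


if __name__ == "__main__":
    # Run from the BOINC client's base directory, like the one-liner.
    for log_path, line in last_event_lines("./slots").items():
        print(log_path)
        print(line)
        print()
```

Unlike `watch -n10`, this prints once and exits; it would have to be rerun (or wrapped in a loop) to follow the progress of a task.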
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Indeed the current tasks are rather heavier than the previous ones - on average each event takes twice or even three times as long. Some may call them "deadly", others might appreciate the extra credit :) In general if the task is using close to 100% CPU it is still good - this is especially true for native linux tasks where we don't have VirtualBox causing trouble. |
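As a rough check of the "close to 100% CPU" rule, the values reported earlier in the thread for the 1-core longrunner can be plugged into a simple ratio. This is a sketch only: it assumes CPU usage is (kernel time + user time) divided by wall time times the number of cores, which may not be exactly how the reported `CPUUsage` value is computed, and the helper name `cpu_usage` is made up for the example.

```python
def cpu_usage(wall_s, kernel_s, user_s, ncpus=1):
    """Fraction of the available CPU time that the task actually used.

    Assumed formula: (kernel + user) / (wall * ncpus).
    """
    return (kernel_s + user_s) / (wall_s * ncpus)


# Values quoted in the thread for the 1-core longrunner.
usage = cpu_usage(wall_s=144152.41, kernel_s=298.04, user_s=143632.60)
print(f"wall time: {144152.41 / 3600:.1f} h, CPU usage: {usage:.1%}")
```

Under that assumption the task ran for about 40 hours at roughly 99.8% CPU, which agrees with the reported `CPUUsage=99%` and suggests a heavy but healthy task rather than a stuck one.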
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
"... others might appreciate the extra credit :)" What looks like extra credit now will turn into extra-low credit once we get work with shorter runtimes. This is caused by the way the credit is calculated. In the end it will average out. |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456 |
"In general if the task is using close to 100% CPU it is still good - this is especially true for native linux tasks where we don't have VirtualBox causing trouble." 👍 |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,940,891 RAC: 22,334 |
"Indeed the current tasks are rather heavier than the previous ones - on average each event takes twice or even three times as long." I took this as a reason for trying out 4-core tasks, although I remember having seen in some charts and comments that 4-core is (markedly?) less efficient than 1- or 2-core. So I'll see. |
Send message Joined: 22 Aug 06 Posts: 22 Credit: 466,060 RAC: 0 |
Just FYI...It is taking 90 seconds of run time at 100% to accomplish 1 second of estimated time. I have roughly 660 minutes left, so that comes out about 16-17 hours at a CPU set at 100% usage to complete this task. WOW! |
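The estimate above can be reproduced with a short calculation. This is only a sketch of the arithmetic, using the 90:1 ratio from the post; the helper name `real_hours` is made up for the example.

```python
RATIO = 90  # seconds of run time per second of estimated time (from the post)


def real_hours(estimated_seconds_left, ratio=RATIO):
    """Convert remaining estimated time into expected real run time, in hours."""
    return estimated_seconds_left * ratio / 3600


print(real_hours(660))       # 660 seconds of estimate left
print(real_hours(660 * 60))  # 660 minutes of estimate left
```

At 90:1, the quoted 16-17 hours corresponds to about 660 seconds of remaining estimate (16.5 hours); 660 minutes at the same ratio would come to 990 hours.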
Send message Joined: 22 Aug 06 Posts: 22 Credit: 466,060 RAC: 0 |
Well, so much for my predictions. The task suddenly completed and was verified. Interestingly, it ran 3X the normal time, but I got less credit. J |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
Hi Jim, If you look into your task logs https://lhcathome.cern.ch/lhcathome/result.php?resultid=191413557 https://lhcathome.cern.ch/lhcathome/result.php?resultid=191541487 you may notice lots of lines that show your computer is struggling very hard to run ATLAS. Examples: 2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.trfExe.validate 2018-05-18 20:49:36,502 ERROR Validation of return code failed: EVNTtoHITS got a SIGKILL signal (exit code 137) (Error code 65) 2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.trfExe.validate 2018-05-18 20:49:36,517 INFO Scanning logfile log.EVNTtoHITS for errors 2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.transform.execute 2018-05-18 20:49:36,792 CRITICAL Transform executor raised TransformValidationException: EVNTtoHITS got a SIGKILL signal (exit code 137) 2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.transform.execute 2018-05-18 20:49:40,156 WARNING Transform now exiting early with exit code 65 (EVNTtoHITS got a SIGKILL signal (exit code 137)) 2018-05-19 14:41:54 (13200): VM state change detected. (old = 'running', new = 'paused') 2018-05-19 14:42:06 (13200): VM state change detected. (old = 'paused', new = 'running') 2018-05-19 14:42:14 (13200): VM state change detected. (old = 'running', new = 'paused') 2018-05-19 14:42:25 (13200): VM state change detected. (old = 'paused', new = 'running') 2018-05-19 14:47:28 (13200): VM state change detected. (old = 'running', new = 'paused') The reason is that the recent tasks - when you run them as 1-core or 2-core - need much more RAM than is configured by the project server. Your logs show that you run them as 1-core (with 3500 MB RAM).
As your host has enough RAM you may consider using an app_config.xml like this: <app_config> <app> <name>ATLAS</name> <max_concurrent>2</max_concurrent> <report_results_immediately/> </app> <app_version> <app_name>ATLAS</app_name> <plan_class>vbox64_mt_mcore_atlas</plan_class> <avg_ncpus>1.0</avg_ncpus> <cmdline>--nthreads 1 --memory_size_mb 4800</cmdline> </app_version> </app_config> The settings become active with the 1st fresh task that starts after you "reload config files" in your BOINC manager. |
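Spotting these symptoms by eye in a long task log is tedious, so a small script can count them instead. The sketch below is an illustration only: the two patterns are copied from the log excerpts quoted above, while the function name `summarize` and the choice of what to count are assumptions. Exit code 137 is 128 + 9, i.e. the process was terminated by signal 9 (SIGKILL), which would be consistent with the process being killed for lack of memory, although the log lines alone do not prove that.

```python
import re

# Patterns copied from the log excerpts quoted above.
SIGKILL = re.compile(r"got a SIGKILL signal \(exit code (\d+)\)")
VM_CHANGE = re.compile(r"VM state change detected\. \(old = '(\w+)', new = '(\w+)'\)")


def summarize(log_text):
    """Count lines reporting a SIGKILL and the number of running -> paused flips."""
    sigkill_lines = 0
    pauses = 0
    for line in log_text.splitlines():
        if SIGKILL.search(line):
            sigkill_lines += 1
        match = VM_CHANGE.search(line)
        if match and match.group(1) == "running" and match.group(2) == "paused":
            pauses += 1
    return {"sigkill_lines": sigkill_lines, "pauses": pauses}


SAMPLE = """\
2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.trfExe.validate 2018-05-18 20:49:36,502 ERROR Validation of return code failed: EVNTtoHITS got a SIGKILL signal (exit code 137) (Error code 65)
2018-05-19 14:41:54 (13200): VM state change detected. (old = 'running', new = 'paused')
2018-05-19 14:42:06 (13200): VM state change detected. (old = 'paused', new = 'running')
2018-05-19 14:42:14 (13200): VM state change detected. (old = 'running', new = 'paused')
"""

print(summarize(SAMPLE))
```

A task log with many pauses within a few minutes, or any SIGKILL line, is a reasonable cue to check the memory setting before blaming the task itself.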
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
On my Linux laptop I have a one core Atlas task that has been running for 6 days and 22 hours. My last Atlas task on the SUN M20 Linux workstation has completed with a HITS file, so it is a good task. All Windows 10 Atlas tasks (2 CPUs) complete in about 20 minutes and validate, but they produce no HITS files. Tullio |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
Your task logs show a couple of different error and warning messages for all of your hosts. It seems like you configured your #cores and your VM's RAM setting only via the project's web preferences. This leads (most likely) to a RAM setting that is too low to run the recent ATLAS tasks. You may use the following app_config.xml files to solve the problems. For these two hosts: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10517701 https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10510582 <app_config> <app> <name>ATLAS</name> <max_concurrent>1</max_concurrent> <report_results_immediately/> </app> <app_version> <app_name>ATLAS</app_name> <plan_class>vbox64_mt_mcore_atlas</plan_class> <avg_ncpus>1.0</avg_ncpus> <cmdline>--nthreads 1 --memory_size_mb 4800</cmdline> </app_version> </app_config> For this host: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10407309 <app_config> <app> <name>ATLAS</name> <max_concurrent>1</max_concurrent> <report_results_immediately/> </app> <app_version> <app_name>ATLAS</app_name> <plan_class>vbox64_mt_mcore_atlas</plan_class> <avg_ncpus>2.0</avg_ncpus> <cmdline>--nthreads 2 --memory_size_mb 4800</cmdline> </app_version> </app_config> In addition your Windows host shows a rather uncommon error message: 2018-05-24 20:40:00 (568): Error creating VirtualBox instance! rc = 0x80004002 This may point to a problem with your VirtualBox installation. I'm not sure how to solve this - other volunteers may - but you may try a reinstall of VirtualBox. Also be aware that David Cameron announced today that the ATLAS task queue may be dry during this weekend: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4331&postid=35370 |
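The app_config.xml files above differ only in the thread count and the concurrency limit, so they can be generated from a few parameters. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module: the element names and the `--nthreads` / `--memory_size_mb` options are copied from the configs above, while the function name `build_app_config` and its parameters are made up for the example. `ET.indent` requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_app_config(threads, memory_mb, max_concurrent=1):
    """Return app_config.xml text mirroring the ATLAS examples in this thread."""
    root = ET.Element("app_config")

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = "ATLAS"
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.SubElement(app, "report_results_immediately")

    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = "ATLAS"
    ET.SubElement(version, "plan_class").text = "vbox64_mt_mcore_atlas"
    ET.SubElement(version, "avg_ncpus").text = f"{float(threads):.1f}"
    ET.SubElement(version, "cmdline").text = (
        f"--nthreads {threads} --memory_size_mb {memory_mb}"
    )

    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


# Example: the 2-core variant with 4800 MB, as suggested above.
print(build_app_config(threads=2, memory_mb=4800))
```

The output would still need to be saved as app_config.xml in the LHC@home project directory of the BOINC client and activated by reloading the config files, as described earlier in the thread.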
Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
Thanks. I have installed VBox 5.2.12 on the SUN Linux WS and the Windows 10 PC. I am waiting for the Linux laptop to finish its task to do the same. I am against all app_config.xml files. Tasks should run out of the box. Tullio |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609 |
"... I am against all app_config.xml files. Tasks should run out of the box." Using an app_config.xml in this case is like helping a small child take its first steps. The difference is that a child will learn to walk with or without your help. |
©2024 CERN