Thread 'Deadly long ATLAS tasks'

Author	Message
Alessio Mereghetti Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0	Message 35163 - Posted: 4 May 2018, 12:53:32 UTC Last modified: 4 May 2018, 12:54:28 UTC
Hi, I have received a couple of ATLAS tasks that ran far longer than ATLAS tasks usually do, i.e. 2 days instead of a few hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=188619213 The same seems to be happening with an ATLAS task from lhcathome-dev. In general, native ATLAS tasks run fine on my Ubuntu machine. I have not looked at the logs, but maybe you can find something interesting... Cheers, A.

Erich56 Joined: 18 Dec 15 Posts: 1957 Credit: 158,705,474 RAC: 56,709	Message 35165 - Posted: 4 May 2018, 13:06:10 UTC
I had very long ATLAS tasks several days ago; they finally failed :-( See my posting here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4620&postid=35085#35085

Alessio Mereghetti Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0	Message 35276 - Posted: 16 May 2018, 7:08:14 UTC - in response to Message 35165.
Hello, I got another couple of these tasks with ATLAS version 2.54 (native_mt), e.g.: https://lhcathome.cern.ch/lhcathome/result.php?resultid=191180856 The expected duration was a few hours, but it has now been running for a couple of days. The second one comes from lhcathome-dev, with ATLAS version 0.50 (native_mt). Could someone tell me whether such long ATLAS jobs are normal or expected? If yes, then there is something wrong with the expected running time; if not, do you have a fix? These long tasks take all the available CPU slots for entire days, preventing other tasks from being processed (there are not many slots, as I contribute my work desktop computer and use it mostly to monitor things from the volunteer's point of view). Are these tasks also the reason behind the drop in the GigaFLOPs reported by the server status page? 1-2 weeks ago we were at >80, but now we are at ~50-60... Thanks, Cheers, A.

maeax Joined: 2 May 07 Posts: 2285 Credit: 178,823,324 RAC: 942	Message 35277 - Posted: 16 May 2018, 7:52:45 UTC Last modified: 16 May 2018, 7:56:10 UTC
I have Linux-native tasks that take 40 hours with a single CPU and 36 hours with two CPUs. Yes, they are heavy. You get more than 1k Cobblestones. https://lhcathome.cern.ch/lhcathome/img/progresschart.png Edit: Of course only 200 collisions!

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,073,265 RAC: 9,612	Message 35278 - Posted: 16 May 2018, 8:15:10 UTC
Some days ago I had a 1-core longrunner with the following values:
[pre]WallTime=144152.41s
KernelTime=298.04s
UserTime=143632.60s
CPUUsage=99%[/pre]
Monitoring tip: Open a console window, cd to your BOINC client's base directory and run the following one-liner:
[pre]watch -n10 "find ./slots/ \( -name \"log.EVNTtoHITS\" -o -name \"AthenaMP.log\" \) | sort | xargs -I {} -n1 sh -c \"egrep 'INFO.*Event nr. ' {} | tail -n1; echo -e '\n'\""[/pre]
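For volunteers who would rather not run `watch`, the same check can be sketched in Python. This is only an illustration: the two log file names and the `INFO.*Event nr. ` pattern are taken from the one-liner above, while the function name `last_event_lines` and the default `./slots` path are assumptions made for this example, not part of BOINC or ATLAS.

```python
import re
from pathlib import Path

# Log file names taken from the one-liner in the post.
LOG_NAMES = ("log.EVNTtoHITS", "AthenaMP.log")
# Same pattern the egrep call uses: an INFO line that reports an event number.
EVENT_LINE = re.compile(r"INFO.*Event nr. ")


def last_event_lines(slots_dir):
    """Map each ATLAS log file under slots_dir to its last 'Event nr.' line.

    Hypothetical helper for illustration; a log without any matching
    line maps to None.
    """
    result = {}
    for path in sorted(Path(slots_dir).rglob("*")):
        if path.name not in LOG_NAMES or not path.is_file():
            continue
        last = None
        with path.open(errors="replace") as handle:
            for line in handle:
                if EVENT_LINE.search(line):
                    last = line.rstrip("\n")
        result[str(path)] = last
    return result


if __name__ == "__main__":
    # Run from the BOINC client's base directory, as the post suggests.
    for log_path, line in last_event_lines("./slots").items():
        print(log_path, "->", line or "(no event line yet)")
```

Unlike `watch -n10`, this prints once and exits; rerun it (or wrap it in a loop) to follow progress.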

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 35293 - Posted: 17 May 2018, 12:59:32 UTC
Indeed the current tasks are rather heavier than the previous ones - on average each event takes twice or even three times as long. Some may call them "deadly", others might appreciate the extra credit :) In general, if the task is using close to 100% CPU it is still good - this is especially true for native Linux tasks where we don't have VirtualBox causing trouble.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,073,265 RAC: 9,612	Message 35294 - Posted: 17 May 2018, 13:17:52 UTC - in response to Message 35293.
"... others might appreciate the extra credit :)"
What looks like extra credit now will turn into extra low credit once we get work with shorter runtimes. This is caused by the way the credit is calculated. In the end it will average out.

maeax Joined: 2 May 07 Posts: 2285 Credit: 178,823,324 RAC: 942	Message 35295 - Posted: 17 May 2018, 13:50:12 UTC - in response to Message 35293.
"In general if the task is using close to 100% CPU it is still good - this is especially true for native linux tasks where we don't have VirtualBox causing trouble."
👍

Erich56 Joined: 18 Dec 15 Posts: 1957 Credit: 158,705,474 RAC: 56,709	Message 35296 - Posted: 17 May 2018, 15:37:40 UTC - in response to Message 35293.
"Indeed the current tasks are rather heavier than the previous ones - on average each event takes twice or even three times as long."
I took this as a reason to try out 4-core tasks, although I remember seeing in some charts and comments that 4-core is (markedly?) less efficient than 1- or 2-core. So I'll see.

Jim Wilkins Joined: 22 Aug 06 Posts: 22 Credit: 466,060 RAC: 0	Message 35360 - Posted: 23 May 2018, 20:10:48 UTC
Just FYI... It is taking 90 seconds of run time at 100% to accomplish 1 second of estimated time. I have roughly 660 minutes left, so that comes out about 16-17 hours at a CPU set at 100% usage to complete this task. WOW!
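The estimate in the post above can be checked with a few lines of Python. The sketch assumes a constant ratio of 90 real seconds per estimated second, as reported; the helper name `remaining_wall_hours` is made up for this example. Note that the 16-17 hour figure follows when the remaining estimate is read as about 660 seconds, whereas 660 minutes at the same ratio would come to roughly 990 hours.

```python
def remaining_wall_hours(estimated_seconds_left, slowdown):
    """Wall-clock hours left if each estimated second costs `slowdown` real seconds.

    Hypothetical helper; assumes the slowdown ratio stays constant.
    """
    return estimated_seconds_left * slowdown / 3600.0


# 660 estimated seconds left at 90 real seconds each.
print(remaining_wall_hours(660, 90))       # 16.5

# 660 estimated minutes left at the same ratio.
print(remaining_wall_hours(660 * 60, 90))  # 990.0
```

As the follow-up post shows, the ratio did not stay constant, so any such projection is a rough guide at best.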

Jim Wilkins Joined: 22 Aug 06 Posts: 22 Credit: 466,060 RAC: 0	Message 35363 - Posted: 24 May 2018, 12:25:59 UTC - in response to Message 35360.
Well, so much for my predictions. The task suddenly completed and was verified. Interestingly, it ran 3X the normal time, but I got less credit. J

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2724 Credit: 299,073,265 RAC: 9,612	Message 35364 - Posted: 24 May 2018, 13:02:17 UTC - in response to Message 35363. , If you look into your task logs https://lhcathome.cern.ch/lhcathome/result.php?resultid=191413557 https://lhcathome.cern.ch/lhcathome/result.php?resultid=191541487 you may notice lots of lines that show your computer is struggling very hard to run ATLAS. Examples: [pre]2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.trfExe.validate 2018-05-18 20:49:36,502 ERROR Validation of return code failed: EVNTtoHITS got a SIGKILL signal (exit code 137) (Error code 65) 2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.trfExe.validate 2018-05-18 20:49:36,517 INFO Scanning logfile log.EVNTtoHITS for errors 2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.transform.execute 2018-05-18 20:49:36,792 CRITICAL Transform executor raised TransformValidationException: EVNTtoHITS got a SIGKILL signal (exit code 137) 2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.transform.execute 2018-05-18 20:49:40,156 WARNING Transform now exiting early with exit code 65 (EVNTtoHITS got a SIGKILL signal (exit code 137))[/pre] [pre]2018-05-19 14:41:54 (13200): VM state change detected. (old = 'running', new = 'paused') 2018-05-19 14:42:06 (13200): VM state change detected. (old = 'paused', new = 'running') 2018-05-19 14:42:14 (13200): VM state change detected. (old = 'running', new = 'paused') 2018-05-19 14:42:25 (13200): VM state change detected. (old = 'paused', new = 'running') 2018-05-19 14:47:28 (13200): VM state change detected. (old = 'running', new = 'paused')[/pre] The reason is that the recent tasks - when you run them as 1-core or 2-core - need much more RAM than it is configured by the project server. Your logs show that you run them as 1-core (with 3500 MB RAM). 
As your host has enough RAM you may consider to use an app_config.xml like this: [pre]<app_config> <app> <name>ATLAS</name> <max_concurrent>2</max_concurrent> <report_results_immediately/> </app> <app_version> <app_name>ATLAS</app_name> <plan_class>vbox64_mt_mcore_atlas</plan_class> <avg_ncpus>1.0</avg_ncpus> <cmdline>--nthreads 1 --memory_size_mb 4800</cmdline> </app_version> </app_config>[/pre] The settings become active with the 1st fresh task that starts after you "reload config files" in your BOINC manager. ID: 35364 · Reply Quote
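The two warning signs quoted in the post above (the SIGKILL / exit code 137 message and the VM flipping from 'running' to 'paused') can be counted in a saved task log with a short Python sketch. The patterns are copied from the quoted log lines; the function name `count_trouble_signs` and the result keys are assumptions made for this example, and a non-zero count is only a hint that the task may be short of RAM, not proof.

```python
import re

# Patterns copied from the log excerpts quoted in the post.
SIGKILL = re.compile(r"got a SIGKILL signal \(exit code 137\)")
PAUSED = re.compile(
    r"VM state change detected\. \(old = 'running', new = 'paused'\)"
)


def count_trouble_signs(log_text):
    """Count log lines showing a SIGKILL or a running -> paused VM change.

    Hypothetical helper for illustration; it counts lines, not
    individual occurrences within a line.
    """
    counts = {"sigkill": 0, "paused": 0}
    for line in log_text.splitlines():
        if SIGKILL.search(line):
            counts["sigkill"] += 1
        if PAUSED.search(line):
            counts["paused"] += 1
    return counts


# Shortened sample lines based on the excerpts in the post.
SAMPLE = "\n".join([
    "ERROR Validation of return code failed: EVNTtoHITS got a SIGKILL signal (exit code 137) (Error code 65)",
    "2018-05-19 14:41:54 (13200): VM state change detected. (old = 'running', new = 'paused')",
    "2018-05-19 14:42:06 (13200): VM state change detected. (old = 'paused', new = 'running')",
])

if __name__ == "__main__":
    print(count_trouble_signs(SAMPLE))
```

To use it on a real task, paste the stderr output from the result page into a text file and pass its contents to the function.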

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 35375 - Posted: 25 May 2018, 17:11:33 UTC
On my Linux laptop I have a one-core ATLAS task that has been running for 6 days and 22 hours. My last ATLAS task on the SUN M20 Linux workstation has completed with a HITS file, so it is a good task. All Windows 10 ATLAS tasks (2 CPUs) complete in about 20 minutes and validate, but they produce no HITS files. Tullio

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,073,265 RAC: 9,612	Message 35377 - Posted: 25 May 2018, 19:40:53 UTC - in response to Message 35375.
Your task logs show a couple of different error and warning messages for all of your hosts. It seems that you configured your number of cores and your VM's RAM setting only via the project's web preferences. This (most likely) leads to a RAM setting that is too low to run the recent ATLAS tasks. You may use the following app_config.xml files to solve the problems.
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10517701
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10510582
[pre]<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
    <report_results_immediately/>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>1.0</avg_ncpus>
    <cmdline>--nthreads 1 --memory_size_mb 4800</cmdline>
  </app_version>
</app_config>[/pre]
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10407309
[pre]<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
    <report_results_immediately/>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2.0</avg_ncpus>
    <cmdline>--nthreads 2 --memory_size_mb 4800</cmdline>
  </app_version>
</app_config>[/pre]
In addition, your Windows host shows a rather uncommon error message:
[pre]2018-05-24 20:40:00 (568): Error creating VirtualBox instance! rc = 0x80004002[/pre]
This may point to a problem with your VirtualBox installation. I'm not sure how to solve this - other volunteers may know - but you may try reinstalling VirtualBox. Also be aware that David Cameron announced today that the ATLAS task queue may run dry this weekend: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4331&postid=35370
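The 1-core and 2-core files in the post above differ only in the thread count, so a small generator can produce either variant. The following Python sketch uses only the element names and values that appear in the posted files; the function name `build_app_config` and its parameters are assumptions for this example, and the output is a single unindented line of XML.

```python
import xml.etree.ElementTree as ET


def build_app_config(nthreads, memory_mb, max_concurrent=1,
                     plan_class="vbox64_mt_mcore_atlas"):
    """Return app_config.xml text modelled on the files posted above.

    Hypothetical helper; element names and defaults are copied from
    the post, nothing else is known about valid values.
    """
    root = ET.Element("app_config")

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = "ATLAS"
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.SubElement(app, "report_results_immediately")

    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = "ATLAS"
    ET.SubElement(version, "plan_class").text = plan_class
    # avg_ncpus is written as a decimal ("1.0", "2.0") like in the post.
    ET.SubElement(version, "avg_ncpus").text = f"{float(nthreads):.1f}"
    ET.SubElement(version, "cmdline").text = (
        f"--nthreads {nthreads} --memory_size_mb {memory_mb}"
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The 2-core variant suggested for the third host.
    print(build_app_config(nthreads=2, memory_mb=4800))
```

Save the output as app_config.xml in the project's directory and have the BOINC manager reload the config files, as described earlier in the thread.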

tullio Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 35380 - Posted: 26 May 2018, 7:59:38 UTC - in response to Message 35377.
Thanks. I have installed VBox 5.2.12 on the SUN Linux WS and the Windows 10 PC. I am waiting for the Linux laptop to finish its task to do the same. I am against all app_config.xml files. Tasks should run out of the box. Tullio

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,073,265 RAC: 9,612	Message 35382 - Posted: 26 May 2018, 9:37:29 UTC - in response to Message 35380.
"... I am against all app_config.xml files. Tasks should run out of the box."
Using an app_config.xml in this case is like helping a small child when it takes its first steps. The difference is that a child will learn to walk with or without your help.