Message boards : ATLAS application : Deadly long ATLAS tasks

Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Message 35163 - Posted: 4 May 2018, 12:53:32 UTC
Last modified: 4 May 2018, 12:54:28 UTC

Hi,
I have received a couple of ATLAS tasks that ran far longer than ATLAS tasks usually do, i.e. about 2 days instead of a few hours:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=188619213
The same seems to be happening with an ATLAS task from lhcathome-dev.

In general, on my Ubuntu machine native ATLAS tasks run fine. I did not have a look at the logs, but maybe you can find something interesting...
Cheers,
A.

Erich56
Message 35165 - Posted: 4 May 2018, 13:06:10 UTC

I had very long ATLAS tasks several days ago; they eventually failed :-(

See my posting here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4620&postid=35085#35085

Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 35276 - Posted: 16 May 2018, 7:08:14 UTC - in response to Message 35165.  

Hello,
I got another couple of these tasks with ATLAS version 2.54 (native_mt), e.g.:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=191180856
The expected duration was a few hours, but it has now been running for a couple of days. The second one comes from lhcathome-dev, with ATLAS version 0.50 (native_mt).

Could someone tell me whether such long ATLAS jobs are normal or expected? If so, something is wrong with the estimated running time; if not, is there a fix?
These long tasks occupy all the available CPU slots for entire days, preventing other tasks from being processed. (There are not many slots: I contribute my work desktop, which I use mainly to monitor things from the volunteer's point of view.)

Are these tasks also the reason behind the drop in the GigaFLOPs reported by the server status page? 1-2 weeks ago we were at >80, but now we are at ~50-60...
Thanks,
Cheers,
A.

maeax
Message 35277 - Posted: 16 May 2018, 7:52:45 UTC
Last modified: 16 May 2018, 7:56:10 UTC

I have Linux-native tasks that take 40 hours on a single CPU and 36 hours on two CPUs.
Yes, they are heavy. You get more than 1k Cobblestones.
https://lhcathome.cern.ch/lhcathome/img/progresschart.png
Edit: Of course only 200 Collisions!

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Message 35278 - Posted: 16 May 2018, 8:15:10 UTC

A few days ago I had a 1-core longrunner with the following values:
WallTime=144152.41s
KernelTime=298.04s
UserTime=143632.60s
CPUUsage=99%
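
As a quick sanity check, the reported CPUUsage is consistent with the three timings if it is taken to be (KernelTime + UserTime) / WallTime. That formula is an assumption here; the task report does not state how the figure is derived. A minimal sketch:

```python
# Values copied from the task report above.
wall_time = 144152.41    # seconds
kernel_time = 298.04     # seconds
user_time = 143632.60    # seconds

# Assumption: CPUUsage = (kernel + user) / wall. The report does not
# say how the percentage is computed.
cpu_usage = (kernel_time + user_time) / wall_time

print(f"CPU usage: {cpu_usage:.1%}")            # roughly 99.8%
print(f"Wall time: {wall_time / 3600:.1f} h")   # roughly 40 hours
```

So the task kept its core busy for about 40 hours, which fits the "heavy but healthy" picture rather than a hung task.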



Monitoring tip:
Open a console window, cd to your BOINC client's base directory and run the following one-liner:
watch -n10 "find ./slots/ \( -name \"log.EVNTtoHITS\" -o -name \"AthenaMP.log\" \) |sort |xargs -I {} -n1 sh -c \"egrep 'INFO.*Event nr. ' {} |tail -n1; echo -e '\n'\""
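
For anyone who prefers a script to the shell one-liner, here is a rough Python sketch of the same idea. It looks under ./slots for the two log file names used above and reports the last line matching "INFO ... Event nr. " in each. The function name `last_event_lines` is made up for illustration, and the script assumes it is started from the BOINC client's base directory; it does not refresh periodically the way `watch` does.

```python
import re
from pathlib import Path

# Log file names and the line pattern are taken from the one-liner above.
LOG_NAMES = {"log.EVNTtoHITS", "AthenaMP.log"}
EVENT_LINE = re.compile(r"INFO.*Event nr\. ")


def last_event_lines(slots_dir="./slots"):
    """Return {log path: last 'Event nr.' line} for every matching log found."""
    result = {}
    root = Path(slots_dir)
    if not root.is_dir():
        return result
    logs = sorted(p for p in root.rglob("*") if p.is_file() and p.name in LOG_NAMES)
    for path in logs:
        last = None
        with path.open(errors="replace") as handle:
            for line in handle:
                if EVENT_LINE.search(line):
                    last = line.rstrip("\n")
        if last is not None:
            result[str(path)] = last
    return result


if __name__ == "__main__":
    for log_path, line in last_event_lines().items():
        print(f"{log_path}\n  {line}\n")
```

If the event number in the last line keeps increasing between runs, the task is still making progress.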

David Cameron
Project administrator
Project developer
Project scientist
Message 35293 - Posted: 17 May 2018, 12:59:32 UTC

Indeed the current tasks are rather heavier than the previous ones - on average each event takes twice or even three times as long. Some may call them "deadly", others might appreciate the extra credit :)

In general if the task is using close to 100% CPU it is still good - this is especially true for native linux tasks where we don't have VirtualBox causing trouble.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Message 35294 - Posted: 17 May 2018, 13:17:52 UTC - in response to Message 35293.  

"... others might appreciate the extra credit :)"

What looks like extra credit now will turn into extra low credit once we get work with shorter runtimes.
This is caused by the way credit is calculated.
In the end it will average out.

maeax
Message 35295 - Posted: 17 May 2018, 13:50:12 UTC - in response to Message 35293.  

"In general if the task is using close to 100% CPU it is still good - this is especially true for native linux tasks where we don't have VirtualBox causing trouble."

👍

Erich56
Message 35296 - Posted: 17 May 2018, 15:37:40 UTC - in response to Message 35293.  

"Indeed the current tasks are rather heavier than the previous ones - on average each event takes twice or even three times as long."

I took this as a reason to try out 4-core tasks, although I remember seeing in some charts and comments that 4-core is (markedly?) less efficient than 1- or 2-core. So I'll see.

Jim Wilkins
Message 35360 - Posted: 23 May 2018, 20:10:48 UTC

Just FYI: it is taking 90 seconds of run time at 100% CPU to accomplish 1 minute of estimated time. I have roughly 660 minutes left, so that comes out to about 16-17 hours with the CPU at 100% usage to complete this task. WOW!
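
The arithmetic behind that estimate can be sketched as follows, reading the rate as 90 seconds of actual run time per minute of estimated time. That reading is an assumption; it is the one under which the 16-17 hour figure follows from the 660 minutes.

```python
# Figures from the post. The "per minute of estimated time" reading of
# the rate is an assumption.
run_seconds_per_estimated_minute = 90
estimated_minutes_left = 660

real_seconds_left = estimated_minutes_left * run_seconds_per_estimated_minute
real_hours_left = real_seconds_left / 3600

print(f"{real_hours_left:.1f} hours")  # 16.5 hours
```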

Jim Wilkins
Message 35363 - Posted: 24 May 2018, 12:25:59 UTC - in response to Message 35360.  

Well, so much for my predictions. The task suddenly completed and was verified. Interestingly, it ran 3x the normal time, but I got less credit.

J

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Message 35364 - Posted: 24 May 2018, 13:02:17 UTC - in response to Message 35363.  

Hi Jim,

If you look into your task logs
https://lhcathome.cern.ch/lhcathome/result.php?resultid=191413557
https://lhcathome.cern.ch/lhcathome/result.php?resultid=191541487
you may notice lots of lines that show your computer is struggling very hard to run ATLAS.

Examples:
2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.trfExe.validate 2018-05-18 20:49:36,502 ERROR Validation of return code failed: EVNTtoHITS got a SIGKILL signal (exit code 137) (Error code 65)
2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.trfExe.validate 2018-05-18 20:49:36,517 INFO Scanning logfile log.EVNTtoHITS for errors
2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.transform.execute 2018-05-18 20:49:36,792 CRITICAL Transform executor raised TransformValidationException: EVNTtoHITS got a SIGKILL signal (exit code 137)
2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.transform.execute 2018-05-18 20:49:40,156 WARNING Transform now exiting early with exit code 65 (EVNTtoHITS got a SIGKILL signal (exit code 137))

2018-05-19 14:41:54 (13200): VM state change detected. (old = 'running', new = 'paused')
2018-05-19 14:42:06 (13200): VM state change detected. (old = 'paused', new = 'running')
2018-05-19 14:42:14 (13200): VM state change detected. (old = 'running', new = 'paused')
2018-05-19 14:42:25 (13200): VM state change detected. (old = 'paused', new = 'running')
2018-05-19 14:47:28 (13200): VM state change detected. (old = 'running', new = 'paused')
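
Checking a task log for these two symptoms can be automated. The following is only a minimal sketch: the patterns are copied from the example lines above, the function name `summarize_task_log` is made up for illustration, and real logs may contain other wording that this does not catch.

```python
import re

# Patterns copied from the example log lines above.
SIGKILL = re.compile(r"got a SIGKILL signal \(exit code 137\)")
VM_STATE = re.compile(
    r"VM state change detected\. \(old = '(\w+)', new = '(\w+)'\)"
)


def summarize_task_log(text):
    """Count lines mentioning SIGKILL and VM transitions into 'paused'."""
    sigkill_lines = 0
    pauses = 0
    for line in text.splitlines():
        if SIGKILL.search(line):
            sigkill_lines += 1
        match = VM_STATE.search(line)
        if match and match.group(2) == "paused":
            pauses += 1
    return {"sigkill_lines": sigkill_lines, "pauses": pauses}


sample = """\
2018-05-18 14:51:16 (4482): Guest Log: PyJobTransforms.trfExe.validate 2018-05-18 20:49:36,502 ERROR Validation of return code failed: EVNTtoHITS got a SIGKILL signal (exit code 137) (Error code 65)
2018-05-19 14:41:54 (13200): VM state change detected. (old = 'running', new = 'paused')
2018-05-19 14:42:06 (13200): VM state change detected. (old = 'paused', new = 'running')
"""

print(summarize_task_log(sample))  # {'sigkill_lines': 1, 'pauses': 1}
```

Many pauses within a few minutes, or any SIGKILL line, is a hint that the host is short of resources for the task.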


The reason is that the recent tasks, when run as 1-core or 2-core, need much more RAM than is configured by the project server.
Your logs show that you run them as 1-core (with 3500 MB RAM).

As your host has enough RAM, you may consider using an app_config.xml like this:
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>2</max_concurrent>
    <report_results_immediately/>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>1.0</avg_ncpus>
    <cmdline>--nthreads 1 --memory_size_mb 4800</cmdline>
  </app_version>
</app_config>

The settings become active with the first fresh task that starts after you select "Read config files" in your BOINC Manager.
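
If you want to try different thread counts or memory sizes without editing the XML by hand, a small script can generate the file. This is just a sketch that mirrors the example above: the function name `build_app_config` and its parameters are made up for illustration, and the element names and the plan class are copied from the XML shown, not checked against anything else.

```python
import xml.etree.ElementTree as ET


def build_app_config(nthreads, memory_mb, max_concurrent):
    """Build an app_config.xml tree with the same layout as the example above."""
    root = ET.Element("app_config")

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = "ATLAS"
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.SubElement(app, "report_results_immediately")

    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = "ATLAS"
    ET.SubElement(version, "plan_class").text = "vbox64_mt_mcore_atlas"
    ET.SubElement(version, "avg_ncpus").text = f"{float(nthreads):.1f}"
    ET.SubElement(version, "cmdline").text = (
        f"--nthreads {nthreads} --memory_size_mb {memory_mb}"
    )
    return root


if __name__ == "__main__":
    tree = build_app_config(nthreads=1, memory_mb=4800, max_concurrent=2)
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9 or newer
        ET.indent(tree, space="  ")
    print(ET.tostring(tree, encoding="unicode"))
```

Redirect the output into app_config.xml in the project's directory and the same activation step applies as for a hand-written file.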

tullio
Message 35375 - Posted: 25 May 2018, 17:11:33 UTC

On my Linux laptop I have a one core Atlas task that has been running for 6 days and 22 hours. My last Atlas task on the SUN M20 Linux workstation has completed with a HITS file, so it is a good task. All Windows 10 Atlas tasks (2 CPUs) complete in about 20 minutes and validate, but they produce no HITS files.
Tullio

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Message 35377 - Posted: 25 May 2018, 19:40:53 UTC - in response to Message 35375.  

Your task logs show a couple of different error and warning messages for all of your hosts.
It seems you configured the number of cores and your VM's RAM setting only via the project's web preferences.
This most likely leaves the RAM setting too low to run the recent ATLAS tasks.

You may use the following app_config.xml files to solve the problems.

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10517701
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10510582
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
    <report_results_immediately/>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>1.0</avg_ncpus>
    <cmdline>--nthreads 1 --memory_size_mb 4800</cmdline>
  </app_version>
</app_config>




https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10407309
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
    <report_results_immediately/>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2.0</avg_ncpus>
    <cmdline>--nthreads 2 --memory_size_mb 4800</cmdline>
  </app_version>
</app_config>



In addition, your Windows host shows a rather uncommon error message:
2018-05-24 20:40:00 (568): Error creating VirtualBox instance! rc = 0x80004002

This may point to a problem with your VirtualBox installation.
I'm not sure how to solve this (other volunteers may know), but you could try reinstalling VirtualBox.


Also be aware that David Cameron announced today that the ATLAS task queue may run dry this weekend:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4331&postid=35370

tullio
Message 35380 - Posted: 26 May 2018, 7:59:38 UTC - in response to Message 35377.  

Thanks. I have installed VBox 5.2.12 on the SUN Linux WS and the Windows 10 PC. I am waiting for the Linux laptop to finish its task to do the same. I am against all app_config.xml files. Tasks should run out of the box.
Tullio

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Message 35382 - Posted: 26 May 2018, 9:37:29 UTC - in response to Message 35380.  

"... I am against all app_config.xml files. Tasks should run out of the box."

Using an app_config.xml in this case is like helping a small child when it takes its first steps.
The difference is that a child will learn to walk with or without your help.