Message boards : ATLAS application : single core task issues and "top" on the console

David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 32245 - Posted: 5 Sep 2017, 12:21:07 UTC
Last modified: 5 Sep 2017, 12:24:06 UTC

Hi all,

Two pieces of good news:

- Firstly, the long-standing problem with single core tasks is now understood. The problem was that you had to set a much higher memory size than the project sets, otherwise the tasks fail. The reason for this is rather simple (or stupid): due to a configuration error these tasks actually ran in 8-core configuration, with each process using 1/8 of the CPU. We have fixed this error and the new tasks will work properly, although the problematic tasks will take time to drain from the queue so don't remove your app_config.xml files yet! This problem was also the reason the log of event processing didn't appear in the console for single-core tasks.

- Secondly, there is now "top" output available in console 3. In BOINC Manager, click "Show VM console" and press Alt+F3. This Linux command displays information on CPU and memory usage as well as the processes using the most CPU. This article contains a nice explanation of the output you see there. A healthy WU should show N athena.py processes using close to 100% CPU, where N is the number of cores.
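
The health check described above (N athena.py processes close to 100% CPU) can be sketched in Python as follows. This is only an illustration: it assumes the default 12-column top layout with %CPU in the 9th column and the command name last, the 90% threshold is an arbitrary choice, and the sample rows are made-up values, not output from a real WU.

```python
def count_busy_athena(top_text, min_cpu=90.0):
    """Count athena.py rows in top output whose %CPU is at least min_cpu."""
    busy = 0
    for line in top_text.splitlines():
        fields = line.split()
        # Assumed default top layout: 12 columns, %CPU is the 9th, COMMAND the last.
        if len(fields) < 12 or fields[-1] != "athena.py":
            continue
        try:
            cpu = float(fields[8].replace(",", "."))
        except ValueError:
            continue  # not a process row
        if cpu >= min_cpu:
            busy += 1
    return busy


def is_healthy(top_text, n_cores, min_cpu=90.0):
    """True if exactly n_cores athena.py processes are busy."""
    return count_busy_athena(top_text, min_cpu) == n_cores


# Illustrative sample only (hypothetical PIDs and values).
SAMPLE = """\
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4321 atlas01   20   0 2550m 1.8g  90m R 99.7 41.7  12:34.56 athena.py
 4322 atlas01   20   0 2551m 1.8g  90m R 98.0 41.5  12:30.01 athena.py
    1 root      20   0 19356 1540 1228 S  0.0  0.0   0:00.90 init
"""

print(count_busy_athena(SAMPLE))   # number of busy athena.py processes
print(is_healthy(SAMPLE, 2))       # compare against the configured core count
```

A 1-core task that shows several athena.py processes each at a fraction of the CPU would fail this check, which matches the misconfiguration described in the first point.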
ID: 32245
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2519
Credit: 250,885,257
RAC: 127,460
Message 32246 - Posted: 5 Sep 2017, 12:34:43 UTC - in response to Message 32245.  

David Cameron wrote:
Hi all,

Two pieces of good news: ...

That's very good news.
Thank you David.

Can you confirm that the RAM formula for the correctly configured WUs is still
2.6 GB + N * 0.8 GB
where N is the number of configured/used cores?
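
For reference, the formula above can be written as a small Python helper. This is a sketch, not project code: the function name is made up, 1 GB is taken as 1000 MB (the convention that maps 4.2 GB to 4200 MB), and the per-core value is a parameter so that other coefficients can be tried.

```python
def atlas_vm_memory_mb(n_cores, base_gb=2.6, per_core_gb=0.8):
    """RAM for an N-core VM: base_gb + n_cores * per_core_gb, returned in MB."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    # Assumption: 1 GB = 1000 MB; round() absorbs floating-point noise.
    return round((base_gb + n_cores * per_core_gb) * 1000)


for n in (1, 2, 4, 8):
    print(n, "core(s):", atlas_vm_memory_mb(n), "MB")
```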
ID: 32246
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 32260 - Posted: 5 Sep 2017, 14:33:55 UTC - in response to Message 32246.  

computezrmle wrote:
Can you confirm that the RAM formula for the correctly configured WUs is still
2.6 GB + N * 0.8 GB
where N is the number of configured/used cores?


Yes, this is correct.
ID: 32260
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2519
Credit: 250,885,257
RAC: 127,460
Message 32299 - Posted: 6 Sep 2017, 5:18:15 UTC - in response to Message 32245.  

David Cameron wrote:
... the long-standing problem with single core tasks is now understood ...

... and solved.
Well done David.

One of my hosts lately finished a WU that was not only configured as 1-core but also ran as 1-core.
On console 2 it showed the work progress,
on console 3 the top output (a bit flickery but who wants to complain?)

The top output shows that it may be possible to lower the RAM setting according to the formula.
This may be helpful on hosts with less RAM, e.g. 8GB ones.
ID: 32299
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2519
Credit: 250,885,257
RAC: 127,460
Message 32303 - Posted: 6 Sep 2017, 7:30:09 UTC

I'm just testing a 1-core setup with 2800 MB RAM (via app_config.xml).

TOP gives the following stable values:
athena.py: 99 % CPU, 67 % RAM
OS-cache: 325 MB

So far the WU is running smoothly.
ID: 32303
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1409
Credit: 9,325,539
RAC: 9,432
Message 32310 - Posted: 6 Sep 2017, 16:27:29 UTC - in response to Message 32260.  

computezrmle wrote:
Can you confirm that the RAM formula for the correctly configured WUs is still
2.6 GB + N * 0.8 GB
where N is the number of configured/used cores?

David Cameron wrote:
Yes, this is correct.

Could you change that formula to 2.6 GB + N * 0.9 GB?

That way dual-core VMs from users not using an app_config.xml will also be able to run with enough memory (4.4 GB instead of 4.2 GB).
ID: 32310
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2519
Credit: 250,885,257
RAC: 127,460
Message 32312 - Posted: 6 Sep 2017, 17:51:37 UTC - in response to Message 32310.  

Crystal Pellet wrote:
Could you change that formula to 2.6 GB + N * 0.9 GB?

That way dual-core VMs from users not using an app_config.xml will also be able to run with enough memory (4.4 GB instead of 4.2 GB).

I'm not sure that this is necessary.
Could you please run a 2-core setup with 4200 MB and post the TOP values here so they can be compared with the 1-core setting?
ID: 32312
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 829
Credit: 687,382,599
RAC: 176,255
Message 32315 - Posted: 6 Sep 2017, 19:54:14 UTC - in response to Message 32245.  

Hi David,

If I set the web preferences for ATLAS Max # jobs to No limit, then it will only ever run one at a time. If I set it to 2, then it will run 2 at a time.

If Max # jobs is greater than the number of CPU cores, then it will buffer some work up to the job limit.

For the other projects, the No limit setting lets BOINC buffer work based on the preferences.

So it seems that ATLAS is somehow configured differently from the other projects. Could this be modified to work like the other projects?
ID: 32315
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1409
Credit: 9,325,539
RAC: 9,432
Message 32323 - Posted: 7 Sep 2017, 7:19:50 UTC - in response to Message 32312.  

Crystal Pellet wrote:
Could you change that formula to 2.6 GB + N * 0.9 GB?

That way dual-core VMs from users not using an app_config.xml will also be able to run with enough memory (4.4 GB instead of 4.2 GB).

computezrmle wrote:
I'm not sure that this is necessary.
Could you please run a 2-core setup with 4200 MB and post the TOP values here so they can be compared with the 1-core setting?

As you can see, after about 9 minutes of run time the memory is very low.
The major problem, however, is that there seems to be no swap file configured.
Like all the jobs I ran with 4200 MB in the past, the job could not proceed:
2017-09-07 08:56:32 (14704): Guest Log: - Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_3474_1504766658/PandaJob/athena_stdout.txt -
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.preExecute 2017-09-07 08:45:44,241 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.preExecute 2017-09-07 08:45:44,243 INFO Now writing wrapper for substep executor EVNTtoHITS
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2017-09-07 08:45:44,243 INFO Valgrind not engaged
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.preExecute 2017-09-07 08:45:44,243 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.execute 2017-09-07 08:45:44,244 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.execute 2017-09-07 08:51:10,269 INFO EVNTtoHITS executor returns 65
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.validate 2017-09-07 08:51:11,218 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.validate 2017-09-07 08:51:11,246 INFO Scanning logfile log.EVNTtoHITS for errors
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.transform.execute 2017-09-07 08:51:11,755 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.transform.execute 2017-09-07 08:51:15,011 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
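
The two things worth pulling out of an excerpt like the one above are the exit code and the quoted FATAL message. A minimal Python sketch for that, assuming the log wording stays as shown here (the function name and regular expressions are made up for this example and are not part of any ATLAS or BOINC tool):

```python
import re

# Patterns based only on the wording visible in the excerpt above.
EXIT_RE = re.compile(r"exit code (\d+)")
FATAL_RE = re.compile(r'"([^"]*FATAL[^"]*)"')


def summarize_failure(log_lines):
    """Return (exit_code, first_fatal_message); either may be None."""
    exit_code = None
    fatal = None
    for line in log_lines:
        match = EXIT_RE.search(line)
        if match:
            exit_code = int(match.group(1))
        match = FATAL_RE.search(line)
        if match and fatal is None:
            fatal = match.group(1)
    return exit_code, fatal


# Message parts of two lines from the excerpt (timestamps dropped).
SAMPLE_LINES = [
    "PyJobTransforms.trfExe.execute INFO EVNTtoHITS executor returns 65",
    "PyJobTransforms.transform.execute WARNING Transform now exiting early "
    "with exit code 65 (Non-zero return code from EVNTtoHITS (65); "
    'Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool '
    'failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")',
]

code, message = summarize_failure(SAMPLE_LINES)
print(code)
print(message)
```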



ID: 32323
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 199,929,178
RAC: 57,458
Message 32327 - Posted: 7 Sep 2017, 8:31:29 UTC - in response to Message 32245.  

David Cameron wrote:
... due to a configuration error these tasks actually ran in 8-core configuration, with each process using 1/8 of the CPU. We have fixed this error and the new tasks will work properly, although the problematic tasks will take time to drain from the queue so don't remove your app_config.xml files yet! This problem was also the reason the log of event processing didn't appear in the console for single-core tasks.

For this problem the TOP console is really very helpful. I watched a single-core task on a machine "live" and after 15 minutes I could see that the task had switched to 8-core crunching, so I could abort the WU. This has already saved me hours and hours of useless crunching!

Thanks a lot, David


Supporting BOINC, a great concept !
ID: 32327
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 199,929,178
RAC: 57,458
Message 32328 - Posted: 7 Sep 2017, 9:05:31 UTC
Last modified: 7 Sep 2017, 9:07:04 UTC

Hm, now I have single-core tasks that are idling after 30 minutes. A normal WU spins up in roughly 15 to 20 minutes, but these tasks have a runtime of 30 minutes or more and are doing nothing.

No athena.py; or rather, it runs for only 1 second and then it's gone again.



EDIT: Just after I wrote this, athena.py came up and it is crunching now


Supporting BOINC, a great concept !
ID: 32328
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2519
Credit: 250,885,257
RAC: 127,460
Message 32329 - Posted: 7 Sep 2017, 10:40:02 UTC

To avoid some misinterpretations of the TOP values:


MEM "used" includes the value from "cached".
Once Linux needs RAM, e.g. to start an application, it is taken from free (if available), then from cached.
After a while you will always see very little free MEM but lots of cached.
The free value rises only if applications deallocate RAM or if Linux drops parts of the cache as the corresponding files are deleted.


Defining/using swap is not mandatory as long as there is enough free+cached RAM.
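
To put a number on "free+cached", the two values can be read from /proc/meminfo and added up. The Python sketch below is only a rough estimate under stated assumptions: it treats all of "Cached" as reclaimable, which is a simplification, and the sample numbers are illustrative rather than taken from a real VM.

```python
def free_plus_cached_mb(meminfo_text):
    """Estimate RAM that can still be handed out: MemFree + Cached, in MB."""
    values_kb = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values_kb[key.strip()] = int(parts[0])  # /proc/meminfo reports kB
    # Simplification: counts the whole page cache as reclaimable.
    return (values_kb.get("MemFree", 0) + values_kb.get("Cached", 0)) / 1024.0


# Illustrative values in /proc/meminfo format (not measured).
SAMPLE_MEMINFO = """\
MemTotal:        4458496 kB
MemFree:          102400 kB
Buffers:           20480 kB
Cached:          1612800 kB
SwapTotal:             0 kB
"""

print(free_plus_cached_mb(SAMPLE_MEMINFO))
```

On a Linux system the input would come from `open("/proc/meminfo").read()`.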

@CP
In this special case 1.8 GB (from cached) was obviously not enough to start the second athena.py instance.
In addition, due to the absence of swap, the system crashed (exit code 65).
I had the same experience with a 2-core VM last night.
I guess only a few MB were missing to survive this phase.

Possible solutions:
- to check the amount of RAM before an additional athena.py starts (preferred)
- to configure more RAM (but how much, given that there are different types of datasets?)
- to configure swap space (only needed for startup)



@Yeti
You may have cancelled your WU too early.
The limiting factor is RAM, not #cores.
If the VM has enough RAM to start all configured athena.py instances (here: 8) it will probably finish successfully.
In this case all athenas would have shared the single core.
I observed this situation during the last days with one of my WUs.
The advantage of TOP is that a volunteer can now make the project people aware of a misconfiguration.
ID: 32329
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1409
Credit: 9,325,539
RAC: 9,432
Message 32330 - Posted: 7 Sep 2017, 12:41:01 UTC - in response to Message 32329.  

computezrmle wrote:
@CP
In this special case 1.8 GB (from cached) was obviously not enough to start the second athena.py instance.
In addition, due to the absence of swap, the system crashed (exit code 65).
I had the same experience with a 2-core VM last night.
I guess only a few MB were missing to survive this phase.

Possible solutions:
- to check the amount of RAM before an additional athena.py starts (preferred)
- to configure more RAM (but how much, given that there are different types of datasets?)
- to configure swap space (only needed for startup)

It was already mentioned several times that a 2-core VM is not able to run properly with 4.2 GB of RAM. It always crashes.
For me, a dual-core VM with 4.4 GB of RAM is always successful.
The best option is to give the VM a swap file of at least 1 GB, or at least to increase the +0.8 GB in the formula to +0.9 GB.
ID: 32330
maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,686,911
RAC: 24,977
Message 32341 - Posted: 8 Sep 2017, 6:01:39 UTC - in response to Message 32330.  
Last modified: 8 Sep 2017, 6:03:44 UTC


Crystal Pellet wrote:
It was already mentioned several times that a 2-core VM is not able to run properly with 4.2 GB of RAM. It always crashes.
For me, a dual-core VM with 4.4 GB of RAM is always successful.
The best option is to give the VM a swap file of at least 1 GB, or at least to increase the +0.8 GB in the formula to +0.9 GB.


Crystal,

in this 2-core VM you can see that every athena.py needs 2.550 GB.
For me this is also the case with SL69 native.
With SL69 native, top shows a swap file of 1.5 GB.
I rebooted and set the memory to 8 GB.
The swap file is still there at 1.5 GB, but with 0 MB in use.
The task is running again.
I did this reboot after the ATLAS task had reached about 10%.
ID: 32341
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 32343 - Posted: 8 Sep 2017, 8:05:11 UTC - in response to Message 32310.  

computezrmle wrote:
Can you confirm that the RAM formula for the correctly configured WUs is still
2.6 GB + N * 0.8 GB
where N is the number of configured/used cores?

David Cameron wrote:
Yes, this is correct.

Crystal Pellet wrote:
Could you change that formula to 2.6 GB + N * 0.9 GB?

That way dual-core VMs from users not using an app_config.xml will also be able to run with enough memory (4.4 GB instead of 4.2 GB).


I've updated the memory formula to use 0.9 * N.

I have been investigating these memory issues a bit with the ATLAS software developers. It seems that near the beginning of the task there is a large memory spike requiring 4.4 GB of available RAM. If the VM has less than that, the task fails with the "makePool" error. After this the task uses much less memory; even with 8 cores it only uses 3.5 GB in total.

The best way to solve this is probably to add swap space as you suggest, since it will only be used for this short spike.
ID: 32343
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2519
Credit: 250,885,257
RAC: 127,460
Message 32346 - Posted: 8 Sep 2017, 10:25:51 UTC - in response to Message 32341.  

maeax wrote:
in this 2-core VM you can see that every athena.py needs 2.550 GB.
For me this is also the case with SL69 native.

This may be misleading.
As you can see in the picture in this post, there are 2 athenas running concurrently, each of them with more than 2500m (VIRT).
Together this is more than the 4354m the VM has in total.
Nonetheless the VM's OS cache is filled with 1575m and the CPUs run at full speed (100 %; 98 %).
The real athena RAM usage can be seen in the %MEM column (41.7 %; 41.5 %).


maeax wrote:
With SL69 native, top shows a swap file of 1.5 GB.

Are you sure this is not the swap from your host itself, or from your SL69 VM respectively?
ID: 32346
maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,686,911
RAC: 24,977
Message 32347 - Posted: 8 Sep 2017, 10:49:52 UTC

ID: 32347
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 32349 - Posted: 8 Sep 2017, 14:25:48 UTC - in response to Message 32346.  

maeax wrote:
in this 2-core VM you can see that every athena.py needs 2.550 GB.
For me this is also the case with SL69 native.

computezrmle wrote:
This may be misleading.
As you can see in the picture in this post, there are 2 athenas running concurrently, each of them with more than 2500m (VIRT).
Together this is more than the 4354m the VM has in total.
Nonetheless the VM's OS cache is filled with 1575m and the CPUs run at full speed (100 %; 98 %).
The real athena RAM usage can be seen in the %MEM column (41.7 %; 41.5 %).


The RES column shows what is really being used by the processes and this is also what is in the %MEM column. Virtual memory (VIRT) contains all possible memory addresses of the process but does not reflect how much of that is in RAM.

However, one important fact is that athena processes share memory space with each other, hence the total memory used is not the sum of the RES of each process, because some of that memory is shared between multiple processes. This is of course exactly the benefit of running multi-core because it saves memory. Unfortunately top is not able to show how much memory is shared.

In other words, top is a pretty good way to see if the WU is working or not but cannot give you a really accurate measurement of memory usage.
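
For anyone who wants a figure that does not double-count shared pages, one option outside of top is the Pss (proportional set size) value that Linux reports per mapping in /proc/<pid>/smaps: each shared page is divided by the number of processes mapping it. The Python sketch below is only an illustration of that idea. Whether the file is readable inside the VM is an assumption, the helper name is made up, and the sample numbers are invented.

```python
def sum_smaps_field(smaps_text, field):
    """Sum one per-mapping field (e.g. "Rss" or "Pss") of a smaps dump, in kB."""
    prefix = field + ":"
    total_kb = 0
    for line in smaps_text.splitlines():
        parts = line.split()
        # Exact match on the first token, so "SwapPss:" is not counted as "Pss:".
        if len(parts) >= 2 and parts[0] == prefix:
            total_kb += int(parts[1])
    return total_kb


# Invented smaps fragment for one process: an 800 kB mapping shared with one
# other process (so its Pss is 400 kB) and a 200 kB private mapping.
SAMPLE_SMAPS = """\
Rss:                 800 kB
Pss:                 400 kB
Rss:                 200 kB
Pss:                 200 kB
"""

# Two identical processes: adding up Rss counts the shared part twice,
# adding up Pss does not.
processes = [SAMPLE_SMAPS, SAMPLE_SMAPS]
print(sum(sum_smaps_field(p, "Rss") for p in processes))
print(sum(sum_smaps_field(p, "Pss") for p in processes))
```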


maeax wrote:
With SL69 native, top shows a swap file of 1.5 GB.

computezrmle wrote:
Are you sure this is not the swap from your host itself, or from your SL69 VM respectively?


Indeed with the native app the environment is whatever your host provides, so if you configure swap then ATLAS can potentially use it.
ID: 32349



©2024 CERN