single core task issues and "top" on the console
Author | Message |
---|---|
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Hi all, Two pieces of good news: - Firstly, the long-standing problem with single core tasks is now understood. The problem was that you had to set a much higher memory size than the project sets, otherwise the tasks fail. The reason for this is rather simple (or stupid): due to a configuration error these tasks actually ran in 8-core configuration, with each process using 1/8 of the CPU. We have fixed this error and the new tasks will work properly, although the problematic tasks will take time to drain from the queue so don't remove your app_config.xml files yet! This problem was also the reason the log of event processing didn't appear in the console for single-core tasks. - Secondly, there is now "top" output available in console 3. On BOINC manager click on "Show VM console" and press Alt+F3. This Linux command displays information on CPU and memory usage as well as the processes using the most CPU. This article contains a nice explanation of the output you see there. A healthy WU should show N athena.py processes using close to 100% CPU where N is the number of cores. |
Send message Joined: 15 Jun 08 Posts: 2509 Credit: 248,926,125 RAC: 129,317 |
David Cameron wrote: Hi all, That's very good news. Thank you, David. Can you confirm that the RAM formula for the correctly configured WUs is still 2.6 GB + N * 0.8 GB, where N is the number of configured/used cores? |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Can you confirm that the RAM formula for the correctly configured WUs is still Yes, this is correct. |
Send message Joined: 15 Jun 08 Posts: 2509 Credit: 248,926,125 RAC: 129,317 |
David Cameron wrote: ... the long-standing problem with single core tasks is now understood ... ... and solved. Well done David. One of my hosts lately finished a WU that was not only configured as 1-core but also ran as 1-core. On console 2 it showed the work progress, on console 3 the top output (a bit flickery but who wants to complain?) The top output shows that it may be possible to lower the RAM setting according to the formula. This may be helpful on hosts with less RAM, e.g. 8GB ones. |
Send message Joined: 15 Jun 08 Posts: 2509 Credit: 248,926,125 RAC: 129,317 |
I'm just testing a 1-core setup with 2800 MB RAM (via app_config.xml). TOP gives the following stable values: athena.py: 99 % CPU, 67 % RAM OS-cache: 325 MB So far the WU is running smoothly. |
Send message Joined: 14 Jan 10 Posts: 1378 Credit: 9,162,540 RAC: 5,071 |
Can you confirm that the RAM formula for the correctly configured WUs is still Could you change that formula to 2.6 GB + N * 0.9 GB. That way dual-core VMs from users not using an app_config.xml will also get enough memory to run (4.4 GB instead of 4.2 GB). |
Send message Joined: 15 Jun 08 Posts: 2509 Credit: 248,926,125 RAC: 129,317 |
Could you change that formula to 2.6 GB + N * 0.9 GB. I'm not sure that this is necessary. Could you please run a 2-core setup with 4200 MB and post the TOP values here so they can be compared to the 1-core setting? |
Send message Joined: 27 Sep 08 Posts: 823 Credit: 684,419,873 RAC: 143,943 |
Hi David, If I set the web preferences for ATLAS Max # jobs to No Limit, then it will only ever run one at a time. If I set it to 2, then it will run 2 at a time. If Max # jobs is greater than the # of CPU cores, then it will buffer some work up to the job limit. For the other projects the No Limit setting lets BOINC buffer work based on the preferences. Given this, it seems like ATLAS is somehow configured differently from the other projects. Could this be modified to work like the other projects? |
Send message Joined: 14 Jan 10 Posts: 1378 Credit: 9,162,540 RAC: 5,071 |
Could you change that formula to 2.6 GB + N * 0.9 GB. As you can see, after about 9 minutes of run time the memory is very low. The major problem, however, is that there seems to be no swap file configured. Like all the jobs I ran with 4200 MB in the past, the job could not proceed:
2017-09-07 08:56:32 (14704): Guest Log: - Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_3474_1504766658/PandaJob/athena_stdout.txt -
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.preExecute 2017-09-07 08:45:44,241 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.preExecute 2017-09-07 08:45:44,243 INFO Now writing wrapper for substep executor EVNTtoHITS
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2017-09-07 08:45:44,243 INFO Valgrind not engaged
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.preExecute 2017-09-07 08:45:44,243 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.execute 2017-09-07 08:45:44,244 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.execute 2017-09-07 08:51:10,269 INFO EVNTtoHITS executor returns 65
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.validate 2017-09-07 08:51:11,218 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.trfExe.validate 2017-09-07 08:51:11,246 INFO Scanning logfile log.EVNTtoHITS for errors
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.transform.execute 2017-09-07 08:51:11,755 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-09-07 08:56:32 (14704): Guest Log: PyJobTransforms.transform.execute 2017-09-07 08:51:15,011 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider") |
Send message Joined: 2 Sep 04 Posts: 455 Credit: 198,912,914 RAC: 87,049 |
due to a configuration error these tasks actually ran in 8-core configuration, with each process using 1/8 of the CPU. We have fixed this error and the new tasks will work properly, although the problematic tasks will take time to drain from the queue so don't remove your app_config.xml files yet! This problem was also the reason the log of event processing didn't appear in the console for single-core tasks. For this problem the TOP console is really very helpful. I watched a single-core task on a machine "live" and after 15 minutes I could see that the task had switched to 8-core crunching, so I could abort the WU. This has already saved me hours and hours of useless crunching! Thanks a lot, David Supporting BOINC, a great concept ! |
Send message Joined: 2 Sep 04 Posts: 455 Credit: 198,912,914 RAC: 87,049 |
Hm, now I have single-core tasks that are idling after 30 minutes. A normal WU spins up in roughly 15 to 20 minutes, but these tasks have a runtime of 30 minutes or more and are doing nothing. No athena.py, or rather, it only runs for 1 second and then it's gone again. EDIT: Just after I wrote this, athena.py came up and it is crunching now Supporting BOINC, a great concept ! |
Send message Joined: 15 Jun 08 Posts: 2509 Credit: 248,926,125 RAC: 129,317 |
To avoid some misinterpretations of the TOP values: MEM "used" includes the value from "cached". When Linux needs RAM, e.g. to start an application, it is taken from free (if available), then from cached. After a while you will always see very little free MEM but lots of cached. The free value rises only if applications deallocate RAM or if Linux drops parts of the cache because the corresponding files are deleted. Defining/using swap is not mandatory as long as there is enough free+cached RAM. @CP In this particular case 1.8 GB (from cached) was obviously not enough to start the second athena.py instance. In addition, due to the absence of swap, the system crashed (exit code 65). I had the same experience with a 2-core VM last night. I guess only a few MB were missing to survive this phase. Possible solutions: - check the amount of RAM before an additional athena.py starts (preferred) - configure more RAM (how much, given that there are different types of datasets?) - configure swap space (only needed for startup) @Yeti You may have cancelled your WU too early. The limiting factor is RAM, not #cores. If the VM has enough RAM to start all configured athena.py instances (here: 8), it will probably finish successfully. In this case all athenas would have shared the single core. I observed this situation over the last few days with one of my WUs. The advantage of TOP is that a volunteer can now make the project people aware of a misconfiguration. |
Send message Joined: 14 Jan 10 Posts: 1378 Credit: 9,162,540 RAC: 5,071 |
@CP It was already mentioned several times that a 2-core VM is not able to run properly with 4.2GB of RAM. It always crashes. For me, a dual-core VM with 4.4GB of RAM is always successful. The best option is to give the VM a swap file of at least 1 GB, or at least to increase the +0.8 GB in the formula to +0.9 GB. |
Send message Joined: 2 May 07 Posts: 2193 Credit: 173,357,424 RAC: 50,850 |
Crystal, in this 2-core VM you can see that every athena.py needs 2.55 GByte. For me this is also the case in SL69 native. In SL69 native, top shows a swap file of 1.5 GByte. I rebooted and set the memory to 8 GByte. Now the swap file is also there with 1.5 GByte, but with zero MByte in use. The task is running again and has resumed processing. I made this reboot after the ATLAS task had reached 10%. |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Can you confirm that the RAM formula for the correctly configured WUs is still I've updated the memory formula to use 0.9 * N. I have been investigating these memory issues a bit with the ATLAS software developers. It seems that near the beginning of the task there is a large memory spike requiring 4.4GB of RAM to be available. If the VM has less than that, the task fails with the "makePool" error. After this the task uses much less memory; even for 8 cores it only uses 3.5GB in total. The best way to solve this is probably to add swap space as you suggest, since it will only be used for this short spike. |
Send message Joined: 15 Jun 08 Posts: 2509 Credit: 248,926,125 RAC: 129,317 |
maeax wrote: in this 2-core VM you can see that every athena.py needs 2.55 GByte. This may be misleading. As you can see in the picture in this post, there are 2 athenas running concurrently, each with more than 2500m (VIRT). Together this is more than the 4354m the VM has in total. Nonetheless the VM's OS cache is filled with 1575m and the CPUs run at full speed (100 %; 98 %). The real athena RAM usage can be seen in the %MEM column (41.7 %; 41.5 %). maeax wrote: In SL69 native, top shows a swap file of 1.5 GByte. Are you sure this is not the swap from your host itself, or from your SL69 VM respectively? |
Send message Joined: 2 May 07 Posts: 2193 Credit: 173,357,424 RAC: 50,850 |
|
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
maeax wrote: in this 2-core VM you can see that every athena.py needs 2.55 GByte. The RES column shows what is really being used by the processes, and this is also what is in the %MEM column. Virtual memory (VIRT) contains all possible memory addresses of the process but does not reflect how much of that is in RAM. However, one important fact is that athena processes share memory space with each other, hence the total memory used is not the sum of the RES of each process, because some of that memory is shared between multiple processes. This is of course exactly the benefit of running multi-core, because it saves memory. Unfortunately top is not able to show how much memory is shared. In other words, top is a pretty good way to see if the WU is working or not, but it cannot give you a really accurate measurement of memory usage.
Indeed with the native app the environment is whatever your host provides, so if you configure swap then ATLAS can potentially use it. |
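
The RAM formula discussed in this thread (2.6 GB + N * 0.8 GB, later raised to 2.6 GB + N * 0.9 GB) can be checked with a few lines of Python. This is only a sketch: the function name `vm_memory_mb` and the constants are made up for illustration, and it assumes 1 GB = 1000 MB, which matches the way the thread equates 4.2 GB with 4200 MB. It is not the project's actual server code.

```python
# Hypothetical helper illustrating the RAM formula quoted in the thread.
# Assumption: 1 GB = 1000 MB, as implied by "4.2 GB" <-> "4200 MB" above.

BASE_MB = 2600          # fixed part of the formula (2.6 GB)
OLD_PER_CORE_MB = 800   # original per-core term (0.8 GB)
NEW_PER_CORE_MB = 900   # per-core term after the update (0.9 GB)


def vm_memory_mb(n_cores: int, per_core_mb: int = NEW_PER_CORE_MB) -> int:
    """Return the VM memory size in MB for a task using n_cores cores."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return BASE_MB + n_cores * per_core_mb


if __name__ == "__main__":
    # Compare the old and the updated formula for a few core counts.
    for n in (1, 2, 4, 8):
        old = vm_memory_mb(n, per_core_mb=OLD_PER_CORE_MB)
        new = vm_memory_mb(n)
        print(f"{n} core(s): {old} MB (0.8 GB/core) -> {new} MB (0.9 GB/core)")
```

For a 2-core VM this gives 4200 MB with the old coefficient and 4400 MB with the new one, the same 4.2 GB and 4.4 GB figures quoted in the posts above.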
©2024 CERN