Atlas 2 core tasks fail with validation error
Author | Message |
---|---|
Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
Hello everyone! Today I wanted to switch my machine from processing all available projects (which runs absolutely smoothly, including 4-core Atlas tasks) to Atlas only. My goal was to crunch three 2-core Atlas tasks concurrently, so I switched the preferences to Atlas only with a maximum of 2 cores. The tasks run, but all of them finish with a validation error. I have no clue what's wrong here, since the stderr shows no real error. Anybody got a hint? https://lhcathome.cern.ch/lhcathome/result.php?resultid=137977639 Thanks and regards, djoser. Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Joined: 15 Jun 08 Posts: 2530 Credit: 253,722,201 RAC: 41,981 |
"I have no clue what's wrong here, since the stderr shows no real error." Of course it does: 2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.trfExe.preExecute 2017-05-08 17:28:38,041 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS "Anybody got a hint?" At the very least, give each VM more RAM. 4200 MB is the project's standard, but it is sometimes not enough. Your host has enough RAM to set 5000 MB per VM via app_config.xml. If the errors persist, consider updating your VirtualBox version. |
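For readers unsure what such an app_config.xml might look like, here is a minimal Python sketch that generates one. The element names (`app`, `max_concurrent`, `app_version`, `avg_ncpus`, `cmdline`) follow BOINC's app_config.xml format as I understand it. The app name `ATLAS`, the plan class `vbox64_mt_mcore_atlas`, and the `--memory_size_mb` wrapper option are assumptions that are not confirmed anywhere in this thread, so check them against your own client_state.xml before using the result. `build_app_config` is a hypothetical helper, not part of BOINC.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, ncpus, memory_mb, max_concurrent):
    """Return app_config.xml text for one multi-core VirtualBox app version."""
    root = ET.Element("app_config")

    # Limit how many tasks of this app run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # Per-task CPU count and the command line handed to the VM wrapper.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Assumption: the wrapper accepts --memory_size_mb to size the VM's RAM.
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"

    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


# Assumed names: verify "ATLAS" and the plan class in your client_state.xml.
xml_text = build_app_config(
    app_name="ATLAS",
    plan_class="vbox64_mt_mcore_atlas",
    ncpus=2,
    memory_mb=5000,
    max_concurrent=3,
)
print(xml_text)
```

The printed text would be saved as app_config.xml in the project's directory inside the BOINC data folder. The values used here (2 cores, 5000 MB, 3 concurrent tasks) mirror the setup discussed in this thread.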
Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
There you go... how could I miss those lines? Seems to be time for a visit to the ophthalmologist. Thanks for your suggestion; I will raise the RAM for those WUs and try again. I don't think it's the VirtualBox version, though, because 4-core Atlas tasks run without any problems at all. Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
Just for feedback: I got it up and running. I set the project to "no new work" and waited for all running tasks to finish. Then I reset the project and closed BOINC. Afterwards I updated VirtualBox to the latest version and rebooted the system. After creating the app_config.xml file, I set the corresponding preferences on the website and started BOINC again. Since then everything has worked fine. By the way: is the information in the old Atlas forum still valid? http://atlasathome.cern.ch/forum_thread.php?id=568 Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Joined: 27 Mar 15 Posts: 3 Credit: 595,988 RAC: 0 |
Interesting - I've been getting the same problem with several of my Atlas tasks after recently switching to 2 CPUs. Not all of them, but a lot of them. I'm already on the latest version of VirtualBox (5.1.12) according to apt-get, but I will switch back to 3 CPUs (as I had before) and see how that goes. Thanks for the info. |
Joined: 27 Sep 08 Posts: 846 Credit: 691,144,006 RAC: 109,022 |
The memory usage was tweaked a little bit: 2.6 GB + 0.8 GB * ncores |
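As a worked example of that formula, here is a small Python sketch. It assumes 1 GB = 1000 MB, which makes the 2-core result match the 4200 MB standard mentioned earlier in the thread. The function name `atlas_vm_memory_mb` is made up for illustration.

```python
def atlas_vm_memory_mb(ncores):
    """VM memory in MB from the formula 2.6 GB + 0.8 GB * ncores.

    Assumption: 1 GB is counted as 1000 MB.
    """
    if ncores < 1:
        raise ValueError("ncores must be at least 1")
    # Integer MB arithmetic avoids floating-point rounding noise.
    return 2600 + 800 * ncores


for n in range(1, 5):
    print(f"{n} core(s): {atlas_vm_memory_mb(n)} MB")
```

Under these assumptions the formula gives 3400 MB for 1 core, 4200 MB for 2, 5000 MB for 3 and 5800 MB for 4. The 5000 MB suggested for a 2-core VM therefore equals the formula's 3-core allocation.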
Joined: 27 Mar 15 Posts: 3 Credit: 595,988 RAC: 0 |
I switched back to 3 CPUs and tasks seem to be working ok again now. Thanks! :-) |
Joined: 15 Jun 08 Posts: 2530 Credit: 253,722,201 RAC: 41,981 |
IIRC, David Cameron wrote in one of his posts that the recent RAM formula may allocate too little RAM to VMs with a low CPU count, i.e. 1 or 2. You could test a 2-core VM with 5000 MB RAM (or slightly less) via your app_config.xml. Your 3-core setup seems to run OK, but there are errors in the logfile and a HITS-file is missing: https://lhcathome.cern.ch/lhcathome/result.php?resultid=139910590 2017-05-14 04:35:50 (26083): Guest Log: - Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_16174_1494731786/PandaJob_3377360315_1494731796/athena_stdout.txt - |
Joined: 27 Mar 15 Posts: 3 Credit: 595,988 RAC: 0 |
"there are errors in the logfile and a HITS-file is missing" I hadn't spotted that, though it seems to appear only in the guest log messages rather than pointing to anything wrong with my setup, and I can't see any similar messages in later tasks, so it could just be a one-off. Also, that task went on to complete OK: 2017-05-14 04:35:52 (26083): Guest Log: Successfully finished the ATLAS job! I've now altered the preferences to use all 4 of my CPUs (I was only limiting it to 3 to let some other tasks get through at the same time). These seem fine as well: all tasks are now finishing successfully and validating OK, so I'm happy anyway. :-) |
Joined: 18 Dec 15 Posts: 1811 Credit: 118,317,631 RAC: 26,631 |
I just had a failed 2-core task. Excerpt from stderr: PyJobTransforms.trfExe.validate 2017-05-27 09:48:01,962 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65) The complete report can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=144126913 Any idea what the problem is? |
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
I have seen this error in your log: FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider As I understand it, this is due to insufficient memory allocated to the VM. I have had a few of those in the past and increased the memory to 7000 MB for 2-core tasks, which seems to have reduced the frequency of this type of error. We are the product of random evolution. |
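Since error lines like these are easy to overlook in a long stderr listing, here is a minimal Python sketch that filters a saved stderr text for lines marked ERROR or FATAL. The helper `find_error_lines` is hypothetical. The sample combines lines quoted in this thread from different tasks purely for illustration, and the keyword match is a simple heuristic rather than the project's own validation logic.

```python
import re

# Match the upper-case severity keywords seen in the excerpts quoted above.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def find_error_lines(stderr_text):
    """Return the stripped lines of stderr_text that contain ERROR or FATAL."""
    return [
        line.strip()
        for line in stderr_text.splitlines()
        if ERROR_PATTERN.search(line)
    ]


# Illustrative sample assembled from lines quoted in this thread.
sample = """\
2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.trfExe.preExecute 2017-05-08 17:28:38,041 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
PyJobTransforms.trfExe.validate 2017-05-27 09:48:01,962 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider
2017-05-14 04:35:52 (26083): Guest Log: Successfully finished the ATLAS job!
"""

hits = find_error_lines(sample)
for line in hits:
    print(line)
```

In practice the stderr text would be copied from the task's result page into a file and read from there. The two lines this finds in the sample are the return-code validation failure and the makePool failure discussed above.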
©2024 CERN