Atlas 2 core tasks fail with validation error


Message boards : ATLAS application : Atlas 2 core tasks fail with validation error

djoser
Joined: 30 Aug 14
Posts: 20
Credit: 1,872,063
RAC: 783
Message 30249 - Posted: 8 May 2017, 15:51:01 UTC

Hello everyone!

Today I wanted to switch my machine from processing all available projects (which runs absolutely smoothly, including 4-core Atlas tasks) to Atlas only.

My goal was to crunch 3 two-core Atlas tasks concurrently.
So I switched the preferences to Atlas only with a maximum of 2 cores.
This works fine so far, but all tasks finish with a validation error.

I have no clue what's wrong here, since the stderr shows no real error.

Anybody got a hint?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=137977639

Thanks and regards,
djoser.
____________
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! www.gridcoin.us

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 30250 - Posted: 8 May 2017, 16:16:18 UTC - in response to Message 30249.

I have no clue what's wrong here, since the stderr shows no real error.


Of course, it does:
2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.trfExe.preExecute 2017-05-08 17:28:38,041 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.trfExe.preExecute 2017-05-08 17:28:38,043 INFO Now writing wrapper for substep executor EVNTtoHITS
2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2017-05-08 17:28:38,044 INFO Valgrind not engaged
2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.trfExe.preExecute 2017-05-08 17:28:38,044 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.trfExe.execute 2017-05-08 17:28:38,044 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.trfExe.execute 2017-05-08 17:32:52,114 INFO EVNTtoHITS executor returns 65
2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.trfExe.validate 2017-05-08 17:32:53,022 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.trfExe.validate 2017-05-08 17:32:53,033 INFO Scanning logfile log.EVNTtoHITS for errors
2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.transform.execute 2017-05-08 17:32:53,198 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-05-08 17:35:04 (5436): Guest Log: PyJobTransforms.transform.execute 2017-05-08 17:32:56,278 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
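
For anyone who keeps missing these lines in a long stderr, here is a minimal Python sketch that pulls out every line mentioning ERROR, CRITICAL or FATAL. The names find_problem_lines and SAMPLE are made up for illustration, and the sample lines are copied from the log above with the BOINC timestamp prefix dropped. It is a plain keyword filter, not an official ATLAS or BOINC tool.

```python
import re

# Severity keywords as they appear in the transform log lines quoted above.
SEVERITIES = ("ERROR", "CRITICAL", "FATAL")
_PATTERN = re.compile(r"\b(" + "|".join(SEVERITIES) + r")\b")

# Three lines taken from the log above (BOINC timestamp prefix dropped).
SAMPLE = "\n".join([
    'PyJobTransforms.trfExe.execute 2017-05-08 17:32:52,114 INFO '
    'EVNTtoHITS executor returns 65',
    'PyJobTransforms.trfExe.validate 2017-05-08 17:32:53,022 ERROR '
    'Validation of return code failed: Non-zero return code from '
    'EVNTtoHITS (65) (Error code 65)',
    'PyJobTransforms.transform.execute 2017-05-08 17:32:53,198 CRITICAL '
    'Transform executor raised TransformValidationException: Non-zero '
    'return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: '
    '"AthMpEvtLoopMgr FATAL makePool failed for '
    'AthMpEvtLoopMgr.SharedEvtQueueProvider"',
])


def find_problem_lines(stderr_text):
    """Return (severity, line) pairs for lines that mention a severity keyword.

    The match is case-sensitive, so lowercase words such as "errors" in an
    INFO line are ignored. Only the first keyword on each line is reported.
    """
    hits = []
    for line in stderr_text.splitlines():
        match = _PATTERN.search(line)
        if match:
            hits.append((match.group(1), line.strip()))
    return hits


if __name__ == "__main__":
    for severity, line in find_problem_lines(SAMPLE):
        print(severity, "|", line)
```

To use it on a real task, paste the stderr text from the result page into a string (or read it from a saved file) and pass it to find_problem_lines.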



Anybody got a hint?


At the very least, give each VM more RAM.
4200 MB is standard according to the project but sometimes not enough.
Your host has enough RAM to set 5000 MB per VM via app_config.xml.
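
As a rough sketch of what such an app_config.xml could look like, the Python snippet below builds one with the standard library. The element names (app, max_concurrent, app_version, plan_class, avg_ncpus, cmdline) are the usual BOINC app_config.xml elements. The app name "ATLAS", the plan class "vbox64_mt_mcore_atlas" and the wrapper option --memory_size_mb are assumptions here: check the real app name and plan class in your own client_state.xml before using this. The helper name build_app_config is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, ncpus, memory_mb, max_concurrent):
    """Return the text of a BOINC app_config.xml for one multi-core VM app."""
    root = ET.Element("app_config")

    # Limit how many tasks of this app run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # Per-version settings: cores per task and the RAM passed to the wrapper.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # ASSUMPTION: the VM wrapper accepts --memory_size_mb on its command line.
    ET.SubElement(version, "cmdline").text = "--memory_size_mb %d" % memory_mb

    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # ASSUMPTION: app name and plan class; verify them in client_state.xml.
    print(build_app_config("ATLAS", "vbox64_mt_mcore_atlas",
                           ncpus=2, memory_mb=5000, max_concurrent=3))
```

The printed XML would go into app_config.xml in the project's folder inside the BOINC data directory. The values 2 cores, 5000 MB and 3 concurrent tasks are the ones discussed in this thread.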

If the errors persist, consider updating your VirtualBox version.

djoser
Joined: 30 Aug 14
Posts: 20
Credit: 1,872,063
RAC: 783
Message 30251 - Posted: 8 May 2017, 16:38:25 UTC - in response to Message 30250.



Of course, it does:
2017-05-08 17:32:53,022 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-05-08 17:32:53,198 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-05-08 17:32:56,278 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")




At the very least, give each VM more RAM.
4200 MB is standard according to the project but sometimes not enough.
Your host has enough RAM to set 5000 MB per VM via app_config.xml.

If the errors persist, consider updating your VirtualBox version.


There you go... how could I miss those lines? Seems to be time for a visit to the ophthalmologist.

Thanks for your suggestion, I will raise the RAM for those WUs and try again.
I don't think it's about the VirtualBox version, because 4-core Atlas tasks run without any problems at all!?

djoser
Joined: 30 Aug 14
Posts: 20
Credit: 1,872,063
RAC: 783
Message 30276 - Posted: 10 May 2017, 17:17:49 UTC

Just for feedback:

I got it up and running.
I set the project to "no new work" and waited for all running tasks to finish.
Then I reset the project and closed BOINC.
Afterwards I updated VirtualBox to the latest version and rebooted the system.
After creating the app_config.xml file, I set the corresponding preferences on the website and fired up BOINC again.

Since then everything works fine.

By the way: is this information from the old Atlas forum still valid?


http://atlasathome.cern.ch/forum_thread.php?id=568

KarmannGaz
Joined: 27 Mar 15
Posts: 3
Credit: 244,677
RAC: 107
Message 30328 - Posted: 13 May 2017, 13:12:09 UTC

Interesting - I've been getting the same problem with several of my Atlas tasks after recently switching to 2 CPUs. Not all of them, but a lot of them. I'm already on the latest version of VirtualBox (5.1.12) according to apt-get, but will change it to use 3 CPUs instead (like I had before) and see how that goes. Thanks for the info.

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 376
Credit: 88,664,256
RAC: 174,171
Message 30329 - Posted: 13 May 2017, 13:54:34 UTC

The memory usage was tweaked a little bit:

2.6 GB + 0.8 GB * ncores
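
To make the formula concrete, here is a small Python sketch that evaluates it for a few core counts. It assumes 1 GB = 1000 MB, which is what makes the 2-core result line up with the 4200 MB standard mentioned earlier in this thread. The function name vm_memory_mb is made up for illustration.

```python
BASE_MB = 2600      # 2.6 GB fixed part (ASSUMPTION: 1 GB = 1000 MB)
PER_CORE_MB = 800   # 0.8 GB for each core


def vm_memory_mb(ncores):
    """Memory in MB according to the formula 2.6 GB + 0.8 GB * ncores."""
    if ncores < 1:
        raise ValueError("ncores must be at least 1")
    return BASE_MB + PER_CORE_MB * ncores


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, "core(s):", vm_memory_mb(n), "MB")
```

With these assumptions the formula gives 3400 MB for 1 core, 4200 MB for 2 cores, 5000 MB for 3 cores and 5800 MB for 4 cores.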

KarmannGaz
Joined: 27 Mar 15
Posts: 3
Credit: 244,677
RAC: 107
Message 30331 - Posted: 13 May 2017, 20:44:02 UTC - in response to Message 30328.

I switched back to 3 CPUs and tasks seem to be working ok again now. Thanks! :-)

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 30332 - Posted: 14 May 2017, 6:02:49 UTC - in response to Message 30331.

IIRC David Cameron wrote in one of his posts that the recent RAM formula may not allocate enough RAM to VMs with a low core count, i.e. 1 or 2.
You could test a 2-core VM with 5000 MB RAM (or slightly below) set in your app_config.xml.

Your 3-core setup seems to run ok but there are errors in the logfile and a HITS-file is missing:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=139910590

2017-05-14 04:35:50 (26083): Guest Log: - Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_16174_1494731786/PandaJob_3377360315_1494731796/athena_stdout.txt -
2017-05-14 04:35:50 (26083): Guest Log: PyJobTransforms.trfExe.preExecute 2017-05-14 04:17:47,259 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2017-05-14 04:35:50 (26083): Guest Log: PyJobTransforms.trfExe.preExecute 2017-05-14 04:17:47,263 INFO Now writing wrapper for substep executor EVNTtoHITS
2017-05-14 04:35:50 (26083): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2017-05-14 04:17:47,263 INFO Valgrind not engaged
2017-05-14 04:35:50 (26083): Guest Log: PyJobTransforms.trfExe.preExecute 2017-05-14 04:17:47,264 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2017-05-14 04:35:50 (26083): Guest Log: PyJobTransforms.trfExe.execute 2017-05-14 04:17:47,264 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2017-05-14 04:35:50 (26083): Guest Log: PyJobTransforms.trfExe.execute 2017-05-14 04:33:13,347 INFO EVNTtoHITS executor returns 65
2017-05-14 04:35:50 (26083): Guest Log: PyJobTransforms.trfExe.validate 2017-05-14 04:33:14,272 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-05-14 04:35:50 (26083): Guest Log: PyJobTransforms.trfExe.validate 2017-05-14 04:33:14,372 INFO Scanning logfile log.EVNTtoHITS for errors
2017-05-14 04:35:50 (26083): Guest Log: PyJobTransforms.transform.execute 2017-05-14 04:33:15,162 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-05-14 04:35:50 (26083): Guest Log: PyJobTransforms.transform.execute 2017-05-14 04:33:18,624 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")

KarmannGaz
Joined: 27 Mar 15
Posts: 3
Credit: 244,677
RAC: 107
Message 30342 - Posted: 14 May 2017, 23:38:46 UTC - in response to Message 30332.

there are errors in the logfile and a HITS-file is missing


I hadn't spotted that, though this seems to be just in the guest log messages rather than anything wrong with my setup, and I can't see any similar messages in later tasks, so it could just be a one-off. Also, that task went on to complete OK:
2017-05-14 04:35:52 (26083): Guest Log: Successfully finished the ATLAS job!

I've altered the preferences to use all 4 of my CPUs now (I was only throttling it to 3 to let some other tasks get through at the same time). These seem fine as well, all are now getting successful finishes and validating ok, so I'm happy anyway. :-)

Erich56
Joined: 18 Dec 15
Posts: 383
Credit: 3,873,774
RAC: 7,567
Message 30518 - Posted: 27 May 2017, 8:47:00 UTC

I just had a failed 2-core task.

Excerpt from Stderr:

PyJobTransforms.trfExe.validate 2017-05-27 09:48:01,962 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)

The complete report can be seen here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=144126913

Any idea what the problem is?

HerveUAE
Joined: 18 Dec 16
Posts: 120
Credit: 6,749,027
RAC: 20,218
Message 30550 - Posted: 29 May 2017, 18:46:23 UTC - in response to Message 30518.

I have seen this error in your log:

FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider

I understand that this is due to insufficient memory allocated to the VM. I have had a few of those in the past and increased the memory to 7000 MB for 2-core tasks, which seems to have reduced the frequency of this type of error.
____________
We are the product of random evolution.
