Some Validate errors
Author | Message |
---|---|
Joined: 27 Sep 08 Posts: 753 Credit: 573,004,600 RAC: 197,968 |
I got some validate errors today: https://lhcathome.cern.ch/lhcathome/result.php?resultid=127828797 "ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (33) (Error code 65)" https://lhcathome.cern.ch/lhcathome/result.php?resultid=127824383 |
Joined: 18 Dec 15 Posts: 1571 Credit: 68,475,804 RAC: 172,349 |
Within the past hour, I have had several WUs error out after some 11-12 minutes with a validation error. For example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=131474610 What catches my eye in the stderr: "mv: cannot stat `metadata.xml': No such file or directory" |
Joined: 18 Dec 15 Posts: 1571 Credit: 68,475,804 RAC: 172,349 |
By now, some 15 such tasks have errored out with a validation error. Another example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=131531536 Is there something wrong with the WUs, or is the fault with my PC? Is no one else experiencing the same problem? |
Joined: 2 Sep 04 Posts: 450 Credit: 173,534,958 RAC: 176,824 |
|
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
I have the same problem; it started this morning as well. We are the product of random evolution. |
Joined: 14 Jan 10 Posts: 1176 Credit: 7,446,407 RAC: 14,583 |
Looks like a database connection can't be established: "FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP". Maybe it's best to choose another sub-project until this is solved. |
Joined: 23 Dec 16 Posts: 26 Credit: 776,007 RAC: 0 |
Yeah, seems to be some kind of faulty WUs ... Yup, you and me both... I did the config file thing yesterday and all was working well... This morning I came to the shop and found pages and pages of errored tasks :( |
Joined: 18 Dec 15 Posts: 1571 Credit: 68,475,804 RAC: 172,349 |
There was a similar problem with CMS tasks around noon today. Ivan, the CMS moderator, wrote: "The WMAgent server has fallen over at CERN. Please set no new tasks until I can raise someone to fix it." And in fact, the problem was fixed a few hours later. So I hoped that the current ATLAS problem might have the same origin and would already be solved as well. But unfortunately, it still persists :-( |
Joined: 13 May 15 Posts: 2 Credit: 1,774,271 RAC: 5,873 |
Over 100 validate errors until I shut it down. Randomly checking on task logs, every one I check shows successful completion but only about half the run time I would expect. |
Joined: 15 Jun 08 Posts: 2184 Credit: 186,420,658 RAC: 132,000 |
"Over 100 validate errors until I shut it down." Looks like you were affected by the system outage at CERN. It should have been fixed this morning, so you may try some WUs again. |
Joined: 18 Dec 15 Posts: 1571 Credit: 68,475,804 RAC: 172,349 |
Hm, after I had read in another thread here that things should be okay now, I downloaded several new WUs, all of which seem to run okay. One of them has just finished and uploaded, and I received credit for it. |
Joined: 13 May 15 Posts: 2 Credit: 1,774,271 RAC: 5,873 |
Yes, it must have been due to the outage. Tasks are running and validating just fine now. |
Joined: 9 Feb 16 Posts: 48 Credit: 536,970 RAC: 2 |
I'm experiencing problems with validation too: https://lhcathome.cern.ch/lhcathome/result.php?resultid=132281796 The critical entries in the stderr output appear to be: 2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate 2017-04-05 08:53:41,857 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65) 2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate 2017-04-05 08:53:41,955 INFO Scanning logfile log.EVNTtoHITS for errors 2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.transform.execute 2017-04-05 08:53:42,285 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider" 2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.transform.execute 2017-04-05 08:53:46,855 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider") |
Joined: 15 Jun 08 Posts: 2184 Credit: 186,420,658 RAC: 132,000 |
From your log: "2017-04-05 08:35:03 (1120): Setting Memory Size for VM. (4000MB)". You may increase the RAM value to at least 4200 MB (the project's minimum request for 2 CPUs), or better, to 4600-5000 MB. |
Joined: 9 Feb 16 Posts: 48 Credit: 536,970 RAC: 2 |
OK, thanks, I'll try that. So the recommended value of 1.6 + 1 * ncores is wrong? |
Joined: 15 Jun 08 Posts: 2184 Credit: 186,420,658 RAC: 132,000 |
"OK, thanks, I'll try that. So the recommended value of 1.6 + 1 * ncores is wrong?" I may be wrong, but this seems to be David Cameron's most recent post regarding the RAM formula. The 4200 MB also corresponds to the value that is set in the server's app template. Nonetheless, it does no harm to set the RAM value a bit higher. If the errors still occur, it may be because of a CERN internal error. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
The ATLAS are now running (and validating) OK for me too. I am running a single core at a time now to get the highest CPU efficiency, 91%. https://lhcathome.cern.ch/lhcathome/results.php?userid=437988&offset=0&show_names=0&state=0&appid=14 |
Joined: 14 Jan 10 Posts: 1176 Credit: 7,446,407 RAC: 14,583 |
"The ATLAS are now running (and validating) OK for me too. I am running a single core at a time now to get the highest CPU efficiency, 91%." Sorry Jim, your link is not clickable without being logged in as Jim1348. Links that use your host ID are clickable for everyone, like https://lhcathome.cern.ch/lhcathome/results.php?hostid=10413980&offset=0&show_names=0&state=4&appid=14 |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"Sorry Jim, your link is not clickable without being logged in as Jim1348." Yes, I missed that this time: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10413980&offset=0&show_names=0&state=0&appid=14 |
Joined: 24 Jul 16 Posts: 88 Credit: 239,917 RAC: 0 |
Hi Jim, when I look at your task list, I notice that, curiously, some of the tasks finished at the same time: 132654702 63944170 5 Apr 2017, 17:49:53 UTC 6 Apr 2017, 11:35:26 UTC Completed and validated 14,807.79 13,672.23 529.08 ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) So I decided to look at David Cameron's WUs, and apparently they show the same behaviour: 132850245 64044617 6 Apr 2017, 18:26:27 UTC 6 Apr 2017, 19:15:24 UTC Completed and validated 2,921.55 9,505.75 454.03 ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) Some WUs didn't run for the same length of time but finished at the same time. I thought there had perhaps been a reboot, but it doesn't seem so. Why did the WUs finish at the same time without having the same elapsed times? I thought they were independent... Is there a reason? I can't see one in the logs. For David's WU: 130289199 62760931 6 Apr 2017, 15:13:54 UTC 6 Apr 2017, 18:26:27 UTC Completed and validated 2,997.87 9,932.30 468.08 ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) Did they run in the same slot 0? 2017-04-06 19:36:20 (3757857): vboxwrapper (7.7.26196): starting 2017-04-06 20:26:11 (3757857): Removing virtual disk drive from VirtualBox. 2017-04-06 18:48:23 (3661182): vboxwrapper (7.7.26196): starting 2017-04-06 19:36:11 (3661182): Removing virtual disk drive from VirtualBox. And, more strangely, the finish time in the log is not the same as in the task list. Are there bugs on the site? |
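
For anyone who wants to repeat the RAM check discussed in this thread on their own stderr output, here is a minimal Python sketch. It reads the configured VM memory from a line of the form quoted above ("Setting Memory Size for VM. (4000MB)"), compares it with the two figures mentioned (the "1.6 + 1 * ncores" rule and the 4200 MB project minimum cited for 2 CPUs), and lists the lines carrying an ERROR, CRITICAL or FATAL keyword. The function names are made up for illustration, and reading the rule in GB with 1 GB = 1000 MB is an assumption; this is not project tooling.

```python
import re

# Figures quoted in the thread. Reading the rule in GB is an assumption.
BASE_GB = 1.6
PER_CORE_GB = 1.0
PROJECT_MIN_MB_2_CPUS = 4200  # "project min request for 2 CPUs"

# Matches e.g. "Setting Memory Size for VM. (4000MB)"
VM_MEMORY = re.compile(r"Setting Memory Size for VM\. \((\d+)MB\)")
# Upper-case severity keywords as they appear in the quoted guest log
SEVERITY = re.compile(r"\b(?:ERROR|CRITICAL|FATAL)\b")


def formula_ram_mb(ncores):
    """RAM in MB from the '1.6 + 1 * ncores' rule (hypothetical helper)."""
    return round((BASE_GB + PER_CORE_GB * ncores) * 1000)


def vm_memory_mb(stderr_text):
    """Return the configured VM memory in MB, or None if the line is absent."""
    match = VM_MEMORY.search(stderr_text)
    return int(match.group(1)) if match else None


def problem_lines(stderr_text):
    """Return the log lines that carry an ERROR, CRITICAL or FATAL keyword."""
    return [line for line in stderr_text.splitlines() if SEVERITY.search(line)]


if __name__ == "__main__":
    # Sample lines copied from the posts above
    sample = "\n".join([
        "2017-04-05 08:35:03 (1120): Setting Memory Size for VM. (4000MB)",
        "2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate "
        "2017-04-05 08:53:41,857 ERROR Validation of return code failed: "
        "Non-zero return code from EVNTtoHITS (65) (Error code 65)",
        "2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate "
        "2017-04-05 08:53:41,955 INFO Scanning logfile log.EVNTtoHITS for errors",
    ])
    configured = vm_memory_mb(sample)
    print("configured:", configured, "MB")
    print("rule for 2 cores:", formula_ram_mb(2), "MB")
    print("below project minimum:", configured < PROJECT_MIN_MB_2_CPUS)
    for line in problem_lines(sample):
        print("problem:", line)
```

With the numbers from the thread, the rule gives 3600 MB for 2 cores, which is below both the 4000 MB that was configured and the 4200 MB minimum, so raising the value as suggested looks consistent.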
©2023 CERN