Message boards : ATLAS application : Some Validate errors

Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,830,757
RAC: 228,388
Message 29568 - Posted: 23 Mar 2017, 18:07:20 UTC

I got some validate errors today

https://lhcathome.cern.ch/lhcathome/result.php?resultid=127828797

ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (33) (Error code 65)

https://lhcathome.cern.ch/lhcathome/result.php?resultid=127824383
ID: 29568 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,448,517
RAC: 103,156
Message 29749 - Posted: 2 Apr 2017, 5:50:56 UTC
Last modified: 2 Apr 2017, 6:07:42 UTC

Within the past hour, several of my WUs have errored out after some 11-12 minutes with a validation error.

For example:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=131474610

What catches my eye in the stderr output:

"mv: cannot stat `metadata.xml': No such file or directory"
ID: 29749 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,448,517
RAC: 103,156
Message 29755 - Posted: 2 Apr 2017, 7:50:17 UTC

By now, some 15 such tasks have errored out with a validation error.

Another example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=131531536

Is there something wrong with the WUs, or is the fault with my PC?

Is no one else experiencing the same problem?
ID: 29755 · Report as offensive     Reply Quote
Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 29757 - Posted: 2 Apr 2017, 8:08:14 UTC

Yeah, these seem to be faulty WUs ...

I have around 40 or more of them.


Supporting BOINC, a great concept !
ID: 29757 · Report as offensive     Reply Quote
Profile HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29760 - Posted: 2 Apr 2017, 8:47:11 UTC

I have the same problem; it started this morning as well.
We are the product of random evolution.
ID: 29760 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,637
RAC: 1,939
Message 29762 - Posted: 2 Apr 2017, 8:53:02 UTC

Looks like a database connection can't be established

FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP

Maybe it's best to choose another sub-project until this is solved.
ID: 29762 · Report as offensive     Reply Quote
BRG

Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 29775 - Posted: 2 Apr 2017, 15:51:12 UTC - in response to Message 29757.  

Yeah, these seem to be faulty WUs ...

I have around 40 or more of them.


Yup, you and me both... I did the config file thing yesterday and all was working well... this morning I came to the shop and found pages and pages of errored tasks :(
ID: 29775 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,448,517
RAC: 103,156
Message 29776 - Posted: 2 Apr 2017, 19:18:42 UTC

There was a similar problem with CMS tasks around noon today.

Ivan, the CMS moderator, wrote:
The WMAgent server has fallen over at CERN. Please set no new tasks until I can raise someone to fix it.

And in fact, the problem was fixed a few hours later.
So I hoped that the current ATLAS problem had the same origin and would have been solved by now.
Unfortunately, it still persists :-(
ID: 29776 · Report as offensive     Reply Quote
Profile Opolis

Joined: 13 May 15
Posts: 2
Credit: 2,373,692
RAC: 2,635
Message 29791 - Posted: 3 Apr 2017, 15:04:43 UTC

Over 100 validate errors until I shut it down.
Randomly checking on task logs, every one I check shows successful completion but only about half the run time I would expect.
ID: 29791 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,003,419
RAC: 136,105
Message 29792 - Posted: 3 Apr 2017, 15:26:46 UTC - in response to Message 29791.  

Over 100 validate errors until I shut it down.
Randomly checking on task logs, every one I check shows successful completion but only about half the run time I would expect.

Looks like you were affected by the system outage at CERN.
It should have been fixed this morning, so you may try some WUs again.
ID: 29792 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,448,517
RAC: 103,156
Message 29793 - Posted: 3 Apr 2017, 15:27:00 UTC

hm, after I had read in another thread here that things should be okay now, I downloaded several new WUs, all of which seem to run okay. One of them just got finished and uploaded, and I received credit for it.
ID: 29793 · Report as offensive     Reply Quote
Profile Opolis

Joined: 13 May 15
Posts: 2
Credit: 2,373,692
RAC: 2,635
Message 29824 - Posted: 4 Apr 2017, 16:47:11 UTC

Yes, it must have been due to the outage. Tasks are running and validating just fine now.
ID: 29824 · Report as offensive     Reply Quote
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 29825 - Posted: 5 Apr 2017, 10:43:10 UTC - in response to Message 29824.  
Last modified: 5 Apr 2017, 10:43:57 UTC

I'm experiencing problems with validation too:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=132281796

The critical entries in the stderr output appear to be:
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate 2017-04-05 08:53:41,857 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate 2017-04-05 08:53:41,955 INFO Scanning logfile log.EVNTtoHITS for errors
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.transform.execute 2017-04-05 08:53:42,285 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.transform.execute 2017-04-05 08:53:46,855 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
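
For anyone who wants to pull entries like these out of a long stderr output, here is a minimal Python sketch. It assumes the line layout seen above (wrapper timestamp, `Guest Log:`, logger name, inner timestamp, level, message). The function name `critical_entries` and the regular expression are hypothetical helpers, not part of any ATLAS or BOINC tool.

```python
import re

# Assumed layout, based only on the stderr lines quoted above:
# "<date> <time> (<pid>): Guest Log: <logger> <date> <time>,<ms> <LEVEL> <message>"
GUEST_LINE = re.compile(
    r"Guest Log: (?P<logger>\S+) "
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ "
    r"(?P<level>[A-Z]+) (?P<message>.*)"
)

def critical_entries(stderr_text, levels=("ERROR", "CRITICAL", "FATAL")):
    """Return (level, logger, message) for guest log lines at the given levels."""
    found = []
    for line in stderr_text.splitlines():
        match = GUEST_LINE.search(line)
        if match and match.group("level") in levels:
            found.append(
                (match.group("level"), match.group("logger"), match.group("message"))
            )
    return found

# Two of the lines from the post above, used as sample input.
sample = """\
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate 2017-04-05 08:53:41,857 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate 2017-04-05 08:53:41,955 INFO Scanning logfile log.EVNTtoHITS for errors
"""

for level, logger, message in critical_entries(sample):
    print(level, logger, message)
```

The INFO line is skipped and only the ERROR line is printed; other wrapper or transform versions may format their lines differently, so the pattern may need adjusting.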
ID: 29825 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,003,419
RAC: 136,105
Message 29826 - Posted: 5 Apr 2017, 11:05:23 UTC - in response to Message 29825.  

From your log:
2017-04-05 08:35:03 (1120): Setting Memory Size for VM. (4000MB)
2017-04-05 08:35:03 (1120): Setting CPU Count for VM. (2)

You may want to increase the RAM value to at least 4200 MB (the project's minimum request for 2 CPUs), or better, to 4600-5000 MB.
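
A check like this can be scripted. The sketch below is only an illustration: it assumes the wrapper writes the two settings exactly as in the log lines quoted above, and the helper name `vm_settings` is made up for this example.

```python
import re

# Patterns follow the two vboxwrapper lines quoted above; other wrapper
# versions may word them differently (assumption).
MEMORY = re.compile(r"Setting Memory Size for VM\. \((\d+)MB\)")
CPUS = re.compile(r"Setting CPU Count for VM\. \((\d+)\)")

def vm_settings(log_text):
    """Return (memory_mb, cpu_count) found in the log, or None for a missing value."""
    memory = MEMORY.search(log_text)
    cpus = CPUS.search(log_text)
    return (
        int(memory.group(1)) if memory else None,
        int(cpus.group(1)) if cpus else None,
    )

log = """\
2017-04-05 08:35:03 (1120): Setting Memory Size for VM. (4000MB)
2017-04-05 08:35:03 (1120): Setting CPU Count for VM. (2)
"""

memory_mb, cpu_count = vm_settings(log)
MIN_MB_FOR_2_CPUS = 4200  # minimum for 2 CPUs as stated in this post
if cpu_count == 2 and memory_mb is not None and memory_mb < MIN_MB_FOR_2_CPUS:
    print(f"VM has {memory_mb} MB, {MIN_MB_FOR_2_CPUS - memory_mb} MB below the stated minimum")
```

The 4200 MB threshold applies only to the 2-CPU case mentioned here; the thread does not give minimums for other core counts.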
ID: 29826 · Report as offensive     Reply Quote
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 29827 - Posted: 5 Apr 2017, 11:22:06 UTC - in response to Message 29826.  

OK, thanks, I'll try that. So the recommended value of 1.6 + 1 * ncores is wrong?
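
The arithmetic behind that question is easy to check. The sketch below assumes the formula is expressed in GB and takes 1 GB as 1000 MB; both are assumptions, since the thread does not state the units. The function name is hypothetical.

```python
def formula_ram_mb(ncores):
    """RAM in MB from the '1.6 + 1 * ncores' rule, read as GB with 1 GB = 1000 MB (assumption)."""
    return 1600 + 1000 * ncores

configured_mb = 4000      # value from the log quoted in the previous reply
stated_minimum_mb = 4200  # minimum for 2 CPUs given in the previous reply

for label, value in [("formula", formula_ram_mb(2)), ("configured", configured_mb)]:
    shortfall = stated_minimum_mb - value
    print(f"{label}: {value} MB, {shortfall} MB below the stated minimum")
```

Under those assumptions, the formula gives less for two cores than both the configured 4000 MB and the stated 4200 MB minimum.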
ID: 29827 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,003,419
RAC: 136,105
Message 29828 - Posted: 5 Apr 2017, 11:45:20 UTC - in response to Message 29827.  

OK, thanks, I'll try that. So the recommended value of 1.6 + 1 * ncores is wrong?

I may be wrong, but this seems to be David Cameron's most recent post regarding the RAM formula.
The 4200 MB also corresponds to the value that is set in the server's app template.

Nonetheless, it does no harm to set the RAM value a bit higher.

If the errors still occur, it may be because of a CERN internal error.
ID: 29828 · Report as offensive     Reply Quote
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 29830 - Posted: 5 Apr 2017, 14:39:19 UTC

The ATLAS tasks are now running (and validating) OK for me too. I am running a single core at a time now to get the highest CPU efficiency, 91%.
https://lhcathome.cern.ch/lhcathome/results.php?userid=437988&offset=0&show_names=0&state=0&appid=14
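
For reference, a figure like that can be worked out from the run time and CPU time columns of a task list. This is a minimal sketch, assuming "CPU efficiency" means CPU time divided by (run time × number of cores). The function name is hypothetical, and the sample numbers are those of ATLAS task 132654702 from the task list quoted later in this thread, assumed to be a single-core run.

```python
def cpu_efficiency(run_time_s, cpu_time_s, ncores=1):
    """CPU time as a fraction of the wall-clock time available on ncores cores."""
    if run_time_s <= 0 or ncores <= 0:
        raise ValueError("run time and core count must be positive")
    return cpu_time_s / (run_time_s * ncores)

# Run time and CPU time (seconds) of task 132654702; single-core is assumed.
efficiency = cpu_efficiency(14807.79, 13672.23)
print(f"{efficiency:.1%}")
```

The result for a single task can then be compared with the 91% mentioned above; an average over many tasks may differ.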
ID: 29830 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,637
RAC: 1,939
Message 29831 - Posted: 5 Apr 2017, 15:46:11 UTC - in response to Message 29830.  

The ATLAS tasks are now running (and validating) OK for me too. I am running a single core at a time now to get the highest CPU efficiency, 91%.
https://lhcathome.cern.ch/lhcathome/results.php?userid=437988&offset=0&show_names=0&state=0&appid=14

Sorry Jim, your link is not clickable without being logged in as Jim1348.
Links that use your host ID are clickable, like https://lhcathome.cern.ch/lhcathome/results.php?hostid=10413980&offset=0&show_names=0&state=4&appid=14
ID: 29831 · Report as offensive     Reply Quote
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 29832 - Posted: 5 Apr 2017, 16:26:13 UTC - in response to Message 29831.  

Sorry Jim, your link is not clickable without being logged in as Jim1348.
Links that use your host ID are clickable

Yes, I missed that this time: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10413980&offset=0&show_names=0&state=0&appid=14
ID: 29832 · Report as offensive     Reply Quote
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 29848 - Posted: 6 Apr 2017, 20:22:35 UTC - in response to Message 29832.  
Last modified: 6 Apr 2017, 20:26:47 UTC

Hi Jim, when I look at your task list, I notice that, curiously, some of the tasks finished at the same time:
Task | Work unit | Sent | Time reported | Status | Run time (s) | CPU time (s) | Credit | Application | Platform
132654702 | 63944170 | 5 Apr 2017, 17:49:53 UTC | 6 Apr 2017, 11:35:26 UTC | Completed and validated | 14,807.79 | 13,672.23 | 529.08 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132635670 | 63934356 | 5 Apr 2017, 15:21:38 UTC | 6 Apr 2017, 11:35:26 UTC | Completed and validated | 14,406.76 | 13,282.49 | 518.60 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132635135 | 63934056 | 5 Apr 2017, 15:21:38 UTC | 6 Apr 2017, 17:54:55 UTC | Completed and validated | 45,935.68 | 43,993.34 | 309.92 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132612295 | 63922299 | 5 Apr 2017, 12:20:46 UTC | 6 Apr 2017, 16:54:38 UTC | Completed and validated | 43,920.01 | 42,135.72 | 297.01 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132599575 | 63899055 | 5 Apr 2017, 10:16:51 UTC | 6 Apr 2017, 5:46:26 UTC | Completed and validated | 12,353.49 | 11,251.34 | 450.67 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132599376 | 63915719 | 5 Apr 2017, 10:16:51 UTC | 6 Apr 2017, 15:38:28 UTC | Completed and validated | 47,240.79 | 45,292.18 | 320.46 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132588880 | 63910277 | 5 Apr 2017, 8:17:21 UTC | 6 Apr 2017, 19:02:35 UTC | Completed and validated | 64,214.80 | 42,583.39 | 434.65 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132587103 | 63909363 | 5 Apr 2017, 8:17:21 UTC | 6 Apr 2017, 5:46:26 UTC | Completed and validated | 12,611.74 | 11,344.23 | 457.74 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132578618 | 63905117 | 5 Apr 2017, 6:55:27 UTC | 6 Apr 2017, 15:39:43 UTC | Completed and validated | 64,315.95 | 13,886.79 | 436.85 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132573375 | 63902502 | 5 Apr 2017, 6:55:27 UTC | 6 Apr 2017, 2:30:55 UTC | Completed and validated | 14,328.01 | 12,910.20 | 515.87 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132569466 | 63900553 | 5 Apr 2017, 5:40:13 UTC | 6 Apr 2017, 6:30:32 UTC | Completed and validated | 45,065.16 | 42,539.11 | 308.42 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132569276 | 63900464 | 5 Apr 2017, 5:40:13 UTC | 5 Apr 2017, 22:51:25 UTC | Completed and validated | 13,434.33 | 12,089.04 | 479.66 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132569278 | 63900466 | 5 Apr 2017, 5:40:13 UTC | 6 Apr 2017, 1:15:48 UTC | Completed and validated | 14,392.85 | 13,116.73 | 516.11 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132566446 | 63899080 | 5 Apr 2017, 5:15:48 UTC | 6 Apr 2017, 6:42:26 UTC | Completed and validated | 46,350.91 | 44,009.85 | 317.85 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132566287 | 63899020 | 5 Apr 2017, 5:08:49 UTC | 5 Apr 2017, 19:07:08 UTC | Completed and validated | 11,492.36 | 10,493.09 | 395.03 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132549193 | 63890631 | 5 Apr 2017, 3:54:35 UTC | 5 Apr 2017, 15:21:37 UTC | Completed and validated | 55.56 | 52.81 | 1.53 | SixTrack v451.07 (sse2) | i686-pc-linux-gnu
132547873 | 63889984 | 5 Apr 2017, 3:46:05 UTC | 5 Apr 2017, 15:21:37 UTC | Completed and validated | 12,837.81 | 11,621.59 | 426.30 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132555825 | 63893849 | 5 Apr 2017, 3:46:05 UTC | 6 Apr 2017, 2:30:55 UTC | Completed and validated | 44,026.30 | 41,682.92 | 299.40 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132553737 | 63892822 | 5 Apr 2017, 3:45:49 UTC | 5 Apr 2017, 21:44:06 UTC | Completed and validated | 44,860.11 | 42,364.66 | 304.73 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132542632 | 63887447 | 5 Apr 2017, 3:45:49 UTC | 5 Apr 2017, 15:21:37 UTC | Completed and validated | 15,490.50 | 14,331.22 | 519.43 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu

So I decided to look at David Cameron's WUs, and apparently they show the same behaviour:
Task | Work unit | Sent | Time reported | Status | Run time (s) | CPU time (s) | Credit | Application | Platform
132850245 | 64044617 | 6 Apr 2017, 18:26:27 UTC | 6 Apr 2017, 19:15:24 UTC | Completed and validated | 2,921.55 | 9,505.75 | 454.03 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
130289199 | 62760931 | 6 Apr 2017, 15:13:54 UTC | 6 Apr 2017, 18:26:27 UTC | Completed and validated | 2,997.87 | 9,932.30 | 468.08 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132846066 | 64041842 | 6 Apr 2017, 13:56:43 UTC | 6 Apr 2017, 18:26:27 UTC | Completed and validated | 2,874.72 | 9,307.01 | 453.21 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132839563 | 64038431 | 6 Apr 2017, 12:46:38 UTC | 6 Apr 2017, 16:59:28 UTC | Completed and validated | 2,992.85 | 9,781.89 | 475.07 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132828465 | 64032743 | 6 Apr 2017, 11:33:34 UTC | 6 Apr 2017, 16:59:28 UTC | Completed and validated | 3,415.75 | 11,102.41 | 546.87 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132821394 | 64029108 | 6 Apr 2017, 10:29:35 UTC | 6 Apr 2017, 15:01:30 UTC | Completed and validated | 3,093.12 | 9,947.54 | 502.86 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132821146 | 64028921 | 6 Apr 2017, 9:39:26 UTC | 6 Apr 2017, 13:56:43 UTC | Completed and validated | 3,401.90 | 11,292.40 | 557.92 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132808547 | 64022529 | 6 Apr 2017, 8:45:56 UTC | 6 Apr 2017, 12:46:37 UTC | Completed and validated | 3,526.84 | 11,486.71 | 591.55 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132801592 | 64018929 | 6 Apr 2017, 7:43:45 UTC | 6 Apr 2017, 11:33:34 UTC | Completed and validated | 3,066.11 | 10,210.83 | 511.91 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132788198 | 64012117 | 6 Apr 2017, 6:46:54 UTC | 6 Apr 2017, 10:29:34 UTC | Completed and validated | 3,006.48 | 9,628.55 | 498.87 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu


Some WUs did not run for the same length of time but finished at the same time.
I thought there might have been a reboot, but that does not seem to be the case.
Why did the WUs finish at the same time when their elapsed times differ?
I thought they were independent...
Is there a reason? I can't find one in the logs.
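
One way to see the pattern is to group the tasks by the fourth column. The sketch below uses the first five rows of the second list above. It assumes that this column is the time the client reported the result to the server, not the time the task finished; if so, several tasks sharing one timestamp would simply mean they were reported in the same request. The function name is hypothetical.

```python
from collections import defaultdict

# (task ID, time reported, run time in seconds), copied from the second list above
tasks = [
    ("132850245", "6 Apr 2017, 19:15:24 UTC", 2921.55),
    ("130289199", "6 Apr 2017, 18:26:27 UTC", 2997.87),
    ("132846066", "6 Apr 2017, 18:26:27 UTC", 2874.72),
    ("132839563", "6 Apr 2017, 16:59:28 UTC", 2992.85),
    ("132828465", "6 Apr 2017, 16:59:28 UTC", 3415.75),
]

def group_by_reported(rows):
    """Map each 'time reported' value to the task IDs that share it."""
    groups = defaultdict(list)
    for task_id, reported, _run_time in rows:
        groups[reported].append(task_id)
    return dict(groups)

# Print only the timestamps shared by more than one task.
for reported, ids in group_by_reported(tasks).items():
    if len(ids) > 1:
        print(reported, "->", ", ".join(ids))
```

In the same list, the "sent" time of some tasks equals the "reported" time of others (for example 18:26:27), which would fit results being reported and new work being fetched in one exchange with the server.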

For David's WUs:
Task | Work unit | Sent | Time reported | Status | Run time (s) | CPU time (s) | Credit | Application | Platform
130289199 | 62760931 | 6 Apr 2017, 15:13:54 UTC | 6 Apr 2017, 18:26:27 UTC | Completed and validated | 2,997.87 | 9,932.30 | 468.08 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132846066 | 64041842 | 6 Apr 2017, 13:56:43 UTC | 6 Apr 2017, 18:26:27 UTC | Completed and validated | 2,874.72 | 9,307.01 | 453.21 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu

Did they both run in the same slot 0?
2017-04-06 19:36:20 (3757857): vboxwrapper (7.7.26196): starting
2017-04-06 19:36:21 (3757857): Feature: Checkpoint interval offset (98 seconds)
2017-04-06 19:36:21 (3757857): Detected: VirtualBox VboxManage Interface (Version: 5.1.2)
2017-04-06 19:36:21 (3757857): Detected: Minimum checkpoint interval (900.000000 seconds)
2017-04-06 19:36:21 (3757857): Successfully copied 'init_data.xml' to the shared directory.
2017-04-06 19:36:22 (3757857): Create VM. (boinc_697dfe6da1513150, slot#0)

2017-04-06 20:26:11 (3757857): Removing virtual disk drive from VirtualBox.
20:26:17 (3757857): called boinc_finish(0)

</stderr_txt>
]]>

2017-04-06 18:48:23 (3661182): vboxwrapper (7.7.26196): starting
2017-04-06 18:48:24 (3661182): Feature: Checkpoint interval offset (519 seconds)
2017-04-06 18:48:24 (3661182): Detected: VirtualBox VboxManage Interface (Version: 5.1.2)
2017-04-06 18:48:24 (3661182): Detected: Minimum checkpoint interval (900.000000 seconds)
2017-04-06 18:48:24 (3661182): Successfully copied 'init_data.xml' to the shared directory.
2017-04-06 18:48:24 (3661182): Create VM. (boinc_7e2c9e1a06014ade, slot#0)

2017-04-06 19:36:11 (3661182): Removing virtual disk drive from VirtualBox.
19:36:16 (3661182): called boinc_finish(0)

</stderr_txt>
]]>

Stranger still, the finish time in the log is not the same as the one in the task list.

Is there a bug in the site?
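
The timestamps in the two log excerpts can be compared with the task list in a few lines of Python. This is a rough sketch: the date of the `boinc_finish` lines is assumed to be 2017-04-06, like the lines before them, and the helper name is made up.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def elapsed_seconds(start, finish):
    """Seconds between two 'YYYY-MM-DD HH:MM:SS' timestamps."""
    delta = datetime.strptime(finish, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()

# "starting" and "called boinc_finish" times from the two logs above.
first = elapsed_seconds("2017-04-06 19:36:20", "2017-04-06 20:26:17")
second = elapsed_seconds("2017-04-06 18:48:23", "2017-04-06 19:36:16")
print(first, second)

# Gap between the later finish time in the log and the reported time in the list.
log_finish = datetime.strptime("2017-04-06 20:26:17", FMT)
reported_utc = datetime.strptime("2017-04-06 18:26:27", FMT)
offset = log_finish - reported_utc
print(offset)
```

The two elapsed times are within a couple of seconds of the run times 2,997.87 s and 2,874.72 s in the list, and the second log ends a few seconds before the first one starts, which fits two tasks run one after the other in slot 0. The log finish time is just under two hours ahead of the reported time; one possible explanation, which is only an assumption, is that the log is written in the host's local time (for example UTC+2) while the task list shows UTC.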
ID: 29848 · Report as offensive     Reply Quote