Some Validate errors


Message boards : ATLAS application : Some Validate errors

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 376
Credit: 88,664,173
RAC: 174,222
Message 29568 - Posted: 23 Mar 2017, 18:07:20 UTC

I got some validate errors today

https://lhcathome.cern.ch/lhcathome/result.php?resultid=127828797

ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (33) (Error code 65)

https://lhcathome.cern.ch/lhcathome/result.php?resultid=127824383

Erich56
Joined: 18 Dec 15
Posts: 383
Credit: 3,873,774
RAC: 7,567
Message 29749 - Posted: 2 Apr 2017, 5:50:56 UTC
Last modified: 2 Apr 2017, 6:07:42 UTC

Within the past hour, I have had several WUs that errored out after some 11-12 minutes with a validation error.

For example:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=131474610

what catches my eye in the stderr:

"mv: cannot stat `metadata.xml': No such file or Directory"

Erich56
Joined: 18 Dec 15
Posts: 383
Credit: 3,873,774
RAC: 7,567
Message 29755 - Posted: 2 Apr 2017, 7:50:17 UTC

By now, some 15 such tasks have errored out with a validation error.

Another example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=131531536

Is there something wrong with the WUs, or is the fault with my PC?

No one else experiencing the same problem?

Profile Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 303
Credit: 42,190,797
RAC: 5,108
Message 29757 - Posted: 2 Apr 2017, 8:08:14 UTC

Yeah, seems to be some kind of faulty WUs ...

Have round about 40 or more of them
____________


Supporting BOINC, a great concept !

Profile HerveUAE
Joined: 18 Dec 16
Posts: 120
Credit: 6,749,027
RAC: 20,218
Message 29760 - Posted: 2 Apr 2017, 8:47:11 UTC

I have the same problem, started this morning as well.
____________
We are the product of random evolution.

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,809
RAC: 2,011
Message 29762 - Posted: 2 Apr 2017, 8:53:02 UTC

Looks like a database connection can't be established

FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP

Maybe it's best to choose another sub-project until this is solved.

BRG
Joined: 23 Dec 16
Posts: 26
Credit: 776,007
RAC: 0
Message 29775 - Posted: 2 Apr 2017, 15:51:12 UTC - in response to Message 29757.

Yeah, seems to be some kind of faulty WUs ...

Have round about 40 or more of them


Yup, you and me both... I did the config file thing yesterday and all was working well... This morning I came to the shop and found pages and pages of errored tasks :(

Erich56
Joined: 18 Dec 15
Posts: 383
Credit: 3,873,774
RAC: 7,567
Message 29776 - Posted: 2 Apr 2017, 19:18:42 UTC

There was a similar problem with CMS tasks around noon today.

Ivan, the CMS moderator, wrote:

The WMAgent server has fallen over at CERN. Please set no new tasks until I can raise someone to fix it.

And in fact, the problem was fixed a few hours later.
So I hoped that the current ATLAS problem may have the same origin and would also be solved already.
But unfortunately, it still prevails :-(

Profile Opolis
Joined: 13 May 15
Posts: 2
Credit: 281,060
RAC: 0
Message 29791 - Posted: 3 Apr 2017, 15:04:43 UTC

Over 100 validate errors until I shut it down.
Randomly checking on task logs, every one I check shows successful completion but only about half the run time I would expect.

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 29792 - Posted: 3 Apr 2017, 15:26:46 UTC - in response to Message 29791.

Over 100 validate errors until I shut it down.
Randomly checking on task logs, every one I check shows successful completion but only about half the run time I would expect.

Looks like you were affected by the system outage at CERN.
It should have been fixed this morning, so you may want to try some WUs again.

Erich56
Joined: 18 Dec 15
Posts: 383
Credit: 3,873,774
RAC: 7,567
Message 29793 - Posted: 3 Apr 2017, 15:27:00 UTC

hm, after I had read in another thread here that things should be okay now, I downloaded several new WUs, all of which seem to run okay. One of them just got finished and uploaded, and I received credit for it.

Profile Opolis
Joined: 13 May 15
Posts: 2
Credit: 281,060
RAC: 0
Message 29824 - Posted: 4 Apr 2017, 16:47:11 UTC

Yes, it must have been due to the outage. Tasks are running and validating just fine now.

Brummig
Joined: 9 Feb 16
Posts: 18
Credit: 218,774
RAC: 274
Message 29825 - Posted: 5 Apr 2017, 10:43:10 UTC - in response to Message 29824.
Last modified: 5 Apr 2017, 10:43:57 UTC

I'm experiencing problems with validation too:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=132281796

The critical entries in the stderr output appear to be:
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate 2017-04-05 08:53:41,857 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate 2017-04-05 08:53:41,955 INFO Scanning logfile log.EVNTtoHITS for errors
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.transform.execute 2017-04-05 08:53:42,285 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.transform.execute 2017-04-05 08:53:46,855 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
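For anyone who wants to pull the critical entries out of a long stderr dump automatically, here is a minimal sketch. It assumes the "Guest Log: <logger> <timestamp>,<ms> <LEVEL> <message>" layout seen in the lines above; the function names are made up for illustration and are not part of BOINC or the ATLAS software.

```python
import re

# Three lines copied from the stderr output quoted above.
SAMPLE_STDERR = """\
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate 2017-04-05 08:53:41,857 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.trfExe.validate 2017-04-05 08:53:41,955 INFO Scanning logfile log.EVNTtoHITS for errors
2017-04-05 08:55:57 (1120): Guest Log: PyJobTransforms.transform.execute 2017-04-05 08:53:42,285 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
"""

# Assumed line layout: "Guest Log: <logger> <date> <time>,<ms> <LEVEL> <message>"
GUEST_LINE = re.compile(
    r"Guest Log: (?P<logger>\S+) "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ "
    r"(?P<level>[A-Z]+) (?P<msg>.*)"
)
SERIOUS_LEVELS = {"ERROR", "CRITICAL", "FATAL"}


def critical_entries(stderr_text):
    """Return (level, message) pairs for guest log lines at a serious level."""
    hits = []
    for line in stderr_text.splitlines():
        match = GUEST_LINE.search(line)
        if match and match.group("level") in SERIOUS_LEVELS:
            hits.append((match.group("level"), match.group("msg")))
    return hits


def error_code(message):
    """Extract the number from '(Error code N)' if present, else None."""
    match = re.search(r"\(Error code (\d+)\)", message)
    return int(match.group(1)) if match else None


for level, message in critical_entries(SAMPLE_STDERR):
    print(level, error_code(message), message[:60])
```

The INFO line is skipped, so only the ERROR and CRITICAL entries are printed.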

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 29826 - Posted: 5 Apr 2017, 11:05:23 UTC - in response to Message 29825.

From your log:

2017-04-05 08:35:03 (1120): Setting Memory Size for VM. (4000MB)
2017-04-05 08:35:03 (1120): Setting CPU Count for VM. (2)

You should increase the RAM value to at least 4200 MB (the project's minimum request for 2 CPUs), or better, to 4600-5000 MB.
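A rough way to run this check against a log yourself is sketched below. It only knows the one minimum quoted in this thread (4200 MB for 2 CPUs); the table and function names are illustrative assumptions, not an official project list.

```python
import re

# The two vboxwrapper lines quoted above.
LOG = """\
2017-04-05 08:35:03 (1120): Setting Memory Size for VM. (4000MB)
2017-04-05 08:35:03 (1120): Setting CPU Count for VM. (2)
"""

# Only the value mentioned in this thread; other CPU counts are unknown here.
PROJECT_MIN_MB = {2: 4200}


def vm_settings(log_text):
    """Return (memory_mb, cpu_count) as read from the vboxwrapper log lines."""
    mem = re.search(r"Setting Memory Size for VM\. \((\d+)MB\)", log_text)
    cpus = re.search(r"Setting CPU Count for VM\. \((\d+)\)", log_text)
    return (
        int(mem.group(1)) if mem else None,
        int(cpus.group(1)) if cpus else None,
    )


def below_minimum(memory_mb, cpu_count):
    """True if memory is under the known minimum, None if nothing is known."""
    minimum = PROJECT_MIN_MB.get(cpu_count)
    if minimum is None or memory_mb is None:
        return None
    return memory_mb < minimum


memory_mb, cpu_count = vm_settings(LOG)
print(memory_mb, cpu_count, below_minimum(memory_mb, cpu_count))
```

For the log above this reports 4000 MB with 2 CPUs, which is under the 4200 MB minimum.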

Brummig
Joined: 9 Feb 16
Posts: 18
Credit: 218,774
RAC: 274
Message 29827 - Posted: 5 Apr 2017, 11:22:06 UTC - in response to Message 29826.

OK, thanks, I'll try that. So the recommended value of 1.6 + 1 * ncores is wrong?
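To put numbers on that formula, here is a small sketch. It assumes the result is in GB and converts with 1 GB = 1000 MB; neither unit is stated in the thread, and the function name is made up for illustration.

```python
def formula_ram_mb(ncores, base_gb=1.6, per_core_gb=1.0):
    """RAM from the '1.6 + 1 * ncores' rule, assuming GB and 1000 MB per GB."""
    return round((base_gb + per_core_gb * ncores) * 1000)


QUOTED_MIN_MB_2_CPUS = 4200  # minimum for 2 CPUs as quoted in this thread

for ncores in (1, 2, 4):
    print(ncores, formula_ram_mb(ncores))

# For 2 cores the rule gives 3600 MB, which is below the quoted 4200 MB.
print(formula_ram_mb(2) < QUOTED_MIN_MB_2_CPUS)
```

Under those assumptions the rule falls 600 MB short of the quoted minimum for a 2-core task.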

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,271
RAC: 1,830
Message 29828 - Posted: 5 Apr 2017, 11:45:20 UTC - in response to Message 29827.

OK, thanks, I'll try that. So the recommended value of 1.6 + 1 * ncores is wrong?

I may be wrong, but this seems to be David Cameron's most recent post regarding the RAM formula.
The 4200 MB also corresponds to the value that is set in the server's app template.

Nonetheless, it does no harm to set the RAM value a bit higher.

If the errors still occur, it may be because of a CERN-internal error.

Jim1348
Joined: 15 Nov 14
Posts: 86
Credit: 3,721,688
RAC: 14,000
Message 29830 - Posted: 5 Apr 2017, 14:39:19 UTC

The ATLAS tasks are now running (and validating) OK for me too. I am running a single core at a time now to get the highest CPU efficiency, 91%.
https://lhcathome.cern.ch/lhcathome/results.php?userid=437988&offset=0&show_names=0&state=0&appid=14

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,809
RAC: 2,011
Message 29831 - Posted: 5 Apr 2017, 15:46:11 UTC - in response to Message 29830.

The ATLAS tasks are now running (and validating) OK for me too. I am running a single core at a time now to get the highest CPU efficiency, 91%.
https://lhcathome.cern.ch/lhcathome/results.php?userid=437988&offset=0&show_names=0&state=0&appid=14

Sorry Jim, your link is not clickable without being logged in as Jim1348.
Links with your host ID are clickable, like https://lhcathome.cern.ch/lhcathome/results.php?hostid=10413980&offset=0&show_names=0&state=4&appid=14

Jim1348
Joined: 15 Nov 14
Posts: 86
Credit: 3,721,688
RAC: 14,000
Message 29832 - Posted: 5 Apr 2017, 16:26:13 UTC - in response to Message 29831.

Sorry Jim, your link is not clickable without being logged in as Jim1348.
Links with your host ID are clickable

Yes, I missed that this time: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10413980&offset=0&show_names=0&state=0&appid=14

PHILIPPE
Joined: 24 Jul 16
Posts: 65
Credit: 128,142
RAC: 434
Message 29848 - Posted: 6 Apr 2017, 20:22:35 UTC - in response to Message 29832.
Last modified: 6 Apr 2017, 20:26:47 UTC

Hi Jim, when I look at your task list, I notice that, curiously, some of the tasks finished at the same time:

Task | Work unit | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Credit | Application | Platform
132654702 | 63944170 | 5 Apr 2017, 17:49:53 UTC | 6 Apr 2017, 11:35:26 UTC | Completed and validated | 14,807.79 | 13,672.23 | 529.08 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132635670 | 63934356 | 5 Apr 2017, 15:21:38 UTC | 6 Apr 2017, 11:35:26 UTC | Completed and validated | 14,406.76 | 13,282.49 | 518.60 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132635135 | 63934056 | 5 Apr 2017, 15:21:38 UTC | 6 Apr 2017, 17:54:55 UTC | Completed and validated | 45,935.68 | 43,993.34 | 309.92 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132612295 | 63922299 | 5 Apr 2017, 12:20:46 UTC | 6 Apr 2017, 16:54:38 UTC | Completed and validated | 43,920.01 | 42,135.72 | 297.01 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132599575 | 63899055 | 5 Apr 2017, 10:16:51 UTC | 6 Apr 2017, 5:46:26 UTC | Completed and validated | 12,353.49 | 11,251.34 | 450.67 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132599376 | 63915719 | 5 Apr 2017, 10:16:51 UTC | 6 Apr 2017, 15:38:28 UTC | Completed and validated | 47,240.79 | 45,292.18 | 320.46 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132588880 | 63910277 | 5 Apr 2017, 8:17:21 UTC | 6 Apr 2017, 19:02:35 UTC | Completed and validated | 64,214.80 | 42,583.39 | 434.65 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132587103 | 63909363 | 5 Apr 2017, 8:17:21 UTC | 6 Apr 2017, 5:46:26 UTC | Completed and validated | 12,611.74 | 11,344.23 | 457.74 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132578618 | 63905117 | 5 Apr 2017, 6:55:27 UTC | 6 Apr 2017, 15:39:43 UTC | Completed and validated | 64,315.95 | 13,886.79 | 436.85 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132573375 | 63902502 | 5 Apr 2017, 6:55:27 UTC | 6 Apr 2017, 2:30:55 UTC | Completed and validated | 14,328.01 | 12,910.20 | 515.87 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132569466 | 63900553 | 5 Apr 2017, 5:40:13 UTC | 6 Apr 2017, 6:30:32 UTC | Completed and validated | 45,065.16 | 42,539.11 | 308.42 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132569276 | 63900464 | 5 Apr 2017, 5:40:13 UTC | 5 Apr 2017, 22:51:25 UTC | Completed and validated | 13,434.33 | 12,089.04 | 479.66 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132569278 | 63900466 | 5 Apr 2017, 5:40:13 UTC | 6 Apr 2017, 1:15:48 UTC | Completed and validated | 14,392.85 | 13,116.73 | 516.11 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132566446 | 63899080 | 5 Apr 2017, 5:15:48 UTC | 6 Apr 2017, 6:42:26 UTC | Completed and validated | 46,350.91 | 44,009.85 | 317.85 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132566287 | 63899020 | 5 Apr 2017, 5:08:49 UTC | 5 Apr 2017, 19:07:08 UTC | Completed and validated | 11,492.36 | 10,493.09 | 395.03 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132549193 | 63890631 | 5 Apr 2017, 3:54:35 UTC | 5 Apr 2017, 15:21:37 UTC | Completed and validated | 55.56 | 52.81 | 1.53 | SixTrack v451.07 (sse2) | i686-pc-linux-gnu
132547873 | 63889984 | 5 Apr 2017, 3:46:05 UTC | 5 Apr 2017, 15:21:37 UTC | Completed and validated | 12,837.81 | 11,621.59 | 426.30 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132555825 | 63893849 | 5 Apr 2017, 3:46:05 UTC | 6 Apr 2017, 2:30:55 UTC | Completed and validated | 44,026.30 | 41,682.92 | 299.40 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132553737 | 63892822 | 5 Apr 2017, 3:45:49 UTC | 5 Apr 2017, 21:44:06 UTC | Completed and validated | 44,860.11 | 42,364.66 | 304.73 | Theory Simulation v262.70 (vbox64) | x86_64-pc-linux-gnu
132542632 | 63887447 | 5 Apr 2017, 3:45:49 UTC | 5 Apr 2017, 15:21:37 UTC | Completed and validated | 15,490.50 | 14,331.22 | 519.43 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu

So I decided to look at David Cameron's WUs, and apparently they show the same behaviour:

Task | Work unit | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Credit | Application | Platform
132850245 | 64044617 | 6 Apr 2017, 18:26:27 UTC | 6 Apr 2017, 19:15:24 UTC | Completed and validated | 2,921.55 | 9,505.75 | 454.03 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
130289199 | 62760931 | 6 Apr 2017, 15:13:54 UTC | 6 Apr 2017, 18:26:27 UTC | Completed and validated | 2,997.87 | 9,932.30 | 468.08 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132846066 | 64041842 | 6 Apr 2017, 13:56:43 UTC | 6 Apr 2017, 18:26:27 UTC | Completed and validated | 2,874.72 | 9,307.01 | 453.21 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132839563 | 64038431 | 6 Apr 2017, 12:46:38 UTC | 6 Apr 2017, 16:59:28 UTC | Completed and validated | 2,992.85 | 9,781.89 | 475.07 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132828465 | 64032743 | 6 Apr 2017, 11:33:34 UTC | 6 Apr 2017, 16:59:28 UTC | Completed and validated | 3,415.75 | 11,102.41 | 546.87 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132821394 | 64029108 | 6 Apr 2017, 10:29:35 UTC | 6 Apr 2017, 15:01:30 UTC | Completed and validated | 3,093.12 | 9,947.54 | 502.86 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132821146 | 64028921 | 6 Apr 2017, 9:39:26 UTC | 6 Apr 2017, 13:56:43 UTC | Completed and validated | 3,401.90 | 11,292.40 | 557.92 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132808547 | 64022529 | 6 Apr 2017, 8:45:56 UTC | 6 Apr 2017, 12:46:37 UTC | Completed and validated | 3,526.84 | 11,486.71 | 591.55 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132801592 | 64018929 | 6 Apr 2017, 7:43:45 UTC | 6 Apr 2017, 11:33:34 UTC | Completed and validated | 3,066.11 | 10,210.83 | 511.91 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132788198 | 64012117 | 6 Apr 2017, 6:46:54 UTC | 6 Apr 2017, 10:29:34 UTC | Completed and validated | 3,006.48 | 9,628.55 | 498.87 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu


Some WUs didn't run for the same length of time but finished at the same time.
I thought there had perhaps been a reboot, but that doesn't seem to be the case.
Why did the WUs finish at the same time when their elapsed times differ?
I thought they were independent...
Is there a reason? I can't see one in the logs.
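To make the pattern easier to see, here is a small sketch that groups tasks by the second timestamp in each row. The five rows are copied from David Cameron's list above; the function name is made up for illustration. On the BOINC results page that column is labelled "Time reported or deadline", so one possible reading (not confirmed anywhere in this thread) is that it records when the client reported the task to the server rather than when the task finished.

```python
from collections import defaultdict

# (task id, second timestamp, run time in seconds), copied from the list above.
TASKS = [
    ("132850245", "6 Apr 2017, 19:15:24 UTC", 2921.55),
    ("130289199", "6 Apr 2017, 18:26:27 UTC", 2997.87),
    ("132846066", "6 Apr 2017, 18:26:27 UTC", 2874.72),
    ("132839563", "6 Apr 2017, 16:59:28 UTC", 2992.85),
    ("132828465", "6 Apr 2017, 16:59:28 UTC", 3415.75),
]


def shared_timestamps(tasks):
    """Return only the timestamps that more than one task has in common."""
    groups = defaultdict(list)
    for task_id, timestamp, run_time in tasks:
        groups[timestamp].append((task_id, run_time))
    return {ts: rows for ts, rows in groups.items() if len(rows) > 1}


for timestamp, rows in shared_timestamps(TASKS).items():
    print(timestamp, rows)
```

Two timestamps are shared by two tasks each, even though the run times within each pair differ.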

For these two of David's WUs:

Task | Work unit | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Credit | Application | Platform
130289199 | 62760931 | 6 Apr 2017, 15:13:54 UTC | 6 Apr 2017, 18:26:27 UTC | Completed and validated | 2,997.87 | 9,932.30 | 468.08 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu
132846066 | 64041842 | 6 Apr 2017, 13:56:43 UTC | 6 Apr 2017, 18:26:27 UTC | Completed and validated | 2,874.72 | 9,307.01 | 453.21 | ATLAS Simulation v1.01 (vbox64_mt_mcore_atlas) | x86_64-pc-linux-gnu

Did they both run in the same slot 0?
2017-04-06 19:36:20 (3757857): vboxwrapper (7.7.26196): starting
2017-04-06 19:36:21 (3757857): Feature: Checkpoint interval offset (98 seconds)
2017-04-06 19:36:21 (3757857): Detected: VirtualBox VboxManage Interface (Version: 5.1.2)
2017-04-06 19:36:21 (3757857): Detected: Minimum checkpoint interval (900.000000 seconds)
2017-04-06 19:36:21 (3757857): Successfully copied 'init_data.xml' to the shared directory.
2017-04-06 19:36:22 (3757857): Create VM. (boinc_697dfe6da1513150, slot#0)

2017-04-06 20:26:11 (3757857): Removing virtual disk drive from VirtualBox.
20:26:17 (3757857): called boinc_finish(0)

</stderr_txt>
]]>

2017-04-06 18:48:23 (3661182): vboxwrapper (7.7.26196): starting
2017-04-06 18:48:24 (3661182): Feature: Checkpoint interval offset (519 seconds)
2017-04-06 18:48:24 (3661182): Detected: VirtualBox VboxManage Interface (Version: 5.1.2)
2017-04-06 18:48:24 (3661182): Detected: Minimum checkpoint interval (900.000000 seconds)
2017-04-06 18:48:24 (3661182): Successfully copied 'init_data.xml' to the shared directory.
2017-04-06 18:48:24 (3661182): Create VM. (boinc_7e2c9e1a06014ade, slot#0)

2017-04-06 19:36:11 (3661182): Removing virtual disk drive from VirtualBox.
19:36:16 (3661182): called boinc_finish(0)

</stderr_txt>
]]>

Stranger still, the finish time in the log is not the same as in the task list.

Are there bugs on the site?
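One way to cross-check the logs against the task list is to compute the elapsed time from the two log excerpts above. The sketch below takes the "vboxwrapper ... starting" timestamp and the "called boinc_finish(0)" time from each excerpt. The UTC+2 offset is purely an assumption (the logs do not state a time zone), and the function names are made up for illustration.

```python
from datetime import datetime, timedelta


def elapsed_seconds(start, finish_time):
    """Seconds between a full start timestamp and a time-only finish stamp."""
    start_dt = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
    finish_t = datetime.strptime(finish_time, "%H:%M:%S").time()
    finish_dt = datetime.combine(start_dt.date(), finish_t)
    if finish_dt < start_dt:  # the run crossed midnight
        finish_dt += timedelta(days=1)
    return int((finish_dt - start_dt).total_seconds())


def finish_as_utc(start, finish_time, assumed_offset_hours=2):
    """Finish time shifted to UTC, ASSUMING the log uses local time at UTC+2."""
    start_dt = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
    finish_dt = start_dt + timedelta(seconds=elapsed_seconds(start, finish_time))
    return finish_dt - timedelta(hours=assumed_offset_hours)


# First excerpt: started 19:36:20, boinc_finish at 20:26:17.
print(elapsed_seconds("2017-04-06 19:36:20", "20:26:17"))
print(finish_as_utc("2017-04-06 19:36:20", "20:26:17"))

# Second excerpt: started 18:48:23, boinc_finish at 19:36:16.
print(elapsed_seconds("2017-04-06 18:48:23", "19:36:16"))
print(finish_as_utc("2017-04-06 18:48:23", "19:36:16"))
```

The elapsed times come out at 2997 and 2873 seconds, which are close to the run times of 2,997.87 and 2,874.72 seconds in the task list. If the UTC+2 assumption holds, the first log ends about ten seconds before 18:26:27 UTC, while the second ends roughly fifty minutes before that same timestamp, so the task-list column may not be a finish time at all.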
