Thread 'New app version 1.01'

Author	Message
Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1551 Credit: 10,067,673 RAC: 887	Message 29199 - Posted: 12 Mar 2017, 6:36:54 UTC All mentioned error tasks have the same FATAL condition: database connection COOLOFL_LAR/OFLP200 cannot be opened - STOP ID: 29199 · Reply Quote

HerveUAE Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 29200 - Posted: 12 Mar 2017, 8:52:50 UTC database connection COOLOFL_LAR/OFLP200 cannot be opened Well spotted! So does it mean that, indeed, some port needs to be allowed through my firewall / router / ISP provider? ID: 29200 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,873,857 RAC: 18,686	Message 29201 - Posted: 12 Mar 2017, 9:02:52 UTC - in response to Message 29200. Last modified: 12 Mar 2017, 9:03:12 UTC "database connection COOLOFL_LAR/OFLP200 cannot be opened" "Well spotted! So does it mean that, indeed, some port needs to be allowed through my firewall / router / ISP provider?" Yes, you have to do this, see http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use But this error does not seem to be a problem on your side: I have finished about 60 1.01 tasks successfully, but 6 failed. In the failed ones I can find this error too, so it seems to be a problem with the outside server or with some config parameters of these WUs. Not all failed WUs have this error code: database connection COOLOFL_LAR/OFLP200 cannot be opened Supporting BOINC, a great concept ! ID: 29201 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1551 Credit: 10,067,673 RAC: 887	Message 29206 - Posted: 12 Mar 2017, 11:50:58 UTC Another workunit reached the state Too many errors (may have bug) https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60039893 The errors of 4 tasks: - DetectorStore FATAL in sysInitialize(): standard std::exception is caught - IOVDbSvc FATAL Conditions database connection COOLOFL_LAR/OFLP200 cannot be opened - STOP - IOVDbSvc FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP - DetectorStore FATAL in sysInitialize(): standard std::exception is caught ID: 29206 · Reply Quote
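
The FATAL lines quoted in these posts share a regular shape, so they can be picked out of a task log mechanically. The following is a minimal Python sketch, not a project tool: the regular expression is modeled only on the lines quoted in this thread, the function name failed_connections is hypothetical, and the real Athena log layout may differ.

```python
import re

# Pattern modeled on the FATAL lines quoted in this thread.
# Assumption: real log lines keep the "database connection NAME cannot be opened" wording.
COOL_ERROR = re.compile(
    r"FATAL\s+(?:Conditions\s+)?database connection\s+(\S+)\s+cannot be opened"
)


def failed_connections(log_text):
    """Return the distinct database connections reported as not openable, in order seen."""
    seen = []
    for match in COOL_ERROR.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


# Sample lines copied from the posts above.
sample = """\
DetectorStore FATAL in sysInitialize(): standard std::exception is caught
IOVDbSvc FATAL Conditions database connection COOLOFL_LAR/OFLP200 cannot be opened - STOP
IOVDbSvc FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP
DetectorStore FATAL in sysInitialize(): standard std::exception is caught
"""

print(failed_connections(sample))
```

Lines such as the DetectorStore exception are deliberately ignored, since they do not name a connection.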

David Cameron Project administrator Project developer Project scientist Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 29216 - Posted: 13 Mar 2017, 8:25:23 UTC - in response to Message 29206. Hi, We are suffering infrastructure problems because tasks running on the ATLAS grid are overloading database servers. The host ccsqfatlasli01.in2p3.fr is one of the servers that WUs contact while running to get conditions data (basically data describing the geometry and status of the ATLAS detector), and this service is not working at the moment. I'm checking whether we should wait for this service to be recovered or whether there is an alternative one we can use. ID: 29216 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1551 Credit: 10,067,673 RAC: 887	Message 29218 - Posted: 13 Mar 2017, 9:29:54 UTC New tasks were added to the ATLAS-queue and I got 3 all ending in Validate Error. https://lhcathome.cern.ch/lhcathome/result.php?resultid=125566663 - FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP https://lhcathome.cern.ch/lhcathome/result.php?resultid=125566900 - FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP https://lhcathome.cern.ch/lhcathome/result.php?resultid=125566914 - FATAL Conditions database connection COOLOFL_CALO/OFLP200 cannot be opened - STOP I set the client to No New Work ID: 29218 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,873,857 RAC: 18,686	Message 29221 - Posted: 13 Mar 2017, 11:51:52 UTC - in response to Message 29218. New tasks were added to the ATLAS-queue and I got 3 all ending in Validate Error. Same here Supporting BOINC, a great concept ! ID: 29221 · Reply Quote

HerveUAE Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 29223 - Posted: 13 Mar 2017, 12:44:21 UTC The issues at ATLAS@Home and here apparently have the same root cause: server overload. Maybe the team at LHC should ask users to stop fetching new tasks until those servers have recovered. We are the product of random evolution. ID: 29223 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,873,857 RAC: 18,686	Message 29225 - Posted: 13 Mar 2017, 14:08:20 UTC Something seems to have changed, I have a running Atlas-Task at 36 Minutes Supporting BOINC, a great concept ! ID: 29225 · Reply Quote

David Cameron Project administrator Project developer Project scientist Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 29226 - Posted: 13 Mar 2017, 14:48:47 UTC - in response to Message 29225. It seems the affected services were brought back online; I see a few successful WUs coming in now. ID: 29226 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2292 Credit: 179,057,391 RAC: 14,681	Message 29227 - Posted: 13 Mar 2017, 15:09:15 UTC Sorry, this WU https://lhcathome.cern.ch/lhcathome/result.php?resultid=125577654 finished after 20 Min. duration and 6 min CPU:
- Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_5901_1489415914/PandaJob_3281023545_1489415928/athena_stdout.txt -
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-13 15:40:27,082 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-13 15:40:27,086 INFO Now writing wrapper for substep executor EVNTtoHITS
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2017-03-13 15:40:27,086 INFO Valgrind not engaged
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-13 15:40:27,086 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.execute 2017-03-13 15:40:27,086 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.execute 2017-03-13 15:55:50,604 INFO EVNTtoHITS executor returns 65
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.validate 2017-03-13 15:55:51,805 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.validate 2017-03-13 15:55:51,841 INFO Scanning logfile log.EVNTtoHITS for errors
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.transform.execute 2017-03-13 15:55:52,375 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.transform.execute 2017-03-13 15:55:55,732 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
2017-03-13 15:59:34 (4580): Guest Log: - Walltime -
2017-03-13 15:59:34 (4580): Guest Log: JobRetrival=0, StageIn=6, Execution=1023, StageOut=0, CleanUp=9
ID: 29227 · Reply Quote
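
The Walltime summary at the end of this log is a simple list of Key=Value pairs, so the time spent per phase can be totalled with a few lines of Python. This is only a sketch: the function name parse_walltime is hypothetical, and treating the values as seconds is an assumption (it is roughly consistent with the 20 minutes reported in the post, but the log itself does not state a unit).

```python
import re


def parse_walltime(line):
    """Parse 'Key=Value' pairs from a Walltime summary line into a dict of ints."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


# Summary line copied from the log above (spelling "JobRetrival" is as logged).
line = "JobRetrival=0, StageIn=6, Execution=1023, StageOut=0, CleanUp=9"

phases = parse_walltime(line)
total = sum(phases.values())

print(phases)
# Assumption: the unit is seconds.
print(total, "in total, about", round(total / 60, 1), "minutes if the unit is seconds")
```

Nearly all of the time falls in the Execution phase, which matches the log showing EVNTtoHITS starting at 15:40 and returning code 65 at 15:55.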

Egon Olsen Send message Joined: 4 Jul 12 Posts: 2 Credit: 3,771,642 RAC: 45	Message 29231 - Posted: 13 Mar 2017, 18:16:48 UTC Last modified: 13 Mar 2017, 18:20:59 UTC I cannot crunch more than 3 WUs at the same time.
<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>4.000000</avg_ncpus>
    <cmdline>--memory_size_mb 4600</cmdline>
  </app_version>
</app_config>
max memory usage when active: 117897.91MB
max memory usage when idle: 129687.70MB
max disk usage: 200.70GB
Processor: 36 Intel(R) Xeon
Memory: 127.93 GB physical, 127.93 GB virtual
Disk: 232.33 GB total, 182.25 GB free
Win 10 x64 Pro
Boinc 7.6.33 x64
VirtualBox 5.1.14
ID: 29231 · Reply Quote
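
As a rough cross-check of the figures in this post, the number of tasks that would fit by CPU count and by memory alone can be worked out directly. The sketch below is only arithmetic on the posted numbers, under stated assumptions: avg_ncpus is taken as the CPUs per task, --memory_size_mb as the memory per VM, and "max memory usage when active" as the memory budget. It ignores disk space, other BOINC preferences, and any project-side job limits, so it does not identify the cause of the 3-task limit. The function name max_tasks is hypothetical.

```python
import math


def max_tasks(total_cpus, cpus_per_task, mem_budget_mb, mem_per_task_mb):
    """Return (limit by CPU, limit by memory, the smaller of the two)."""
    by_cpu = total_cpus // cpus_per_task
    by_mem = math.floor(mem_budget_mb / mem_per_task_mb)
    return by_cpu, by_mem, min(by_cpu, by_mem)


# Figures from the post: 36 CPUs, 4 CPUs per task,
# 117897.91 MB usable when active, 4600 MB per VM.
by_cpu, by_mem, limit = max_tasks(36, 4, 117897.91, 4600)

print("tasks by CPU:   ", by_cpu)
print("tasks by memory:", by_mem)
print("smaller limit:  ", limit)
```

Under these assumptions both limits are well above 3, which suggests looking at other settings, such as the job limit in the project preferences asked about in the next reply.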

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 931 Credit: 780,968,665 RAC: 80,874	Message 29238 - Posted: 14 Mar 2017, 7:05:36 UTC - in response to Message 29231. What do you have your max number of jobs set to? If you set to 24 then you are at the max for project. ID: 29238 · Reply Quote

David Cameron Project administrator Project developer Project scientist Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 29239 - Posted: 14 Mar 2017, 8:29:21 UTC Hi all, The infrastructure problems were fixed yesterday so most WU should succeed now. Sorry for the inconvenience. ID: 29239 · Reply Quote

Egon Olsen Send message Joined: 4 Jul 12 Posts: 2 Credit: 3,771,642 RAC: 45	Message 29242 - Posted: 14 Mar 2017, 10:47:22 UTC - in response to Message 29238. What do you have your max number of jobs set to? If you set to 24 then you are at the max for project. No Limit ID: 29242 · Reply Quote

tullio Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 29243 - Posted: 14 Mar 2017, 10:55:38 UTC I completed one 1.01 but got a validate error. Tullio ID: 29243 · Reply Quote

HerveUAE Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 29247 - Posted: 14 Mar 2017, 13:39:05 UTC I think there may still be some other issue. All ATLAS WUs still fail on my machine, and the following 2 WUs also failed on another machine: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60483240 https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60460521 Maybe this is the root cause of a problem that led to the servers being overloaded over the weekend. We are the product of random evolution. ID: 29247 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,873,857 RAC: 18,686	Message 29248 - Posted: 14 Mar 2017, 13:58:21 UTC - in response to Message 29247. "All ATLAS WUs still fail on my machine," Sorry, but then you have a problem on your machine or your network. I don't know if you can access this list, but I have already crunched a lot of Atlas 1.01 tasks successfully since this morning: https://lhcathome.cern.ch/lhcathome/results.php?userid=555&offset=0&show_names=0&state=4&appid=14 This WU https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60460521 is now on one of my machines. I have bumped it up to be the next one crunched, so let's see later what happens. If really all Atlas WUs on your machine(s) are failing, you should urgently walk through my old checklist: http://atlasathome.cern.ch/forum_thread.php?id=581&postid=5350#5350 Supporting BOINC, a great concept ! ID: 29248 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1551 Credit: 10,067,673 RAC: 887	Message 29249 - Posted: 14 Mar 2017, 14:05:05 UTC - in response to Message 29248. "All ATLAS WUs still fail on my machine," A 3-core ATLAS task did well this morning, but after that I wanted to run a single-core task and that one died early. https://lhcathome.cern.ch/lhcathome/result.php?resultid=126128594 - CRITICAL Transform executor raised TransformValidationException: EVNTtoHITS got a SIGKILL signal (exit code 137) ID: 29249 · Reply Quote
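
Exit code 137 in this message fits the common shell convention that a process killed by signal N is reported as 128 + N, and 137 - 128 = 9 is SIGKILL on Linux. A minimal Python sketch of that decoding follows. The function name decode_exit_code is hypothetical, the 128 + N rule is a convention rather than something the transform is documented here to follow, and the signal names come from Python's signal module on a POSIX system.

```python
import signal


def decode_exit_code(code):
    """Return (signal_number, signal_name) for codes above 128, else None."""
    if code <= 128:
        # Ordinary exit status, not a signal under the 128 + N convention.
        return None
    signum = code - 128
    try:
        return signum, signal.Signals(signum).name
    except ValueError:
        # Signal number not known on this platform.
        return signum, "unknown"


print(decode_exit_code(137))  # the SIGKILL case from the post above
print(decode_exit_code(65))   # the plain error code seen earlier in the thread
```

A SIGKILL usually means something outside the process ended it, for example the kernel's out-of-memory handling inside the VM, though the log line quoted here does not say which.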

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,873,857 RAC: 18,686	Message 29250 - Posted: 14 Mar 2017, 14:13:37 UTC @All who currently have problems running Atlas, where most or all WUs fail: The BOINC client isn't very good at running these VM sub-projects from the new consolidated LHC@Home. The reasons vary widely, but at the moment we will have to live with it. Sure, some Atlas WUs fail, but the failure rate should stay below 10%. So it may help to suspend / de-select the other VM sub-projects for a while and test how this works. Since David opened the gates this morning I haven't had any failed Atlas job (but surely some will come). I'm running Atlas only: no CMS, no Theory, no LHCb and not even SixTrack on my Atlas crunchers. Give it a try and run only Atlas tasks for a while. Supporting BOINC, a great concept ! ID: 29250 · Reply Quote