Message boards : ATLAS application : New app version 1.01
Joined: 14 Jan 10 Posts: 1280 Credit: 8,491,652 RAC: 2,067
All the error tasks mentioned have the same FATAL condition: database connection COOLOFL_LAR/OFLP200 cannot be opened - STOP
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0
> database connection COOLOFL_LAR/OFLP200 cannot be opened

Well spotted! So does it mean that, indeed, some port needs to be allowed through my firewall / router / ISP?
Joined: 2 Sep 04 Posts: 453 Credit: 193,569,815 RAC: 10,128
> database connection COOLOFL_LAR/OFLP200 cannot be opened

Yes, you have to do this, see http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use

But this error does not seem to be a problem on your side: I have finished about 60 of the 1.01 tasks successfully, but 6 failed. In those failed ones I can find this error too, so it seems to be a problem with the outside server or some configuration parameters of these WUs. Not all failed WUs have this error code:

> database connection COOLOFL_LAR/OFLP200 cannot be opened

Supporting BOINC, a great concept !
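A quick way to test from your own side whether outbound connections get through is a short socket probe. This is only a minimal sketch in Python; the two endpoints below are examples (the project's own web server on ports 80 and 443), not the authoritative list of ports the VMs need - the FAQ linked above has that list.

```python
import socket

# Example endpoints only: the project's web server on 80/443.
# The full list of ports the ATLAS VMs need is in the FAQ linked above.
ENDPOINTS = [
    ("lhcathome.cern.ch", 80),
    ("lhcathome.cern.ch", 443),
]

for host, port in ENDPOINTS:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"OK   {host}:{port}")
    except OSError as exc:
        print(f"FAIL {host}:{port} -> {exc}")
```

If a line prints FAIL, that connection is being blocked somewhere between your host and the internet.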
Joined: 14 Jan 10 Posts: 1280 Credit: 8,491,652 RAC: 2,067
Another workunit reached the state "Too many errors (may have bug)": https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60039893

The errors of the 4 tasks:
- DetectorStore FATAL in sysInitialize(): standard std::exception is caught
- IOVDbSvc FATAL Conditions database connection COOLOFL_LAR/OFLP200 cannot be opened - STOP
- IOVDbSvc FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP
- DetectorStore FATAL in sysInitialize(): standard std::exception is caught
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0
Hi, we are suffering infrastructure problems due to tasks running on the ATLAS grid which are overloading the database servers. The host ccsqfatlasli01.in2p3.fr is one of the servers that WUs contact while running to get conditions data (basically data describing the geometry and status of the ATLAS detector), and this service is not working at the moment. I'm checking whether we should wait for this service to be recovered or whether there is an alternative one we can use.
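For anyone who wants to check from their own side whether that host is reachable at all, a DNS lookup plus a TCP connect is enough. A minimal sketch; the port is an assumption, since the thread doesn't say which port the conditions service listens on:

```python
import socket

HOST = "ccsqfatlasli01.in2p3.fr"  # conditions-data server named above
PORT = 80                          # assumed port; the thread doesn't name one

try:
    addr = socket.gethostbyname(HOST)  # does the name resolve at all?
    print(f"{HOST} resolves to {addr}")
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"TCP connect to {HOST}:{PORT} succeeded")
except OSError as exc:
    print(f"Check failed: {exc}")
```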
Joined: 14 Jan 10 Posts: 1280 Credit: 8,491,652 RAC: 2,067
New tasks were added to the ATLAS queue and I got 3, all ending in a Validate Error:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=125566663 - FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP
https://lhcathome.cern.ch/lhcathome/result.php?resultid=125566900 - FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP
https://lhcathome.cern.ch/lhcathome/result.php?resultid=125566914 - FATAL Conditions database connection COOLOFL_CALO/OFLP200 cannot be opened - STOP

I set the client to No New Work.
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0
The issues at ATLAS@Home and here apparently have the same root cause: server overload. Maybe the team at LHC should ask the users to stop fetching new tasks until those servers have recovered.

We are the product of random evolution.
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0
It seems the affected services were brought back online; I see a few successful WUs coming in now.
Joined: 2 May 07 Posts: 2099 Credit: 159,815,978 RAC: 139,751
Sorry, this WU https://lhcathome.cern.ch/lhcathome/result.php?resultid=125577654 finished after 20 min duration and 6 min CPU:

- Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_5901_1489415914/PandaJob_3281023545_1489415928/athena_stdout.txt -
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-13 15:40:27,082 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-13 15:40:27,086 INFO Now writing wrapper for substep executor EVNTtoHITS
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2017-03-13 15:40:27,086 INFO Valgrind not engaged
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-13 15:40:27,086 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.execute 2017-03-13 15:40:27,086 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.execute 2017-03-13 15:55:50,604 INFO EVNTtoHITS executor returns 65
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.validate 2017-03-13 15:55:51,805 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.validate 2017-03-13 15:55:51,841 INFO Scanning logfile log.EVNTtoHITS for errors
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.transform.execute 2017-03-13 15:55:52,375 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.transform.execute 2017-03-13 15:55:55,732 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
2017-03-13 15:59:34 (4580): Guest Log: - Walltime -
2017-03-13 15:59:34 (4580): Guest Log: JobRetrival=0, StageIn=6, Execution=1023, StageOut=0, CleanUp=9
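The transform already does this kind of scan itself ("Scanning logfile log.EVNTtoHITS for errors"). If you want to pull the high-severity lines out of a log like the athena_stdout.txt above yourself, here is a minimal sketch; the file name is taken from the log above, and the path on your machine will differ:

```python
import re
import sys

# Print only FATAL/CRITICAL/ERROR lines from an ATLAS log file.
# Usage: python scan_log.py athena_stdout.txt
PATTERN = re.compile(r"\b(FATAL|CRITICAL|ERROR)\b")

with open(sys.argv[1], errors="replace") as fh:
    for lineno, line in enumerate(fh, 1):
        if PATTERN.search(line):
            print(f"{lineno}: {line.rstrip()}")
```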
Joined: 4 Jul 12 Posts: 2 Credit: 2,954,047 RAC: 1
I cannot run more than 3 WUs at the same time.

<app_config>
<app_version>
<app_name>ATLAS</app_name>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<avg_ncpus>4.000000</avg_ncpus>
<cmdline>--memory_size_mb 4600</cmdline>
</app_version>
</app_config>

max memory usage when active: 117897.91 MB
max memory usage when idle: 129687.70 MB
max disk usage: 200.70 GB

Processor: 36x Intel(R) Xeon
Memory: 127.93 GB physical, 127.93 GB virtual
Disk: 232.33 GB total, 182.25 GB free
Win 10 x64 Pro, BOINC 7.6.33 x64, VirtualBox 5.1.14
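For what it's worth, a quick back-of-the-envelope check of the usual per-resource bounds, using the figures from the post above. This is only a sketch: the suggestion that disk is the limiter rests on the usual BOINC client behaviour of holding back tasks whose projected disk use doesn't fit, which is an assumption here, not something confirmed in the thread.

```python
# Rough concurrency bounds for the host described above (figures from the post).
cores, ncpus_per_task = 36, 4.0            # avg_ncpus from the app_config
ram_gb, ram_per_task_gb = 127.93, 4.6      # --memory_size_mb 4600 per VM

print("CPU-bound tasks:", int(cores // ncpus_per_task))    # 9
print("RAM-bound tasks:", int(ram_gb // ram_per_task_gb))  # 27

# Neither bound explains a cap of 3, so disk is the likely limiter:
# only 182.25 GB are free, against a reported max disk usage of 200.70 GB.
# How much the client actually reserves per task before starting it is an
# assumption; check the BOINC event log for disk-space messages to confirm.
```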
Joined: 27 Sep 08 Posts: 807 Credit: 652,303,095 RAC: 284,891
What do you have your max number of jobs set to? If it is set to 24, then you are already at the maximum for the project.
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0
Hi all, the infrastructure problems were fixed yesterday, so most WUs should succeed now. Sorry for the inconvenience.
Joined: 4 Jul 12 Posts: 2 Credit: 2,954,047 RAC: 1
> What do you have your max number of jobs set to? If it is set to 24, then you are already at the maximum for the project.

No limit.
Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0
I completed one 1.01 but got a validate error.
Tullio
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0
I think there may still be some other issue. All ATLAS WUs still fail on my machine, and the following 2 WUs also failed on another machine:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60483240
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60460521
Maybe this is the root cause of the problem that led to the servers being overloaded over the weekend.

We are the product of random evolution.
Joined: 2 Sep 04 Posts: 453 Credit: 193,569,815 RAC: 10,128
> All ATLAS WUs still fail on my machine,

Sorry, but then you have a problem with your machine or your network. I don't know if you can access this list, but I have already crunched a lot of ATLAS 1.01 tasks successfully since this morning: https://lhcathome.cern.ch/lhcathome/results.php?userid=555&offset=0&show_names=0&state=4&appid=14

This WU https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60460521 is now on one of my machines; I have bumped it up to be the next one crunched, so let's see later what happens. If really all ATLAS WUs on your machine(s) are failing, you should urgently take a walk through my old checklist: http://atlasathome.cern.ch/forum_thread.php?id=581&postid=5350#5350

Supporting BOINC, a great concept !
Joined: 14 Jan 10 Posts: 1280 Credit: 8,491,652 RAC: 2,067
> All ATLAS WUs still fail on my machine,

A 3-core ATLAS task did well this morning, but afterwards I wanted to run a single-core one and that one died early:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=126128594
- CRITICAL Transform executor raised TransformValidationException: EVNTtoHITS got a SIGKILL signal (exit code 137)
Joined: 2 Sep 04 Posts: 453 Credit: 193,569,815 RAC: 10,128
@all who currently have problems running ATLAS, where most or all WUs fail: the BOINC client isn't very good at running these VM sub-projects from the new consolidated LHC@Home. The reasons vary widely, but for the moment we will have to live with it. Sure, some ATLAS WUs fail, but the rate should stay below 10%. So it may help if you suspend / deselect the other VM sub-projects for a while and test how that works. Since David opened the gates this morning I haven't had a single failed ATLAS job (though surely some will come). I'm running ATLAS only: no CMS, no Theory, no LHCb, and not even SixTrack on my ATLAS crunchers. Give it a try and run only ATLAS tasks for a while.

Supporting BOINC, a great concept !