Message boards : ATLAS application : New app version 1.01

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1124
Credit: 6,910,251
RAC: 1,174
Message 29199 - Posted: 12 Mar 2017, 6:36:54 UTC

All mentioned error tasks have the same FATAL condition:

database connection COOLOFL_LAR/OFLP200 cannot be opened - STOP
ID: 29199
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29200 - Posted: 12 Mar 2017, 8:52:50 UTC

database connection COOLOFL_LAR/OFLP200 cannot be opened

Well spotted!
So does it mean that, indeed, some port needs to be allowed through my firewall / router / ISP provider?
ID: 29200
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 437
Credit: 117,893,361
RAC: 2,337
Message 29201 - Posted: 12 Mar 2017, 9:02:52 UTC - in response to Message 29200.  
Last modified: 12 Mar 2017, 9:03:12 UTC

database connection COOLOFL_LAR/OFLP200 cannot be opened

Well spotted!
So does it mean that, indeed, some port needs to be allowed through my firewall / router / ISP provider?

Yes, you have to do this, see http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use

But this error does not seem to be a problem on your side: I have finished about 60 of the 1.01 tasks successfully, but 6 failed. In the failed ones I can find this error too, so it seems to be a problem with the outside server or with some config parameters of these WUs.

Not all failed WUs have this error code: database connection COOLOFL_LAR/OFLP200 cannot be opened
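
On the firewall question: a quick way to check from the affected machine whether an outbound connection gets through is a plain TCP connect test. The sketch below is only an illustration. `can_connect` is a made-up helper name, and the hosts and ports that ATLAS tasks actually need are listed on the page linked above, not here.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration: it only proves that the port is
    reachable from this machine, not that the service behind it works.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


# Example call (not executed here, since it needs network access):
#   can_connect("lhcathome.cern.ch", 443)
```

The example call targets the HTTPS port of the project site used in the links in this thread. If it returns False while a browser on another network can open the site, the firewall, router or ISP is a likely suspect.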


Supporting BOINC, a great concept !
ID: 29201
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1124
Credit: 6,910,251
RAC: 1,174
Message 29206 - Posted: 12 Mar 2017, 11:50:58 UTC

Another workunit reached the state Too many errors (may have bug)

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60039893

The errors of 4 tasks:
- DetectorStore FATAL in sysInitialize(): standard std::exception is caught
- IOVDbSvc FATAL Conditions database connection COOLOFL_LAR/OFLP200 cannot be opened - STOP
- IOVDbSvc FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP
- DetectorStore FATAL in sysInitialize(): standard std::exception is caught
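
For anyone who wants to tally these errors across several task logs, here is a minimal Python sketch. It assumes the log lines keep the wording quoted above ("database connection NAME cannot be opened"); the function name `tally_failed_connections` is made up for this example.

```python
import re
from collections import Counter

# Matches the conditions database name in lines like the ones quoted above.
PATTERN = re.compile(r"database connection (\S+) cannot be opened")


def tally_failed_connections(lines):
    """Count how often each conditions database connection failed to open."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The four error lines from the workunit above.
sample = [
    "DetectorStore FATAL in sysInitialize(): standard std::exception is caught",
    "IOVDbSvc FATAL Conditions database connection COOLOFL_LAR/OFLP200 cannot be opened - STOP",
    "IOVDbSvc FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP",
    "DetectorStore FATAL in sysInitialize(): standard std::exception is caught",
]
counts = tally_failed_connections(sample)
for name, n in sorted(counts.items()):
    print(name, n)
```

Lines without the pattern, such as the DetectorStore ones, are simply skipped, so the same function can be fed a whole stderr log.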
ID: 29206
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 374
Credit: 13,561,875
RAC: 7,621
Message 29216 - Posted: 13 Mar 2017, 8:25:23 UTC - in response to Message 29206.  

Hi,

We are suffering infrastructure problems due to tasks running on the ATLAS grid which are overloading database servers.

This host, ccsqfatlasli01.in2p3.fr, is one of the servers that WUs contact while running to get conditions data (basically data describing the geometry and status of the ATLAS detector), and this service is not working at the moment. I'm checking whether we should wait for this service to be recovered or whether there is an alternative one we can use.
ID: 29216
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1124
Credit: 6,910,251
RAC: 1,174
Message 29218 - Posted: 13 Mar 2017, 9:29:54 UTC

New tasks were added to the ATLAS-queue and I got 3 all ending in Validate Error.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=125566663 - FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP
https://lhcathome.cern.ch/lhcathome/result.php?resultid=125566900 - FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP
https://lhcathome.cern.ch/lhcathome/result.php?resultid=125566914 - FATAL Conditions database connection COOLOFL_CALO/OFLP200 cannot be opened - STOP

I set the client to No New Work
ID: 29218
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 437
Credit: 117,893,361
RAC: 2,337
Message 29221 - Posted: 13 Mar 2017, 11:51:52 UTC - in response to Message 29218.  

New tasks were added to the ATLAS-queue and I got 3 all ending in Validate Error.

Same here


Supporting BOINC, a great concept !
ID: 29221
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29223 - Posted: 13 Mar 2017, 12:44:21 UTC

The issues at ATLAS@Home and here apparently have the same root cause: server overload. Maybe the team at LHC should ask the users to stop getting new tasks until those servers have recovered.
We are the product of random evolution.
ID: 29223
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 437
Credit: 117,893,361
RAC: 2,337
Message 29225 - Posted: 13 Mar 2017, 14:08:20 UTC

Something seems to have changed: I have an Atlas task that has been running for 36 minutes.


Supporting BOINC, a great concept !
ID: 29225
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 374
Credit: 13,561,875
RAC: 7,621
Message 29226 - Posted: 13 Mar 2017, 14:48:47 UTC - in response to Message 29225.  

It seems the affected services were brought back online; I see a few successful WUs coming in now.
ID: 29226
maeax
Joined: 2 May 07
Posts: 1590
Credit: 67,742,986
RAC: 238,415
Message 29227 - Posted: 13 Mar 2017, 15:09:15 UTC

Sorry,

this WU https://lhcathome.cern.ch/lhcathome/result.php?resultid=125577654

finished after 20 min of run time and 6 min of CPU time:

- Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_5901_1489415914/PandaJob_3281023545_1489415928/athena_stdout.txt -
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-13 15:40:27,082 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-13 15:40:27,086 INFO Now writing wrapper for substep executor EVNTtoHITS
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2017-03-13 15:40:27,086 INFO Valgrind not engaged
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.preExecute 2017-03-13 15:40:27,086 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.execute 2017-03-13 15:40:27,086 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.execute 2017-03-13 15:55:50,604 INFO EVNTtoHITS executor returns 65
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.validate 2017-03-13 15:55:51,805 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (65) (Error code 65)
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.trfExe.validate 2017-03-13 15:55:51,841 INFO Scanning logfile log.EVNTtoHITS for errors
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.transform.execute 2017-03-13 15:55:52,375 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider"
2017-03-13 15:59:34 (4580): Guest Log: PyJobTransforms.transform.execute 2017-03-13 15:55:55,732 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (65); Logfile error in log.EVNTtoHITS: "AthMpEvtLoopMgr FATAL makePool failed for AthMpEvtLoopMgr.SharedEvtQueueProvider")
2017-03-13 15:59:34 (4580): Guest Log: - Walltime -
2017-03-13 15:59:34 (4580): Guest Log: JobRetrival=0, StageIn=6, Execution=1023, StageOut=0, CleanUp=9
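
The walltime summary on the last log line can be split into its phases with a few lines of Python. This is only a sketch: `parse_walltime` is a made-up name, and treating the numbers as seconds is an assumption, since the log does not state the unit.

```python
import re


def parse_walltime(line):
    """Turn 'Name=int, Name=int, ...' into a dict of phase -> value."""
    return {name: int(value) for name, value in re.findall(r"(\w+)=(\d+)", line)}


# Last line of the log excerpt above (phase names spelled as in the log).
line = (
    "2017-03-13 15:59:34 (4580): Guest Log: "
    "JobRetrival=0, StageIn=6, Execution=1023, StageOut=0, CleanUp=9"
)
phases = parse_walltime(line)
total = sum(phases.values())

# Unit assumed to be seconds; the log itself does not say.
print(phases)
print("total:", total)
```

With the line above, almost all of the recorded time falls into the Execution phase, which is where the EVNTtoHITS step returned code 65.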
ID: 29227
Egon Olsen
Joined: 4 Jul 12
Posts: 2
Credit: 2,949,691
RAC: 0
Message 29231 - Posted: 13 Mar 2017, 18:16:48 UTC
Last modified: 13 Mar 2017, 18:20:59 UTC

I cannot run more than 3 WUs at the same time.


<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>4.000000</avg_ncpus>
    <cmdline>--memory_size_mb 4600</cmdline>
  </app_version>
</app_config>



max memory usage when active: 117897.91MB
max memory usage when idle: 129687.70MB
max disk usage: 200.70GB

Processor: 36 Intel(R) Xeon
Memory: 127.93 GB physical, 127.93 GB virtual
Disk: 232.33 GB total, 182.25 GB free

Win 10 x64 Pro
Boinc 7.6.33 x64
VirtualBox 5.1.14
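
As a rough back-of-the-envelope check of the numbers above, the Python sketch below estimates how many tasks would fit by CPU and by memory alone. It assumes each task takes the 4 cores from avg_ncpus and the 4600 MB from --memory_size_mb, and it ignores how the BOINC client really schedules work as well as any project-side job limits; `max_concurrent_tasks` is a made-up name.

```python
def max_concurrent_tasks(total_cores, cores_per_task, usable_mem_mb, mem_per_task_mb):
    """Return (limit by CPU, limit by memory, the smaller of the two)."""
    by_cpu = total_cores // cores_per_task
    by_mem = int(usable_mem_mb // mem_per_task_mb)
    return by_cpu, by_mem, min(by_cpu, by_mem)


# Figures taken from the post above: 36 processors, 4 cores per task,
# 117897.91 MB usable when active, 4600 MB per task.
by_cpu, by_mem, limit = max_concurrent_tasks(36, 4, 117897.91, 4600)
print(by_cpu, by_mem, limit)  # 9 25 9
```

Under these assumptions neither cores nor RAM would explain a cap of 3, so the limit may come from somewhere else, for example a job-count setting.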
ID: 29231
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 728
Credit: 494,724,685
RAC: 308,305
Message 29238 - Posted: 14 Mar 2017, 7:05:36 UTC - in response to Message 29231.  

What do you have your max number of jobs set to? If you set it to 24, then you are at the max for the project.
ID: 29238
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 374
Credit: 13,561,875
RAC: 7,621
Message 29239 - Posted: 14 Mar 2017, 8:29:21 UTC

Hi all,

The infrastructure problems were fixed yesterday, so most WUs should succeed now. Sorry for the inconvenience.
ID: 29239
Egon Olsen
Joined: 4 Jul 12
Posts: 2
Credit: 2,949,691
RAC: 0
Message 29242 - Posted: 14 Mar 2017, 10:47:22 UTC - in response to Message 29238.  

What do you have your max number of jobs set to? If you set it to 24, then you are at the max for the project.


No Limit
ID: 29242
tullio
Joined: 19 Feb 08
Posts: 704
Credit: 4,274,121
RAC: 657
Message 29243 - Posted: 14 Mar 2017, 10:55:38 UTC

I completed one 1.01 but got a validate error.
Tullio
ID: 29243
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29247 - Posted: 14 Mar 2017, 13:39:05 UTC

I think there may still be some other issue. All ATLAS WUs still fail on my machine, and the following 2 WUs also failed on another machine:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60483240
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60460521
Maybe this is the root cause of a problem that led to the servers being overloaded over the weekend.
We are the product of random evolution.
ID: 29247
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 437
Credit: 117,893,361
RAC: 2,337
Message 29248 - Posted: 14 Mar 2017, 13:58:21 UTC - in response to Message 29247.  

All ATLAS WUs still fail on my machine,

Sorry, but then you have a problem on your machine or your network.

I don't know if you can access this list, but I have already crunched a lot of Atlas 1.01 tasks successfully since this morning: https://lhcathome.cern.ch/lhcathome/results.php?userid=555&offset=0&show_names=0&state=4&appid=14

This WU https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=60460521 is now on one of my machines. I have bumped it up to be the next to be crunched; let's see later what happens.

If really all Atlas WUs on your machine(s) are failing, you should urgently walk through my old checklist: http://atlasathome.cern.ch/forum_thread.php?id=581&postid=5350#5350


Supporting BOINC, a great concept !
ID: 29248
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1124
Credit: 6,910,251
RAC: 1,174
Message 29249 - Posted: 14 Mar 2017, 14:05:05 UTC - in response to Message 29248.  

All ATLAS WUs still fail on my machine,

A 3-core ATLAS did well this morning, but thereafter I wanted to run a single core and that one died early.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=126128594 - CRITICAL Transform executor raised TransformValidationException: EVNTtoHITS got a SIGKILL signal (exit code 137
ID: 29249
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 437
Credit: 117,893,361
RAC: 2,337
Message 29250 - Posted: 14 Mar 2017, 14:13:37 UTC

@All who currently have problems running Atlas where most or all WUs fail:

The BOINC client isn't very good at running these VM subprojects from the new consolidated LHC@Home. The reasons vary widely, but for the moment we will have to live with it.

Sure, some Atlas WUs fail, but this rate should stay below 10%.

So it may help if you suspend or deselect the other VM subprojects for a while and test how this works.

Since David opened the gates this morning I haven't had any failed Atlas job (but surely some will come).

I'm running Atlas only, no CMS, no Theory, no LHCb and not even Sixtrack on my Atlas crunchers. Give it a try and run only Atlas tasks for a while.


Supporting BOINC, a great concept !
ID: 29250