Message boards : ATLAS application : All tasks in error
PhilTheNet

Joined: 21 Sep 14
Posts: 25
Credit: 723,818
RAC: 0
Message 29779 - Posted: 3 Apr 2017, 6:40:50 UTC

Everything worked fine until 2 days ago. Since then, all tasks end in error fairly quickly, even though nothing changed on the machine (a Mac).

2017-04-03 08:28:06 (1881): Guest Log: log_extracts:
2017-04-03 08:28:06 (1881): Guest Log: - Last 10 lines from /home/atlas01/RunAtlas/Panda_Pilot_5966_1491200476/PandaJob_3312154893_1491200483/athena_stdout.txt -
2017-04-03 08:28:06 (1881): Guest Log: PyJobTransforms.trfExe.preExecute 2017-04-03 08:23:16,519 INFO Batch/grid running - command outputs will not be echoed. Logs for EVNTtoHITS are in log.EVNTtoHITS
2017-04-03 08:28:06 (1881): Guest Log: PyJobTransforms.trfExe.preExecute 2017-04-03 08:23:16,521 INFO Now writing wrapper for substep executor EVNTtoHITS
2017-04-03 08:28:06 (1881): Guest Log: PyJobTransforms.trfExe._writeAthenaWrapper 2017-04-03 08:23:16,521 INFO Valgrind not engaged
2017-04-03 08:28:06 (1881): Guest Log: PyJobTransforms.trfExe.preExecute 2017-04-03 08:23:16,521 INFO Athena will be executed in a subshell via ['./runwrapper.EVNTtoHITS.sh']
2017-04-03 08:28:06 (1881): Guest Log: PyJobTransforms.trfExe.execute 2017-04-03 08:23:16,521 INFO Starting execution of EVNTtoHITS (['./runwrapper.EVNTtoHITS.sh'])
2017-04-03 08:28:06 (1881): Guest Log: PyJobTransforms.trfExe.execute 2017-04-03 08:25:45,366 INFO EVNTtoHITS executor returns 33
2017-04-03 08:28:06 (1881): Guest Log: PyJobTransforms.trfExe.validate 2017-04-03 08:25:46,379 ERROR Validation of return code failed: Non-zero return code from EVNTtoHITS (33) (Error code 65)
2017-04-03 08:28:06 (1881): Guest Log: PyJobTransforms.trfExe.validate 2017-04-03 08:25:46,423 INFO Scanning logfile log.EVNTtoHITS for errors
2017-04-03 08:28:06 (1881): Guest Log: PyJobTransforms.transform.execute 2017-04-03 08:25:46,450 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from EVNTtoHITS (33); Logfile error in log.EVNTtoHITS: "IOVDbSvc FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP"
2017-04-03 08:28:06 (1881): Guest Log: PyJobTransforms.transform.execute 2017-04-03 08:25:49,655 WARNING Transform now exiting early with exit code 65 (Non-zero return code from EVNTtoHITS (33); Logfile error in log.EVNTtoHITS: "IOVDbSvc FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP")
2017-04-03 08:28:06 (1881): Guest Log: - Walltime -

Any idea?
ID: 29779
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 29780 - Posted: 3 Apr 2017, 8:01:48 UTC - in response to Message 29779.  

The problem is that the database servers used by all ATLAS tasks (on the whole ATLAS grid, not just ATLAS@Home) are overloaded, so tasks are failing to connect to them to download the necessary information. This happened from time to time on the old project, and this is the first time it has happened since the consolidation into LHC. The experts are working on fixing it; more news soon.
ID: 29780
PhilTheNet

Joined: 21 Sep 14
Posts: 25
Credit: 723,818
RAC: 0
Message 29782 - Posted: 3 Apr 2017, 8:14:19 UTC - in response to Message 29780.  

Thanks for your quick reply :)
ID: 29782
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,445,220
RAC: 103,133
Message 29794 - Posted: 3 Apr 2017, 17:25:05 UTC

After several WUs had run fine this afternoon, there was one which again failed after 10 minutes:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=131789181

The next one then was okay.
ID: 29794
HerveUAE

Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29796 - Posted: 3 Apr 2017, 17:59:54 UTC

there was one which again failed after 10 minutes

It looks like a different issue though. Those that failed starting yesterday had the error:
"IOVDbSvc            FATAL Conditions database connection COOLOFL_TRT/OFLP200 cannot be opened - STOP"

The task you mentioned failed with this error:
"DetectorStore       FATAL in sysInitialize(): standard std::exception is caught"
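
For anyone who wants to compare failures the same way, here is a minimal Python sketch that pulls the FATAL lines out of a log so two tasks can be told apart at a glance. It is only an illustration: the function name `fatal_messages` is made up, and the regular expression assumes the layout seen in the two lines quoted above (a component name, whitespace, the word FATAL, then the message). Other log formats may need a different pattern.

```python
import re

# Assumed layout, based on the two lines quoted in this thread:
#   <component> <spaces> FATAL <message>
FATAL_RE = re.compile(r"(?P<component>[\w.:]+)\s+FATAL\s+(?P<message>.*)")


def fatal_messages(log_text):
    """Return a list of (component, message) pairs, one per FATAL line."""
    results = []
    for line in log_text.splitlines():
        match = FATAL_RE.search(line)
        if match:
            # Drop surrounding whitespace and any trailing quote character.
            message = match.group("message").strip().strip('"')
            results.append((match.group("component"), message))
    return results


if __name__ == "__main__":
    sample = (
        "IOVDbSvc            FATAL Conditions database connection "
        "COOLOFL_TRT/OFLP200 cannot be opened - STOP\n"
        "DetectorStore       FATAL in sysInitialize(): "
        "standard std::exception is caught\n"
    )
    for component, message in fatal_messages(sample):
        print(component, "->", message)
```

Grouping failed tasks by the component name (IOVDbSvc versus DetectorStore here) is usually enough to see whether a new failure belongs to the known database problem or is something else.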

We are the product of random evolution.
ID: 29796
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,445,220
RAC: 103,133
Message 29799 - Posted: 3 Apr 2017, 19:29:47 UTC

Ah, thanks for the hint. I didn't catch that (I was in a hurry and did not read the log carefully enough).

From my experience with ATLAS so far, there seem to be 100 or 1000 different reasons which can cause a task to fail :-(
ID: 29799
HerveUAE

Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29801 - Posted: 3 Apr 2017, 19:42:59 UTC

From my experience with ATLAS so far, there seem to be 100 or 1000 different reasons which can cause a task to fail :-(

Yep, when I joined in December last year I did not expect that volunteering for ATLAS and LHC would be so time-consuming... But I won't give up.
Instead I am planning on buying a 6950X soon ;). Well, I must admit that I'll be using those machines for my own calculations as well.
ID: 29801
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,826,878
RAC: 228,657
Message 29803 - Posted: 3 Apr 2017, 21:13:23 UTC - in response to Message 29801.  
Last modified: 3 Apr 2017, 21:15:46 UTC

Herve, the Xeons get a better RAC for a lower price, as long as performance on lightly threaded (few-core) workloads outside BOINC is not a consideration.

https://lhcathome.cern.ch/lhcathome/hosts_user.php?userid=129087

For the same price, the E5-2680 v4 has 14 cores with clocks of 2.4-2.9-3.3 GHz, and you can max out this project with 24 tasks at once. I would expect around 50% higher RAC based on my computers. My old E5-2683 @ 2-2.5-3 GHz scores better.

I would recommend 64 GB; I don't see problems with 24 concurrent WUs on my computers running only single-core tasks.
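
As a rough sanity check on that figure, the small Python sketch below works out how much RAM each task gets if 64 GB is split evenly across 24 concurrent tasks. It is only back-of-the-envelope arithmetic: the helper name `memory_per_task_gb` is made up, and the even split ignores what the operating system and the virtual machines themselves use. The actual per-task requirement is not stated in this thread.

```python
def memory_per_task_gb(total_ram_gb, concurrent_tasks):
    """Evenly divide the installed RAM across the concurrently running tasks."""
    if concurrent_tasks <= 0:
        raise ValueError("concurrent_tasks must be positive")
    return total_ram_gb / concurrent_tasks


if __name__ == "__main__":
    # 64 GB shared by 24 single-core tasks, as described in the post above.
    print(round(memory_per_task_gb(64, 24), 2))  # 2.67
```

So the recommendation leaves roughly 2.7 GB per single-core task before any system overhead is subtracted.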
ID: 29803
HerveUAE

Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 29833 - Posted: 5 Apr 2017, 18:37:11 UTC

For the same price, the E5-2680 v4 has 14 cores with clocks of 2.4-2.9-3.3 GHz, and you can max out this project with 24 tasks at once.

Thanks, Toby, for the suggestion. I remember reading a thread on ATLAS@Home where the performance of various CPUs was compared, and I thought of going for a Xeon as well.
Then I checked the specialised shops in Dubai, and nobody sells Xeon processors around here, not to mention motherboards. So I settled on the Core i7-6950X, which can easily be found in Dubai.
ID: 29833
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,826,878
RAC: 228,657
Message 29837 - Posted: 5 Apr 2017, 20:53:03 UTC - in response to Message 29833.  

The boards are easy: all(?) X99 boards support them. Finding the Xeons is trickier; I got some of mine from general computer stores online.

I'm sure you'll like it. I would say it is the most Xeon-like of all consumer processors, albeit at a crazy price.
ID: 29837
