Message boards : Number crunching : task 203055614

Klaus
Joined: 27 Aug 15
Posts: 27
Credit: 10,062,592
RAC: 4,387
Message 36274 - Posted: 6 Aug 2018, 12:52:06 UTC

Run time > 4 d
CPU time 10 d -> no credits
Can you tell me what the problem is?
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1116
Credit: 49,722,983
RAC: 14,167
Message 36278 - Posted: 6 Aug 2018, 15:29:54 UTC - in response to Message 36274.  

It looks like you are running 8-core tasks with less than 16 GB of RAM.

You got lucky with one task as far as credits go, but not with the other one, and since they took that long, all your other tasks are "Not started by deadline".

https://lhcathome.cern.ch/lhcathome/result.php?resultid=203352693

https://lhcathome.cern.ch/lhcathome/result.php?resultid=203055614

Go to your account settings and change the preferences to Max # CPUs 2 (and maybe Max # jobs too), then see if you have better luck.

You do have 8 of those tasks ready to run, so you may have to abort them, since they would probably stay 8-core tasks even after the update and a reboot. But you can try that first and see.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,466,761
RAC: 123,779
Message 36289 - Posted: 7 Aug 2018, 6:10:19 UTC

Your logs show that you throttled down your CPU:
Setting CPU throttle for VM. (60%)

LHC WUs normally become more reliable if they run at 100%.
You may want to follow Magic's suggestion and reduce the number of cores instead of throttling.
In addition, fewer cores per WU would be more efficient by design.
Klaus
Joined: 27 Aug 15
Posts: 27
Credit: 10,062,592
RAC: 4,387
Message 36295 - Posted: 7 Aug 2018, 9:01:59 UTC

Thanks for your posts, but they are not helpful for me.
8-core tasks have been running here 24 h/d for several years. Throttling between 30 and 90 % depends on room temperature and non-ATLAS tasks.
I need an explanation of what went wrong.
I do not understand: https://lhcathome.cern.ch/lhcathome/result.php?resultid=203055614
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1116
Credit: 49,722,983
RAC: 14,167
Message 36297 - Posted: 7 Aug 2018, 9:23:12 UTC - in response to Message 36295.  
Last modified: 7 Aug 2018, 9:32:50 UTC

You just had an error.

And you haven't been running these multi-core ATLAS tasks for years, since they are new (and I did the Atlas-alpha tests before they came here).

These started here in March 2017.

Since that one task you have had 6 of the usual Valids, but if you keep running 8-core ATLAS tasks with less than 16 GB of RAM, errors can happen.

(I tested MANY 8-, 6-, 4- and 2-core versions of these ATLAS tasks before they even came here to LHC, and 4 and 2 cores worked best.)

This Valid had several *pause and running* events of the VB.

This Invalid had the errors:

Guest Log: PyJobTransforms.trfValidation.scanLogFile 2018-07-31 01:23:35,436 WARNING Detected G4 exception report - activating G4 exception grabber
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.trfValidation.scanLogFile 2018-07-31 01:23:35,437 WARNING Detected G4 exception report - activating G4 exception grabber
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.trfExe.validate 2018-07-31 01:23:38,904 INFO Executor EVNTtoHITS has validated successfully
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.transform.execute 2018-07-31 01:23:38,904 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from HITSMergeAthenaMP0 (8); Logfile error in log.HITSMergeAthenaMP0: "IOError: [Errno 5] Input/output error: '/cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasOffline/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/jobOptions/SimuJobTransforms/skeleton.HITSMerge.py'"
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.transform.execute 2018-07-31 02:21:10,645 WARNING Transform now exiting early with exit code 65 (Non-zero return code from HITSMergeAthenaMP0 (8); Logfile error in log.HITSMergeAthenaMP0: "IOError: [Errno 5] Input/output error: '/cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasOffline/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/jobOptions/SimuJobTransforms/skeleton.HITSMerge.py'")
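For anyone who wants to pick the important line out of a long stderr output, the log lines above can be split into fields with a regular expression. This is only a minimal sketch: the field layout is inferred from the lines quoted in this thread, and the names `GUEST_LOG_RE` and `parse_guest_log` are made up for the illustration, not part of BOINC or ATLAS.

```python
import re

# Layout assumed from the quoted lines:
#   [host timestamp (pid): ]Guest Log: <logger> <guest timestamp> <LEVEL> <message>
# The host timestamp is optional because the first quoted line lacks it.
GUEST_LOG_RE = re.compile(
    r"^(?:(?P<host_ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): )?"
    r"Guest Log: (?P<logger>\S+) "
    r"(?P<guest_ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)

def parse_guest_log(text):
    """Return one dict per line that matches the assumed layout."""
    entries = []
    for line in text.splitlines():
        match = GUEST_LOG_RE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries

# Three lines copied from the stderr output quoted in this thread.
SAMPLE = """\
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.trfValidation.scanLogFile 2018-07-31 01:23:35,437 WARNING Detected G4 exception report - activating G4 exception grabber
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.trfExe.validate 2018-07-31 01:23:38,904 INFO Executor EVNTtoHITS has validated successfully
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.transform.execute 2018-07-31 01:23:38,904 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from HITSMergeAthenaMP0 (8); Logfile error in log.HITSMergeAthenaMP0: "IOError: [Errno 5] Input/output error: '/cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasOffline/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/jobOptions/SimuJobTransforms/skeleton.HITSMerge.py'"
"""

entries = parse_guest_log(SAMPLE)
critical = [e for e in entries if e["level"] == "CRITICAL"]
for entry in critical:
    print(entry["guest_ts"], entry["logger"], entry["message"][:80])
```

Filtering on CRITICAL shows that the simulation step (EVNTtoHITS) validated and that the failure came afterwards, in the HITS merge step, as an input/output error while reading a file under /cvmfs.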

And when you check that same task, the wingman got it Valid using 4 cores on a 16-core CPU:
Run time 8 hours 5 min 46 sec
CPU time 13 hours 48 min 58 sec
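To put the run time and CPU time figures from this thread side by side, here is a rough arithmetic sketch. It assumes that CPU time divided by run time approximates the average number of busy cores, which ignores throttling and pauses, and it uses 4 d as a lower bound for the first post's "> 4 d". The helper names are made up for the illustration.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a duration given in mixed units to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def avg_busy_cores(cpu_seconds, run_seconds):
    """CPU time / run time: a rough estimate of cores busy on average."""
    return cpu_seconds / run_seconds

# Wingman's valid result (4 cores), figures quoted above.
wingman_cpu = to_seconds(hours=13, minutes=48, seconds=58)
wingman_run = to_seconds(hours=8, minutes=5, seconds=46)

# Failed 8-core task from the first post: CPU time 10 d, run time "> 4 d".
# 4 d is only a lower bound, so the ratio below is an upper bound.
failed_cpu = to_seconds(days=10)
failed_run = to_seconds(days=4)

print("wingman avg busy cores:", round(avg_busy_cores(wingman_cpu, wingman_run), 2))
print("failed task avg busy cores (at most):", avg_busy_cores(failed_cpu, failed_run))
print("CPU time ratio failed/wingman:", round(failed_cpu / wingman_cpu, 1))
```

On these numbers the wingman kept about 1.7 of its 4 cores busy on average, while the failed task kept at most 2.5 of 8 cores busy and used roughly 17 times as much CPU time for the same work unit.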
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36313 - Posted: 7 Aug 2018, 18:10:36 UTC - in response to Message 36295.  

I need an explanation, what was going wrong.
I do not understand: https://lhcathome.cern.ch/lhcathome/result.php?resultid=203055614

It might just be a glitch that nobody can explain. So far the 30% throttling is the best (most likely) explanation, but maybe it's not the correct one.
Everybody gets an unexplainable error once in a while. If throttling and 8-core tasks yield 90%+ success for you, then that's very good, considering some volunteers get 99% failures :)
The advice to run 2 x 2-core tasks is good advice because 2-core tasks are more efficient. Maybe continue for a while with 8-core and, if it doesn't work out well, then consider 2-core.

Also, even if an ATLAS result is marked "valid", it is possible that it did not return a HITS file. So check the stderr output and use your web browser's search function to search for "hits". If you see "error 65", then no HITS file was returned. You still get credits, but it was wasted effort and the task will be resent to another host.
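The same check can be done on a saved copy of the stderr output instead of searching in the browser. This is just a sketch under assumptions: it looks for the wording "exit code 65" as it appears in the log quoted earlier in this thread (other task versions may word it differently), and `hits_status` is a made-up helper name.

```python
import re

def hits_status(stderr_text):
    """Report whether the text mentions HITS and which exit codes it reports."""
    exit_codes = [int(code) for code in re.findall(r"exit code (\d+)", stderr_text)]
    return {
        "mentions_hits": "hits" in stderr_text.lower(),
        "exit_codes": exit_codes,
        "exit_65": 65 in exit_codes,
    }

# Shortened excerpt of a line quoted earlier in this thread.
sample = (
    "2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.transform.execute "
    "2018-07-31 02:21:10,645 WARNING Transform now exiting early with exit code 65 "
    "(Non-zero return code from HITSMergeAthenaMP0 (8)"
)

status = hits_status(sample)
if status["exit_65"]:
    print("exit code 65 found: probably no HITS file was returned")
else:
    print("no exit code 65 found in this text")
```

A result without "exit code 65" is not proof that a HITS file was returned; the sketch only automates the text search described above.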
