task 203055614
Author | Message |
---|---|
Joined: 27 Aug 15 Posts: 27 Credit: 10,062,592 RAC: 4,387 |
Run time > 4 d, CPU time 10 d -> no credits. Can you tell me what the problem is? |
Joined: 24 Oct 04 Posts: 1116 Credit: 49,722,983 RAC: 14,167 |
It looks like you are running 8-core tasks with less than 16 GB of RAM. You got lucky with one task as far as credits go, but not with the other one, and since they took that long, all your other tasks are "Not started by deadline". https://lhcathome.cern.ch/lhcathome/result.php?resultid=203352693 https://lhcathome.cern.ch/lhcathome/result.php?resultid=203055614 Go to your account settings, change your preferences to Max # CPUs 2 and see if you have better luck (and maybe change Max # jobs too). You do have 8 of those tasks ready to run, so you may have to abort them, since they would probably stay 8-core tasks even after the update and a reboot, but you can try that first and see. |
Joined: 15 Jun 08 Posts: 2401 Credit: 225,466,761 RAC: 123,779 |
Your logs show that you throttled down your CPU: "Setting CPU throttle for VM. (60%)". LHC WUs normally become more reliable if they run at 100%. You may want to follow Magic's suggestion and reduce the number of cores instead of throttling. In addition, fewer cores per WU would be more efficient by design. |
Joined: 27 Aug 15 Posts: 27 Credit: 10,062,592 RAC: 4,387 |
Thanks for your posts, but they are not helpful for me. 8-core tasks have been running here for several years, 24 h/d. Throttling between 30 and 90% depends on room temperature and non-ATLAS tasks. I need an explanation of what went wrong. I do not understand: https://lhcathome.cern.ch/lhcathome/result.php?resultid=203055614 |
Joined: 24 Oct 04 Posts: 1116 Credit: 49,722,983 RAC: 14,167 |
You just had an error. And you haven't been running these multi-core Atlas tasks for years, since they are new (and I did the Atlas-alpha tests before they came here). These started here in March 2017. Since that one task you have had 6 of the usual Valids, but if you keep running 8-core Atlas tasks with less than 16 GB of RAM then errors can happen. (I tested MANY 8-, 6-, 4- and 2-core versions of these Atlas tasks before they even came here to LHC, and 4 and 2 cores worked best.)
This Valid had several *pause and running* of the VB.
This Invalid had the errors:
Guest Log: PyJobTransforms.trfValidation.scanLogFile 2018-07-31 01:23:35,436 WARNING Detected G4 exception report - activating G4 exception grabber
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.trfValidation.scanLogFile 2018-07-31 01:23:35,437 WARNING Detected G4 exception report - activating G4 exception grabber
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.trfExe.validate 2018-07-31 01:23:38,904 INFO Executor EVNTtoHITS has validated successfully
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.transform.execute 2018-07-31 01:23:38,904 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from HITSMergeAthenaMP0 (8); Logfile error in log.HITSMergeAthenaMP0: "IOError: [Errno 5] Input/output error: '/cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasOffline/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/jobOptions/SimuJobTransforms/skeleton.HITSMerge.py'"
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.transform.execute 2018-07-31 02:21:10,645 WARNING Transform now exiting early with exit code 65 (Non-zero return code from HITSMergeAthenaMP0 (8); Logfile error in log.HITSMergeAthenaMP0: "IOError: [Errno 5] Input/output error: '/cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasOffline/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/jobOptions/SimuJobTransforms/skeleton.HITSMerge.py'")
And when you check that same task, the wingman got it Valid using 4 cores on a 16-core CPU: run time 8 hours 5 min 46 sec, CPU time 13 hours 48 min 58 sec. |
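For anyone who would rather not read a long stderr page by eye, here is a minimal sketch of how the WARNING and CRITICAL entries could be pulled out of "Guest Log:" lines like the ones quoted above. The function name `find_problems` and the regular expression are my own assumptions based only on the line layout visible in this thread, not anything provided by BOINC or ATLAS, and the last sample line is shortened from the original.

```python
import re

# Assumed layout of the inner entry, taken from the lines quoted in this thread:
# "Guest Log: <logger> <date> <time,ms> <LEVEL> <message>"
ENTRY = re.compile(
    r"Guest Log:\s+(?P<logger>\S+)\s+"
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<message>.*)"
)


def find_problems(stderr_text, levels=("WARNING", "CRITICAL")):
    """Return (level, logger, message) for each Guest Log entry at the given levels."""
    found = []
    for line in stderr_text.splitlines():
        match = ENTRY.search(line)
        if match and match.group("level") in levels:
            found.append(
                (match.group("level"), match.group("logger"), match.group("message"))
            )
    return found


# Sample lines copied from the post above (the CRITICAL message is shortened).
sample = """\
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.trfValidation.scanLogFile 2018-07-31 01:23:35,437 WARNING Detected G4 exception report - activating G4 exception grabber
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.trfExe.validate 2018-07-31 01:23:38,904 INFO Executor EVNTtoHITS has validated successfully
2018-08-01 19:35:52 (3336): Guest Log: PyJobTransforms.transform.execute 2018-07-31 01:23:38,904 CRITICAL Transform executor raised TransformValidationException: Non-zero return code from HITSMergeAthenaMP0 (8)
"""

for level, logger, message in find_problems(sample):
    print(level, logger, message[:60])
```

The INFO line is skipped on purpose, so only the entries that might explain a failure are left. Paste the stderr text of your own task in place of `sample` to try it.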
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
"I need an explanation, what was going wrong." It might be just a glitch that nobody can explain. So far the 30% throttling is the most likely explanation, but it may not be the correct one. Everybody gets an unexplainable error once in a while. If throttling and 8-core tasks yield 90%+ success for you, then that's very good, considering some volunteers get 99% failures :) The advice to run 2 x 2-core tasks is good because 2-core tasks are more efficient. Maybe continue for a while with 8-core, and if it doesn't work out well then consider 2-core. Also, even if an ATLAS result is marked "valid", it is possible that it did not return a HITS file. So check the stderr output and use your web browser's search function to search for "hits". If you see "error 65" then no HITS file was returned. You still get credits, but it was wasted effort and the task will be resent to another host. |
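The check described above can also be done with a few lines of Python instead of the browser search. This is only a rough sketch: the name `check_stderr` is made up for illustration, and the patterns are assumptions taken from this thread ("exit code 65" as it appears in the quoted log, "error 65" as worded in the post), so other tasks may phrase the failure differently. The sample line is shortened from the log quoted earlier.

```python
import re

# Assumption: the failure shows up either as "exit code 65" (as in the log
# quoted in this thread) or as "error 65" (the wording used in the post above).
EXIT_65 = re.compile(r"\b(?:exit code|error)\s+65\b", re.IGNORECASE)


def check_stderr(stderr_text):
    """Report whether the stderr text mentions HITS and whether it shows code 65."""
    hits_lines = [ln for ln in stderr_text.splitlines() if "hits" in ln.lower()]
    return {
        "mentions_hits": bool(hits_lines),
        "exit_65": bool(EXIT_65.search(stderr_text)),
        "hits_lines": hits_lines,
    }


# Shortened version of the last log line quoted in this thread.
sample = (
    "Guest Log: PyJobTransforms.transform.execute 2018-07-31 02:21:10,645 "
    "WARNING Transform now exiting early with exit code 65 "
    "(Non-zero return code from HITSMergeAthenaMP0 (8))"
)

report = check_stderr(sample)
print(report["mentions_hits"], report["exit_65"])
```

If `exit_65` comes back true, the task most likely returned no HITS file even if the result page says "valid". A mention of "hits" alone proves nothing, so look at the lines in `hits_lines` before drawing a conclusion.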
©2024 CERN