Message boards : Number crunching : Sixtrack has heartbeat failure -- hangs
Joined: 28 Nov 09 Posts: 17 Credit: 3,974,186 RAC: 0
A number of recent WUs processed to 90%+ and then hung, with the time remaining showing "---". The SIXTRACK executable shows as still running but is not consuming any CPU time. STDERR.TXT in the various slot directories reports that the heartbeat was not detected. If you shut down BOINC and restart it, the process restarts from the last checkpoint and then does exactly the same thing all over again.

00:04:04 (11060): No heartbeat from client for 30 sec - exiting
01:11:10 (4228): No heartbeat from client for 30 sec - exiting
01:43:52 (8532): No heartbeat from client for 30 sec - exiting
01:49:31 (8276): No heartbeat from client for 30 sec - exiting

This has happened to the following WUs so far:

w-b3_6000_job.HLLHC_b3_6000.0732__26__s__62.31_60.32__13_15__5__35.2942_1_sixvf_boinc6769_1
w-b3_16000_job.HLLHC_b3_16000.0732__30__s__62.31_60.32__15_17__5__35.2942_1_sixvf_boinc7819_1
w-b3_10000_job.HLLHC_b3_10000.0732__28__s__62.31_60.32__17_19__5__17.6471_1_sixvf_boinc7367_1
w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__21.1765_1_sixvf_boinc3514_0
w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__19.4118_1_sixvf_boinc3513_0
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Thank you Brian; that is very useful input. We are seeing many, many "disk limit exceeded" errors, so maybe SixTrack is in a loop. We shall see.
Joined: 17 Jul 05 Posts: 102 Credit: 542,016 RAC: 0
Heartbeat errors are often the result of excessive HDD I/O. A system can usually survive one workunit with more than 30 files being written constantly. But with four concurrent workunits, each writing a lot of data, especially if each has to jump between 128 or more open files all the time, the load renders the HDD write cache useless. If BOINC then tries to evaluate the HDD usage (in order to check the limits), or simply wants to copy and rewrite client_state.xml, it can end up at the back of quite a long HDD access queue. While it is stuck in that queue, it does nothing but wait, and it has no chance to refresh the timestamp in shared memory.

I see a similar effect when I run too many concurrent Malaria WUs, which write a lot of data for checkpointing (they even seem to use the zlib functions gzopen/gzwrite). For Malaria it helped to reduce the checkpoint frequency *), but there is a major disadvantage: if it does not fix the heartbeat bug, your results will be thrown back much further on each restart, so watch your host carefully for a while after changing the setting.

*) e.g. "Tasks checkpoint to disk at most every 600 seconds"
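The timeout behaviour described above can be sketched as a small model: one side periodically refreshes a shared timestamp, and the other side gives up once that timestamp has been stale for 30 seconds. This is an illustrative Python sketch only; the class and method names are hypothetical and do not correspond to the actual BOINC API.

```python
import time

# "No heartbeat from client for 30 sec - exiting" corresponds to this timeout.
HEARTBEAT_TIMEOUT = 30.0  # seconds

class HeartbeatMonitor:
    """Hypothetical model of the heartbeat check: the client refreshes a
    timestamp (in BOINC's case, via shared memory); the science app exits
    once the timestamp has been stale for HEARTBEAT_TIMEOUT seconds."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def refresh(self):
        # Called whenever a client heartbeat is observed. If the client is
        # stuck in a long disk-access queue, this never runs in time.
        self.last_heartbeat = time.monotonic()

    def client_alive(self):
        # False once the heartbeat is stale; the app would then exit.
        return (time.monotonic() - self.last_heartbeat) < HEARTBEAT_TIMEOUT
```

The point of the sketch is that the timeout fires not because the client died, but because its refresh was delayed past the threshold, exactly what a saturated disk queue would cause.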
Joined: 28 Nov 09 Posts: 17 Credit: 3,974,186 RAC: 0
SIXTRACK is definitely in a loop. This handful of WUs has been restarting itself at intervals all night. The BOINC error log is littered with reports of the form below. I will have to reset, I guess...

04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_16000_job.HLLHC_b3_16000.0732__30__s__62.31_60.32__15_17__5__35.2942_1_sixvf_boinc7819_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_14000_job.HLLHC_b3_14000.0732__27__s__62.31_60.32__17_19__5__10.5883_1_sixvf_boinc7103_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_10000_job.HLLHC_b3_10000.0732__28__s__62.31_60.32__17_19__5__17.6471_1_sixvf_boinc7367_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__21.1765_1_sixvf_boinc3514_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__19.4118_1_sixvf_boinc3513_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__14.1177_1_sixvf_boinc3510_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__11_13__5__84.7061_1_sixvf_boinc6955_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__11_13__5__17.6471_1_sixvf_boinc6917_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__17_19__5__3.52942_1_sixvf_boinc7059_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__15_17__5__88.2355_1_sixvf_boinc7057_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__15_17__5__37.0589_1_sixvf_boinc7028_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__15_17__5__17.6471_1_sixvf_boinc7017_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
Joined: 28 Nov 09 Posts: 17 Credit: 3,974,186 RAC: 0
"Tasks checkpoint to disk at most every 600 seconds" - did this several years ago across the BOINC farm. The disks were threatening to melt :).
Joined: 28 Nov 09 Posts: 17 Credit: 3,974,186 RAC: 0
"We are seeing many many disk limit exceeded so maybe SixTrack is in a loop" - just checked all the slot directories. Total disk usage for any of them is between 335 MB and 415 MB. These WUs appear to have the new 500 MB disk space limit per BOINC_TASK_STATE.XML.
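A quick way to make the check above repeatable is to sum the file sizes under each slot directory and compare against the per-task limit. A minimal Python sketch, assuming the 500 MB figure quoted above; the helper names are illustrative, not part of BOINC:

```python
import os

# Per-task disk limit reported in boinc_task_state.xml for these WUs
# (assumption taken from the post above: 500 MB).
DISK_LIMIT_BYTES = 500 * 1024 * 1024

def slot_usage_bytes(slot_dir):
    """Total size of all regular files under a BOINC slot directory."""
    total = 0
    for root, _dirs, files in os.walk(slot_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total

def within_limit(slot_dir, limit=DISK_LIMIT_BYTES):
    """True if the slot's on-disk footprint is still under the limit."""
    return slot_usage_bytes(slot_dir) <= limit
```

Run over each `slots/N` directory, this would have shown the 335-415 MB figures directly, i.e. under the limit, supporting the conclusion that the hangs were not caused by the disk quota.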
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Now that we seem to be getting back to "normal", we shall investigate the w-b3 tasks locally and find the problem.
Joined: 28 Nov 09 Posts: 17 Credit: 3,974,186 RAC: 0
And thanks for the prompt reaction in pushing things back towards "normal".
Joined: 10 Apr 14 Posts: 5 Credit: 1,106,142 RAC: 0
One of my b3's completed and validated: w-b3_-4000_job.HLLHC_b3_-4000.0732__3__s__62.31_60.32__17_19__5__65.2943_1_sixvf_boinc2087

But yeah, I think it's best that that batch of WUs gets aborted. It doesn't make sense to waste CPU time on units that will probably fail.
©2024 CERN