Message boards : Number crunching : Sixtrack has heartbeat failure -- hangs
Brian Priebe

Joined: 28 Nov 09
Posts: 17
Credit: 3,974,186
RAC: 0
Message 26827 - Posted: 4 Oct 2014, 6:33:17 UTC

A number of recent WUs processed to 90%+ and then hung, with time remaining showing "---". The SIXTRACK executable shows as still running but consumes no CPU time. STDERR.TXT in each of the affected slot directories reports that the heartbeat was not detected. If you shut down BOINC and restart it, the process restarts from the last checkpoint and then does exactly the same thing all over again.

00:04:04 (11060): No heartbeat from client for 30 sec - exiting
01:11:10 (4228): No heartbeat from client for 30 sec - exiting
01:43:52 (8532): No heartbeat from client for 30 sec - exiting
01:49:31 (8276): No heartbeat from client for 30 sec - exiting
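For anyone who wants to check their own host, a quick way to scan every slot directory for these messages (the BOINC data path below is an assumption; adjust it to your installation):

```shell
# Scan all BOINC slot directories for heartbeat failures.
# BOINC_DIR is an assumption -- point it at your BOINC data directory.
BOINC_DIR="${BOINC_DIR:-/var/lib/boinc-client}"
grep -H "No heartbeat from client" "$BOINC_DIR"/slots/*/stderr.txt 2>/dev/null || true
```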

This has happened to the following WUs so far:

w-b3_6000_job.HLLHC_b3_6000.0732__26__s__62.31_60.32__13_15__5__35.2942_1_sixvf_boinc6769_1

w-b3_16000_job.HLLHC_b3_16000.0732__30__s__62.31_60.32__15_17__5__35.2942_1_sixvf_boinc7819_1

w-b3_10000_job.HLLHC_b3_10000.0732__28__s__62.31_60.32__17_19__5__17.6471_1_sixvf_boinc7367_1

w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__21.1765_1_sixvf_boinc3514_0

w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__19.4118_1_sixvf_boinc3513_0

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26835 - Posted: 4 Oct 2014, 9:45:37 UTC - in response to Message 26827.  

Thank you Brian; that is very useful input. We are seeing
many "disk limit exceeded" errors, so maybe SixTrack is in
a loop... We shall see.
Ananas

Joined: 17 Jul 05
Posts: 102
Credit: 542,016
RAC: 0
Message 26841 - Posted: 4 Oct 2014, 11:56:11 UTC
Last modified: 4 Oct 2014, 12:01:31 UTC

Heartbeat errors are often a result of excessive HDD I/O. Running one workunit with more than 30 files constantly being written is something a system can still survive easily. With four concurrent workunits, each writing a lot of data to the HDD - especially if it has to jump between 128 or more open files permanently - the load renders the HDD write cache useless.

If BOINC now tries to evaluate the HDD usage (in order to check the limits), or just wants to copy and rewrite client_state.xml, it might end up at the end of quite a long HDD access queue. While it is stuck in that queue, it does nothing but wait, and it has no chance to refresh the timestamp in shared memory.
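As a rough sketch (my own illustration, not BOINC's actual code), the timeout behind the "No heartbeat from client for 30 sec - exiting" messages boils down to comparing the age of the client's shared-memory timestamp against a 30-second limit:

```shell
# Illustration only -- not BOINC source. The science app exits once the
# client's last heartbeat timestamp is HEARTBEAT_TIMEOUT or more seconds old.
HEARTBEAT_TIMEOUT=30

heartbeat_alive() {
    # $1 = seconds since the client last refreshed its shared-memory timestamp
    [ "$1" -lt "$HEARTBEAT_TIMEOUT" ]
}

heartbeat_alive 12 && echo "heartbeat OK"
heartbeat_alive 45 || echo "No heartbeat from client for ${HEARTBEAT_TIMEOUT} sec - exiting"
```

So a client stuck in a long disk-access queue for longer than that limit is enough to trigger the exit, even though nothing has actually crashed.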

I see a similar effect when I run too many concurrent Malaria WUs, which write a lot of data for checkpointing (it seems they even use the zlib functions gzopen/gzwrite). For Malaria it helped to reduce the checkpoint frequency *) - but there is a major disadvantage: if it does not fix the heartbeat bug, your results will be thrown back much further on each restart, so watch your host carefully for a while after changing the setting.

*) e.g. "Tasks checkpoint to disk at most every 600 seconds"
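For reference, the same limit can also be set locally via global_prefs_override.xml in the BOINC data directory; disk_interval is the checkpoint setting, and 600 matches the example above:

```xml
<global_preferences>
   <disk_interval>600</disk_interval>
</global_preferences>
```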
Brian Priebe

Joined: 28 Nov 09
Posts: 17
Credit: 3,974,186
RAC: 0
Message 26842 - Posted: 4 Oct 2014, 13:13:48 UTC - in response to Message 26835.  

SIXTRACK is definitely in a loop. This handful of WUs has been restarting itself at intervals all night. The BOINC event log is littered with reports of the form below. I will have to reset, I guess...

04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_16000_job.HLLHC_b3_16000.0732__30__s__62.31_60.32__15_17__5__35.2942_1_sixvf_boinc7819_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_14000_job.HLLHC_b3_14000.0732__27__s__62.31_60.32__17_19__5__10.5883_1_sixvf_boinc7103_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_10000_job.HLLHC_b3_10000.0732__28__s__62.31_60.32__17_19__5__17.6471_1_sixvf_boinc7367_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__21.1765_1_sixvf_boinc3514_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__19.4118_1_sixvf_boinc3513_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__14.1177_1_sixvf_boinc3510_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__11_13__5__84.7061_1_sixvf_boinc6955_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__11_13__5__17.6471_1_sixvf_boinc6917_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__17_19__5__3.52942_1_sixvf_boinc7059_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__15_17__5__88.2355_1_sixvf_boinc7057_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__15_17__5__37.0589_1_sixvf_boinc7028_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__15_17__5__17.6471_1_sixvf_boinc7017_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
Brian Priebe

Joined: 28 Nov 09
Posts: 17
Credit: 3,974,186
RAC: 0
Message 26843 - Posted: 4 Oct 2014, 13:15:24 UTC - in response to Message 26841.  

Tasks checkpoint to disk at most every 600 seconds


Did this several years ago across the BOINC farm. The disks were threatening to melt :).
Brian Priebe

Joined: 28 Nov 09
Posts: 17
Credit: 3,974,186
RAC: 0
Message 26844 - Posted: 4 Oct 2014, 13:39:31 UTC - in response to Message 26835.  

We are seeing many many disk limit exceeded so maybe SixTrack is in
a loop

Just checked all the slot directories. Total disk usage for each of them is between 335MB and 415MB. These WUs appear to have the new 500MB disk-space limit, per BOINC_TASK_STATE.XML.
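For anyone who wants to run the same check, per-slot usage can be listed with du (the BOINC data path is an assumption; adjust it to your installation):

```shell
# Per-slot disk usage in megabytes.
# BOINC_DIR is an assumption -- point it at your BOINC data directory.
BOINC_DIR="${BOINC_DIR:-/var/lib/boinc-client}"
du -sm "$BOINC_DIR"/slots/* 2>/dev/null || true
```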
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26848 - Posted: 4 Oct 2014, 19:01:35 UTC

Now that we seem to be getting back to "normal", we shall investigate
the w-b3 tasks locally and track down the problem.
Brian Priebe

Joined: 28 Nov 09
Posts: 17
Credit: 3,974,186
RAC: 0
Message 26849 - Posted: 4 Oct 2014, 20:09:48 UTC - in response to Message 26848.  

And thanks for the prompt reaction to push things back in the "normal" direction.
Andrew Sanchez

Joined: 10 Apr 14
Posts: 5
Credit: 1,106,142
RAC: 0
Message 26850 - Posted: 4 Oct 2014, 20:21:41 UTC

One of my b3's was completed and validated:
w-b3_-4000_job.HLLHC_b3_-4000.0732__3__s__62.31_60.32__17_19__5__65.2943_1_sixvf_boinc2087

But yeah, I think it's best that that batch of WUs gets aborted. It doesn't make sense to waste CPU time on units that will probably fail.



©2024 CERN