Message boards : Number crunching : Sixtrack has heartbeat failure -- hangs
Brian Priebe

Joined: 28 Nov 09
Posts: 17
Credit: 3,974,186
RAC: 0
Message 26827 - Posted: 4 Oct 2014, 6:33:17 UTC

A number of recent WUs processed to 90%+ and then hung, with time remaining showing "---". The SIXTRACK executable shows as still running but consumes no CPU time. STDERR.TXT in each of the affected slot directories reports that the heartbeat was not detected. If you shut down BOINC and restart it, the process restarts from the last checkpoint and then does exactly the same thing all over again.

00:04:04 (11060): No heartbeat from client for 30 sec - exiting
01:11:10 (4228): No heartbeat from client for 30 sec - exiting
01:43:52 (8532): No heartbeat from client for 30 sec - exiting
01:49:31 (8276): No heartbeat from client for 30 sec - exiting
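For anyone who wants to check their own host, a quick way to scan every slot directory for these messages (the BOINC data path below is an assumption; adjust it to your installation):

```shell
# Scan all BOINC slot directories for heartbeat failures.
# BOINC_DIR is an assumption -- point it at your BOINC data directory.
BOINC_DIR="${BOINC_DIR:-/var/lib/boinc-client}"
grep -H "No heartbeat from client" "$BOINC_DIR"/slots/*/stderr.txt 2>/dev/null || true
```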

This has happened to the following WUs so far:

w-b3_6000_job.HLLHC_b3_6000.0732__26__s__62.31_60.32__13_15__5__35.2942_1_sixvf_boinc6769_1

w-b3_16000_job.HLLHC_b3_16000.0732__30__s__62.31_60.32__15_17__5__35.2942_1_sixvf_boinc7819_1

w-b3_10000_job.HLLHC_b3_10000.0732__28__s__62.31_60.32__17_19__5__17.6471_1_sixvf_boinc7367_1

w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__21.1765_1_sixvf_boinc3514_0

w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__19.4118_1_sixvf_boinc3513_0

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26835 - Posted: 4 Oct 2014, 9:45:37 UTC - in response to Message 26827.  

Thank you Brian; that is very useful input. We are seeing
many "disk limit exceeded" errors, so maybe SixTrack is in
a loop... We shall see.
Ananas

Joined: 17 Jul 05
Posts: 102
Credit: 542,016
RAC: 0
Message 26841 - Posted: 4 Oct 2014, 11:56:11 UTC
Last modified: 4 Oct 2014, 12:01:31 UTC

Heartbeat errors are often a result of excessive HDD I/O. Running one workunit with more than 30 files constantly being written is something a system can still survive easily. With four concurrent workunits, each writing a lot of data to the HDD - especially if it has to jump between 128 or more open files permanently - the load renders the HDD write cache useless.

If BOINC now tries to evaluate the HDD usage (in order to check the limits), or just wants to copy and rewrite client_state.xml, it might end up at the end of quite a long HDD access queue. While it is stuck in that queue, it does nothing but wait, and it has no chance to refresh the timestamp in shared memory.
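As a rough sketch (my own illustration, not BOINC's actual code), the timeout behind the "No heartbeat from client for 30 sec - exiting" messages boils down to comparing the age of the client's shared-memory timestamp against a 30-second limit:

```shell
# Illustration only -- not BOINC source. The science app exits once the
# client's last heartbeat timestamp is HEARTBEAT_TIMEOUT or more seconds old.
HEARTBEAT_TIMEOUT=30

heartbeat_alive() {
    # $1 = seconds since the client last refreshed its shared-memory timestamp
    [ "$1" -lt "$HEARTBEAT_TIMEOUT" ]
}

heartbeat_alive 12 && echo "heartbeat OK"
heartbeat_alive 45 || echo "No heartbeat from client for ${HEARTBEAT_TIMEOUT} sec - exiting"
```

So a client stuck in a long disk-access queue for longer than that limit is enough to trigger the exit, even though nothing has actually crashed.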

I see a similar effect when I run too many concurrent Malaria WUs, which write a lot of data for checkpointing (it seems they even use the zlib functions gzopen/gzwrite). For Malaria it helped to reduce the checkpoint frequency *) - but there is a major disadvantage: if it does not fix the heartbeat bug, your results will be thrown back much further on each restart, so watch your host carefully for a while after changing the setting.

*) e.g. "Tasks checkpoint to disk at most every 600 seconds"
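For reference, the same limit can also be set locally via global_prefs_override.xml in the BOINC data directory; disk_interval is the checkpoint setting, and 600 matches the example above:

```xml
<global_preferences>
   <disk_interval>600</disk_interval>
</global_preferences>
```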
Brian Priebe

Joined: 28 Nov 09
Posts: 17
Credit: 3,974,186
RAC: 0
Message 26842 - Posted: 4 Oct 2014, 13:13:48 UTC - in response to Message 26835.  

SIXTRACK is definitely in a loop. This handful of WUs has been restarting itself at intervals all night. The BOINC event log is littered with reports of the form below. I will have to reset, I guess...

04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_16000_job.HLLHC_b3_16000.0732__30__s__62.31_60.32__15_17__5__35.2942_1_sixvf_boinc7819_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_14000_job.HLLHC_b3_14000.0732__27__s__62.31_60.32__17_19__5__10.5883_1_sixvf_boinc7103_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_10000_job.HLLHC_b3_10000.0732__28__s__62.31_60.32__17_19__5__17.6471_1_sixvf_boinc7367_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__21.1765_1_sixvf_boinc3514_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__19.4118_1_sixvf_boinc3513_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_18000_job.HLLHC_b3_18000.0732__13__s__62.31_60.32__13_15__5__14.1177_1_sixvf_boinc3510_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__11_13__5__84.7061_1_sixvf_boinc6955_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__11_13__5__17.6471_1_sixvf_boinc6917_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__17_19__5__3.52942_1_sixvf_boinc7059_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__15_17__5__88.2355_1_sixvf_boinc7057_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__15_17__5__37.0589_1_sixvf_boinc7028_0 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
04-Oct-2014 01:43:22 | LHC@home 1.0 | Task w-b3_8000_job.HLLHC_b3_8000.0732__27__s__62.31_60.32__15_17__5__17.6471_1_sixvf_boinc7017_1 exited with zero status but no 'finished' file
04-Oct-2014 01:43:22 | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.
Brian Priebe

Joined: 28 Nov 09
Posts: 17
Credit: 3,974,186
RAC: 0
Message 26843 - Posted: 4 Oct 2014, 13:15:24 UTC - in response to Message 26841.  

Tasks checkpoint to disk at most every 600 seconds


Did this several years ago across the BOINC farm. The disks were threatening to melt :).
Brian Priebe

Joined: 28 Nov 09
Posts: 17
Credit: 3,974,186
RAC: 0
Message 26844 - Posted: 4 Oct 2014, 13:39:31 UTC - in response to Message 26835.  

We are seeing many many disk limit exceeded so maybe SixTrack is in
a loop

Just checked all the slot directories. Total disk usage for each of them is between 335MB and 415MB. These WUs appear to have the new 500MB disk-space limit, per BOINC_TASK_STATE.XML.
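For anyone who wants to run the same check, per-slot usage can be listed with du (the BOINC data path is an assumption; adjust it to your installation):

```shell
# Per-slot disk usage in megabytes.
# BOINC_DIR is an assumption -- point it at your BOINC data directory.
BOINC_DIR="${BOINC_DIR:-/var/lib/boinc-client}"
du -sm "$BOINC_DIR"/slots/* 2>/dev/null || true
```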
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26848 - Posted: 4 Oct 2014, 19:01:35 UTC

Now that we seem to be getting back to "normal", we shall investigate
the w-b3 tasks locally and track down the problem.
Brian Priebe

Joined: 28 Nov 09
Posts: 17
Credit: 3,974,186
RAC: 0
Message 26849 - Posted: 4 Oct 2014, 20:09:48 UTC - in response to Message 26848.  

And thanks for the prompt reaction to push things back in the "normal" direction.
Andrew Sanchez

Joined: 10 Apr 14
Posts: 5
Credit: 1,106,142
RAC: 0
Message 26850 - Posted: 4 Oct 2014, 20:21:41 UTC

One of my b3's was completed and validated:
w-b3_-4000_job.HLLHC_b3_-4000.0732__3__s__62.31_60.32__17_19__5__65.2943_1_sixvf_boinc2087

But yeah, I think it's best that that batch of WUs gets aborted. It doesn't make sense to waste CPU time on units that will probably fail.



©2024 CERN