Message boards : Number crunching : WU % progress stuck

Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 233
Credit: 10,718,799
RAC: 3,994
Message 26572 - Posted: 27 May 2014, 16:52:08 UTC
Last modified: 27 May 2014, 16:55:17 UTC

I've not been paying attention over the weekend, but I find today that I have a WU on my virtual Ubuntu, listed as Running in BOINC, whose % progress has stopped at 35.701% although elapsed time continues past 30 hours (sixtrack is not listed in "top"). Obviously this is not right, but before I abort it, is there anything that might be useful for me to save to help identify what caused the blockage? My wingman, on an older Linux, returned the WU OK.
My other virtual box is fine, and this one usually is too.
ID: 26572
Conan
Joined: 6 Jul 06
Posts: 107
Credit: 511,942
RAC: 0
Message 26574 - Posted: 27 May 2014, 22:12:50 UTC

You could try suspending and then resuming the task to see if it continues onwards.
I have had a few tasks on another project do just what you describe, and suspending and resuming let them continue without my having to abort them.

Perhaps before you do this, check the contents of the slot directory and maybe post them; the project may be able to find out what caused the problem.

There is a good chance that the CPU usage is at zero as well.

Conan
ID: 26574
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 561
Credit: 348,915,362
RAC: 597,504
Message 26575 - Posted: 28 May 2014, 3:00:16 UTC

You can check the stderr.txt file in the slot directory and see if there is anything in there.
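
For anyone who wants to collect this in one go, here is a minimal sketch that gathers the last few lines of stderr.txt from every slot. The function name tail_stderr is made up for illustration, and the location of the slots directory is an assumption: it depends on how BOINC was installed, so pass in whatever path your own data directory uses.

```python
from pathlib import Path


def tail_stderr(slots_dir, max_lines=20):
    """Return {slot_name: last lines of stderr.txt} for each slot that has one.

    slots_dir is the "slots" folder inside the BOINC data directory.
    Its location varies by install, so it is passed in, not hard-coded.
    """
    result = {}
    root = Path(slots_dir)
    if not root.is_dir():
        return result  # nothing to report if the path does not exist
    for slot in sorted(root.iterdir()):
        log = slot / "stderr.txt"
        if log.is_file():
            # errors="replace" keeps going even if the log has odd bytes
            lines = log.read_text(errors="replace").splitlines()
            result[slot.name] = lines[-max_lines:]
    return result
```

Calling tail_stderr("<your BOINC data dir>/slots") and printing the result gives something that can be pasted straight into a post before the task is aborted.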

ID: 26575
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 233
Credit: 10,718,799
RAC: 3,994
Message 26580 - Posted: 28 May 2014, 9:47:39 UTC

Thanks, guys,
Only a single line in stderr:
17:37:52 (24103): No heartbeat from core client for 30 sec - exiting

I saved the whole slot before binning the WU just in case there might be anything of note tucked away elsewhere. Suspend/Resume didn't do anything and I should have thought about Exiting and restarting Boinc.

It's gone, now, and other WUs since have run fine so who knows.

If it is felt this thread is, therefore, no longer of interest, feel free to remove it.
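
In case it helps anyone searching saved slots for the same symptom, the stderr line quoted above can be picked apart with a small regular expression. This is only a sketch: the pattern is inferred from that single line, so other BOINC versions or applications may word the message differently, and the name parse_heartbeat_exit is hypothetical.

```python
import re

# Pattern inferred from the one line in this thread; an assumption,
# not a documented log format.
HEARTBEAT_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): "
    r"No heartbeat from core client for (?P<secs>\d+) sec - exiting$"
)


def parse_heartbeat_exit(line):
    """Return time, PID and timeout from a heartbeat-exit line, or None."""
    m = HEARTBEAT_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": m["time"],
        "pid": int(m["pid"]),
        "timeout_sec": int(m["secs"]),
    }
```

Fed the line from my stderr, it returns the time 17:37:52, PID 24103 and a 30 second timeout. The PID is the useful part, since it can be compared against what "top" shows.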
ID: 26580
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 12 Jul 11
Posts: 851
Credit: 1,616,232
RAC: 132
Message 26581 - Posted: 28 May 2014, 22:34:02 UTC - in response to Message 26572.  

Hi Ray; don't know what to say really....
SixTrack "never" loops like you say!
If you kill it I don't know if you can restart it but it
should restart from the last checkpoint in the same directory.
Eric.

ID: 26581
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 233
Credit: 10,718,799
RAC: 3,994
Message 26582 - Posted: 29 May 2014, 9:18:54 UTC

Thanks, Eric,
Another one got stuck overnight at 31..% with 14hrs ticking up on elapsed time. An exit and restart of Boinc dropped it back by 0.1% and time back to 6hrs but it has now passed the previous stop point and sixtrack shows 90% CPU usage in "top" whereas it wasn't even listed while it was sulking. That was the bit that might be noteworthy; Noted in Boinc as running but no associated process running.
If no one else is experiencing similar, then it might just be some quirk of that system running inside a VM.
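
The "listed as running but no process" check can be scripted instead of eyeballing "top". The sketch below is Linux only and reads /proc directly; pids_by_name is a made-up helper name, and matching on the substring "sixtrack" is an assumption about how the executable is named on a given host.

```python
import os


def pids_by_name(name, proc_root="/proc"):
    """Return sorted PIDs whose command name contains `name` (Linux /proc)."""
    pids = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return pids  # no /proc available, e.g. not Linux
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                # comm holds the command name, truncated by the kernel
                comm = f.read().strip()
        except OSError:
            continue  # process exited meanwhile or is not readable
        if name in comm:
            pids.append(int(entry))
    return sorted(pids)
```

An empty list from pids_by_name("sixtrack") while BOINC still shows the task as Running would be the same situation I saw: a task with a ticking clock and nothing behind it.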

ID: 26582
Richard Haselgrove
Joined: 27 Oct 07
Posts: 182
Credit: 3,295,818
RAC: 0
Message 26583 - Posted: 29 May 2014, 9:57:44 UTC - in response to Message 26582.  

Cases where the project science application fails to make any progress, but appears to be active enough for BOINC to think it's still running, have been reported from pretty much every BOINC project for as long as I've been running it. I only notice the Windows reports, and I have a subjective perception that it happens more often on machines with AMD processors, but apart from that nobody seems to have any clue why it happens, or to have been bothered to look. It's relatively rare.

My personal suspicion is that any cause, or cure, would be found in the BOINC API (Application Programming Interface) - a library supplied by the BOINC developers to projects, to support common standards for information and control messages between the BOINC client and the science apps. Unfortunately, this is cinderella programming, which it's hard to find anybody interested in - it's neither "sexy science" nor "groovy graphics". If any professional-grade Windows programmer, with modern systems-engineering knowledge of the multi-threaded Windows kernel, were to grab the API by the scruff of its neck and shake it into shape, we'd all be grateful.
ID: 26583


©2019 CERN