WU % progress stuck
Author | Message |
---|---|
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0 |
I've not been paying attention over the weekend, but I find today that I have a WU on my virtual Ubuntu, listed as Running in BOINC, whose % progress has stopped at 35.701% although elapsed time continues past 30 hours (sixtrack is not listed in "top"). Obviously this is not right, but before I abort it, is there anything that might be useful for me to save to help identify what caused the blockage? My wingman, on an older Linux, returned the WU OK. The other virtual box is fine, and this one usually is too. |
Joined: 6 Jul 06 Posts: 108 Credit: 663,175 RAC: 68 |
You could try suspending and then resuming the task to see whether it continues. I have had a few on another project do just what you describe, and suspending and resuming let them continue without my having to abort them. Perhaps before you do this, check the contents of the slot directory and maybe post them; the project may be able to find out what caused the problem. There is a good chance that the CPU usage is at zero as well. Conan |
Joined: 27 Sep 08 Posts: 807 Credit: 652,473,500 RAC: 278,149 |
You can check the stderr.txt file in the slot directory and see if there is anything in there. |
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0 |
Thanks, guys. Only a single line in stderr: "17:37:52 (24103): No heartbeat from core client for 30 sec - exiting". I saved the whole slot before binning the WU, just in case there might be anything of note tucked away elsewhere. Suspend/Resume didn't do anything, and I should have thought about exiting and restarting BOINC. It's gone now, and other WUs since have run fine, so who knows. If it is felt this thread is therefore no longer of interest, feel free to remove it. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Hi Ray; don't know what to say really... SixTrack "never" loops like you say! If you kill it, I don't know if you can restart it, but it should restart from the last checkpoint in the same directory. Eric. |
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0 |
Thanks, Eric. Another one got stuck overnight at 31..% with 14hrs ticking up on elapsed time. An exit and restart of BOINC dropped it back by 0.1% and the time back to 6hrs, but it has now passed the previous stop point and sixtrack shows 90% CPU usage in "top", whereas it wasn't even listed while it was sulking. That was the bit that might be noteworthy: noted in BOINC as running, but with no associated process running. If no one else is experiencing similar, then it might just be some quirk of that system running inside a VM. |
Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
Cases where the project science application fails to make any progress, but appears to be active enough for BOINC to think it's still running, have been reported from pretty much every BOINC project for as long as I've been running it. I only notice the Windows reports, and I have a subjective perception that it happens more often on machines with AMD processors, but apart from that nobody seems to have any clue why it happens, or to have been bothered to look. It's relatively rare. My personal suspicion is that any cause, or cure, would be found in the BOINC API (Application Programming Interface), a library supplied by the BOINC developers to projects to support common standards for information and control messages between the BOINC client and the science apps. Unfortunately, this is Cinderella programming, which it's hard to find anybody interested in: it's neither "sexy science" nor "groovy graphics". If any professional-grade Windows programmer, with modern systems-engineering knowledge of the multi-threaded Windows kernel, were to grab the API by the scruff of its neck and shake it into shape, we'd all be grateful. |
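
For anyone who hits the same symptom, the two checks described in this thread (looking in each slot's stderr.txt for the lost-heartbeat line, and confirming whether a sixtrack process actually exists) can be scripted. What follows is only a rough sketch, not anything supplied by the project or by BOINC. The function names are made up for illustration, the regular expression is modelled on the single stderr line quoted above, and the process check assumes a Linux host with /proc mounted.

```python
import re
from pathlib import Path

# Modelled on the one line quoted in this thread:
#   17:37:52 (24103): No heartbeat from core client for 30 sec - exiting
# Other BOINC versions may word this differently (assumption).
HEARTBEAT_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): "
    r"No heartbeat from core client for (?P<secs>\d+) sec"
)


def find_heartbeat_losses(stderr_text):
    """Return a (time, pid, seconds) tuple for every lost-heartbeat line."""
    hits = []
    for line in stderr_text.splitlines():
        match = HEARTBEAT_RE.match(line.strip())
        if match:
            hits.append(
                (match["time"], int(match["pid"]), int(match["secs"]))
            )
    return hits


def scan_slots(slots_dir):
    """Map slot name -> heartbeat losses found in that slot's stderr.txt.

    slots_dir is whatever directory holds the numbered slot folders on
    your installation; the caller has to supply it.
    """
    report = {}
    for stderr_file in sorted(Path(slots_dir).glob("*/stderr.txt")):
        text = stderr_file.read_text(errors="replace")
        hits = find_heartbeat_losses(text)
        if hits:
            report[stderr_file.parent.name] = hits
    return report


def process_names(proc_root="/proc"):
    """Collect names of running processes from /proc/<pid>/comm (Linux only)."""
    names = set()
    root = Path(proc_root)
    if not root.is_dir():
        return names  # not Linux, or /proc is not mounted
    for entry in root.iterdir():
        if entry.name.isdigit():
            try:
                names.add((entry / "comm").read_text().strip())
            except OSError:
                continue  # process exited while we were reading
    return names


def app_is_running(name_fragment, names):
    """True if any process name contains the fragment (case-insensitive)."""
    fragment = name_fragment.lower()
    return any(fragment in name.lower() for name in names)
```

To use it, call `scan_slots()` on the slots folder inside your BOINC data directory, then `app_is_running("sixtrack", process_names())`. A task shown as Running while the second call returns False matches the state described above. Linux truncates the names in `comm` to 15 characters, which is why the check matches on a fragment rather than the full executable name.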
©2024 CERN