Message boards : Number crunching : Suggestions to possibly lower error rates a bit

marmot
Joined: 5 Nov 15
Posts: 144
Credit: 6,301,268
RAC: 0
Message 33061 - Posted: 14 Nov 2017, 4:26:42 UTC

*When any LHC@Home task is suspended on the BOINC client, the server should refuse to send new tasks.
This should be possible, since other BOINC projects I've run have refused new work while a local task was suspended.

End users might have issues with particular tasks, a computer problem, or high-priority work that takes precedence over BOINC, and may have suspended tasks manually to deal with it. LHC@Home is especially sensitive to this: trying to suspend 32 tasks at once always ends with aborted VMs instead of saved states, so tasks have to be suspended 4 to 8 at a time.


*When starting 16 to 40 LHC@Home tasks from a cold BOINC start, stagger the startups (I'm not sure this is possible within the limits of the current BOINC client software).

*If staggering isn't possible, then increase the number of retries for getting a stable connection to Condor, or increase the time a task waits before timing out the connection. Make the tasks more stubborn (more fault tolerant) before they give up the ghost (computation error).
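
To make the "more stubborn" idea concrete, here is a minimal sketch of retrying a connection with exponential backoff and random jitter. It is only an illustration: `connect_with_retries`, its parameters, and the `connect` callable are hypothetical names, not part of BOINC, the VM image, or Condor, and the default attempt count and delays are made-up values.

```python
import random
import time


def connect_with_retries(connect, max_attempts=6, base_delay=5.0,
                         max_delay=300.0, sleep=time.sleep, rng=random.random):
    """Call connect() until it succeeds or max_attempts is reached.

    `connect` is a hypothetical callable that raises ConnectionError on
    failure. All defaults here are illustrative assumptions.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # out of attempts, give up below
            # The wait ceiling doubles after every failure, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many tasks from retrying at the same instant.
            sleep(ceiling * rng())
    raise last_error
```

The jitter matters as much as the extra retries: if 32 tasks all lose their connection at the same moment, fixed retry intervals would just repeat the same spike a few seconds later.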


Over 70% of my errors occur within the first 20 minutes after task startup. They are mostly communication issues caused by a data spike, either on a cold BOINC startup or when the ISP resets my modem in the early morning hours (local time) and all the tasks try to reestablish their Condor connections at once.
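
One way to picture spreading out that spike is to start (or reconnect) tasks in small batches with a gap between batches. The sketch below only computes the start offsets. `staggered_start_offsets`, the batch size of 4, and the 60-second gap are assumptions for illustration; whether the BOINC client can schedule starts this way is not something I know.

```python
def staggered_start_offsets(task_count, batch_size=4, gap_seconds=60.0):
    """Return a start offset in seconds for each task.

    Tasks are grouped into batches of `batch_size`; each batch starts
    `gap_seconds` after the previous one. Values are illustrative only.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [(index // batch_size) * gap_seconds for index in range(task_count)]


if __name__ == "__main__":
    # 32 tasks in batches of 4: the last batch starts 7 gaps after the first.
    offsets = staggered_start_offsets(32)
    print(offsets[0], offsets[-1])
```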


The second-highest error rate comes from tasks shutting down without doing any work. Is that because Condor had no work to send down to the VM in that time period?



©2024 CERN