Message boards : Number crunching : Computation error on network loss

jelle

Joined: 26 Sep 11
Posts: 37
Credit: 7,807,848
RAC: 0
Message 25696 - Posted: 24 Aug 2013, 1:10:07 UTC

Any suggestions on how to avoid computation errors as a result of network hiccups?

I just lost 28 hours of crunching on a valiantly working Intel Atom-powered netbook because our Telecom provider had some hiccups. I realize this is a known problem with LHC, but my Einstein tasks (which also run for a long time) are not bothered by a temporary loss of connection. Is there any way to solve this for LHC as well?
ID: 25696
captainjack

Joined: 21 Jun 10
Posts: 42
Credit: 13,648,408
RAC: 35,713
Message 25698 - Posted: 24 Aug 2013, 15:07:19 UTC

Looks like you are running Linux and got a signal 11 error. The same thing happens over at WCG. Some of their projects will abort with a signal 11 and some keep running. Some people believe it is as much of a Linux problem as it is a BOINC/research problem. It never happens on my Windows 7 boxes.
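For readers unfamiliar with the number: on Linux, signal 11 is SIGSEGV, a segmentation fault, meaning the process touched memory it was not allowed to access. A minimal sketch for decoding a signal number using Python's standard `signal` module follows; the helper name `describe_signal` is made up for illustration and is not part of BOINC.

```python
import signal


def describe_signal(number):
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to a signal known on this platform.
        return f"unknown signal {number}"


if __name__ == "__main__":
    # Signal 11 is the one reported in the failed tasks discussed here.
    print(describe_signal(11))
```

This only names the signal; it says nothing about why the application crashed.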

Many people are reluctant to switch to Windows for a variety of reasons, one of which is that Linux is reputed to be 10-15% faster than Windows. For some of the WCG projects, Linux is about twice as fast as Windows.

So we keep running Linux knowing that it will have an occasional hiccup but there will be an overall speed gain.

Hope that helps.
ID: 25698
jelle

Joined: 26 Sep 11
Posts: 37
Credit: 7,807,848
RAC: 0
Message 25699 - Posted: 24 Aug 2013, 20:26:00 UTC - in response to Message 25698.  

Interesting. I didn't know Linux gave a speed gain. That is some consolation for the occasional loss. However, it may still be good to figure out how Einstein manages to avoid the same problems.
ID: 25699
Phil

Joined: 26 Jul 05
Posts: 63
Credit: 4,083,755
RAC: 0
Message 25700 - Posted: 25 Aug 2013, 16:36:36 UTC

What I do know is that LHC requires your client to connect to the server at regular intervals.
Perhaps if you select "network activity suspended" and only connect when you're online and using your machine, it might stop this error?
ID: 25700
jelle

Joined: 26 Sep 11
Posts: 37
Credit: 7,807,848
RAC: 0
Message 25701 - Posted: 27 Aug 2013, 3:07:22 UTC - in response to Message 25700.  

Just happened again. Got a computation error after 41 hours of crunching. Really bummed out by that.

Why does LHC need to maintain a connection while it is working on a job? If it does make connections, can it at least set a check point prior to doing so, and revert to that checkpoint if a connection fails?
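The checkpoint-and-revert idea in the question above could be sketched roughly as follows. This is an illustration only, under stated assumptions: it is not how the LHC application or BOINC actually works, and every name in it (`save_checkpoint`, `load_checkpoint`, `run`, `contact_server`) is hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # os.replace swaps the new file in place of the old one in a single step.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


def run(path, total_steps, contact_server=None):
    """Toy work loop: checkpoint before each network call, revert on failure."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if contact_server is not None:
            save_checkpoint(path, state)  # checkpoint before touching the network
            try:
                contact_server(state["step"])  # hypothetical network call
            except OSError:
                # Network failure (ConnectionError is an OSError): fall back to
                # the checkpoint and keep computing instead of aborting.
                state = load_checkpoint(path, state)
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
    save_checkpoint(path, state)
    return state
```

With this structure, a failed connection costs at most the work done since the last checkpoint rather than the whole task.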

I considered the suggestion to use "network activity suspended" but I don't think that would help. I also crunch Test-4-Theory, and that requires a continuous connection, for reasons that I do understand. Moreover, I could still get a computation error if my ISP or my wireless gets the wobbles while I am actually online and using the machine.

I don't understand why network loss should lead to a computation error.
ID: 25701
Ray Murray
Volunteer moderator

Joined: 29 Sep 04
Posts: 281
Credit: 11,888,115
RAC: 831
Message 25702 - Posted: 27 Aug 2013, 19:58:15 UTC

Hi Jelle,
There is no need to be connected while a WU is in progress. You only need a connection here when work is being downloaded or returned. Over at T4T, the network is needed for contact between the VM and the wrapper and BOINC, but the physical external network is again only needed during uploads and downloads at the end of each "job" and BOINC WU. You can unplug from the external network without any problem.
Your failed WUs had a Signal 11 error, as mentioned earlier in the thread, but I'm sorry, I don't know what causes that. It seems to be only your Atom that's having the problem. Is there something different about its setup compared to your other machines?
ID: 25702
