21) Message boards : Number crunching : Big problem: work units running with negative time. (Message 25805)
Posted 8 Sep 2013 by jelle
Post:
As I'm typing this I wonder if it has anything to do with DNS assumptions. I remember this being a problem when some settings were changed in Ubuntu in one of the recent versions. It caused trouble for T4T as well, until a VirtualBox update solved the issue. Again, pure speculation on my part.


Thinking further about it, I remember the issues that cropped up with T4T were a result of changes in how Ubuntu deals with resolv.conf.

Info on that here:
http://www.stgraber.org/2012/02/24/dns-in-ubuntu-12-04/

Could that be the source of problems for Ubuntu users?
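
A minimal sketch of how one might check whether a host's resolv.conf points at a local (loopback) resolver, which is the kind of setup that could confuse software that copies the host's DNS settings elsewhere. This is only an illustration of the speculation above: the function name is hypothetical, and the text is passed in rather than read from a fixed path.

```python
import ipaddress

def loopback_nameservers(resolv_conf_text):
    """Return the nameserver addresses in resolv.conf text that are loopback addresses."""
    found = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Only "nameserver <address>" lines are of interest here.
        if len(fields) >= 2 and fields[0] == "nameserver":
            try:
                addr = ipaddress.ip_address(fields[1])
            except ValueError:
                continue  # skip entries that are not plain IP addresses
            if addr.is_loopback:
                found.append(fields[1])
    return found
```

On a real machine the text could come from reading /etc/resolv.conf; a non-empty result would suggest that a local resolver is in use.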
22) Message boards : Number crunching : Big problem: work units running with negative time. (Message 25803)
Posted 7 Sep 2013 by jelle
Post:
I run BOINC 7.0.65 on openSUSE 12.3. It runs just fine. Did you install the correct libraries? You need these:

libwx_baseu-2_8-0-wxcontainer-2.8.12-17.1.1.x86_64
libwx_baseu_net-2_8-0-wxcontainer-2.8.12-17.1.1.x86_64
libwx_gtk2u_adv-2_8-0-wxcontainer-2.8.12-17.1.1.x86_64
libwx_gtk2u_core-2_8-0-wxcontainer-2.8.12-17.1.1.x86_64
libwx_gtk2u_html-2_8-0-wxcontainer-2.8.12-17.1.1.x86_64
wxWidgets-wxcontainer-compat-lib-config-2.8.12-17.1.1.x86_64


I tried again after installing some of the missing libraries listed above. It did not make a difference.

I then realised those libraries relate to the graphics widgets. This is presumably relevant for getting BOINC manager to display properly, but should not be relevant, I expect, for the BOINC tasks. I should also note that BOINC is fine on my machines, and continues to happily crunch away on Einstein, Rosetta, eON, and Test4Theory. It's only the recent LHC tasks, from the latest version, that crash out with these "lack of heartbeat" errors.

As I'm typing this I wonder if it has anything to do with DNS assumptions. I remember this being a problem when some settings were changed in Ubuntu in one of the recent versions. It caused trouble for T4T as well, until a VirtualBox update solved the issue. Again, pure speculation on my part.
23) Message boards : Number crunching : All tasks end with error (Message 25788)
Posted 6 Sep 2013 by jelle
Post:
Yep. That is the same problem as discussed in this thread:
http://lhcathomeclassic.cern.ch/sixtrack/forum_thread.php?id=3767

The errors and error messages described here are the same.
24) Message boards : Number crunching : Big problem: work units running with negative time. (Message 25774)
Posted 6 Sep 2013 by jelle
Post:
Out of curiosity, I installed WINE to see whether that made a difference. Unfortunately, it doesn't. Behaviour remains as before: tasks keep looping.
25) Message boards : Number crunching : Big problem: work units running with negative time. (Message 25769)
Posted 6 Sep 2013 by jelle
Post:
Looking a bit more at what is happening. The executable file that gets downloaded on my Linux machine is:
sixtrack_lin64_4463_pni.exe

This is identified as a DOS/Windows executable by the file manager. If that is indeed what it is, then it's not surprising that Linux cannot execute it. It might conceivably be executable if I had WINE installed, but I do not, and I have not tried that. Perhaps other Linux users do have WINE installed and therefore get to execute it anyway, but that is just speculation.

I also note that the executable attribute bit is not set for the .exe file after the download.
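
A file manager's guess can be cross-checked by looking at the file's first bytes: native Linux binaries start with the ELF magic number, while DOS/Windows executables start with "MZ". The sketch below does that and also reports whether the owner-execute bit is set. The function name is hypothetical, and this is only a rough check, not a full format parser.

```python
import os
import stat

def describe_binary(path):
    """Return (format guess from magic bytes, whether the owner-execute bit is set)."""
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic == b"\x7fELF":
        kind = "ELF"            # native Linux executable format
    elif magic[:2] == b"MZ":
        kind = "DOS/Windows"    # DOS stub header used by Windows executables
    else:
        kind = "unknown"
    mode = os.stat(path).st_mode
    executable = bool(mode & stat.S_IXUSR)
    return kind, executable
```

Calling it on the downloaded sixtrack file would show whether the ".exe" name is misleading or whether the file really is a Windows binary.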
26) Message boards : Number crunching : Big problem: work units running with negative time. (Message 25764)
Posted 5 Sep 2013 by jelle
Post:
Same here. I tried again this morning on my office computer, after Ubuntu pushed out a network manager update and I had rebooted the computer.

Results from the log file are:
Fri 06 Sep 2013 08:50:17 NZST | LHC@home 1.0 | Task W0w20cbb_w30cbb63__36__s__64.31_59.32__13_13.5__6__3_1_sixvf_boinc16640_1 exited with zero status but no 'finished' file

Fri 06 Sep 2013 08:50:17 NZST | LHC@home 1.0 | If this happens repeatedly you may need to reset the project.

Fri 06 Sep 2013 08:50:17 NZST | LHC@home 1.0 | Restarting task W0w20cbb_w30cbb63__36__s__64.31_59.32__13_13.5__6__3_1_sixvf_boinc16640_1 using sixtrack version 44603 (pni) in slot 1

This repeats endlessly.
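
To put a number on "endlessly", the restart messages can be counted per task. This is a minimal sketch that assumes the log lines have the "timestamp | project | message" layout shown above; the function name is hypothetical and the lines are passed in rather than read from any particular log file.

```python
import re
from collections import Counter

# Matches the message part of a restart line and captures the task name.
RESTART = re.compile(r"Restarting task (\S+) using")

def count_restarts(lines):
    """Count 'Restarting task ...' messages per task name."""
    counts = Counter()
    for line in lines:
        parts = line.split(" | ", 2)
        if len(parts) != 3:
            continue  # not a 'timestamp | project | message' line
        match = RESTART.search(parts[2])
        if match:
            counts[match.group(1)] += 1
    return counts
```

A task whose count keeps growing while its progress stays near zero is stuck in the loop described here.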

I'm running Xubuntu 12.04 (precise) 64-bit.
Linux Kernel 3.2.0-52-generic
BOINC version 7.0.65(x64)

As a side note, one of my home computers was still able to successfully complete an old LHC task this morning. Therefore, the problem is definitely caused by the new version.
27) Message boards : Number crunching : Big problem: work units running with negative time. (Message 25758)
Posted 4 Sep 2013 by jelle
Post:
Good morning from New Zealand.
I just came into the office, where my office computer also crunches LHC overnight. All of the recent tasks had crashed out with a computation error or were stuck in the same (almost endless) loops.

Example is Task: 19624429

You will see it crashed out with a computation error after too many exits.

I have now seen this problem on 4 different computers at 2 different locations, so I don't think it's just me.

I run Xubuntu Linux on all my machines, versions 12.04 on 3 of them and 12.10 on 1.

Please do let me know when this problem is fixed, because I'm suspending LHC for now.
28) Message boards : Number crunching : Big problem: work units running with negative time. (Message 25744)
Posted 4 Sep 2013 by jelle
Post:
I am getting a completely new experience with many of the new work units. They run to about 0.030% completion in 14 seconds on my computer, and then go back to zero %. The elapsed time in all those cases, as displayed in BOINC manager, also jumps back from around 14 to 4 seconds. Something I did not know was possible. In other words, elapsed time is jumping backwards.

I'm aborting those jobs now, because they just seem to be completely stuck without progress. Because it's my bed time I will suspend LHC for now. Please let me know if and when it is safe to resume LHC again.
29) Message boards : Number crunching : Computation error on network loss (Message 25701)
Posted 27 Aug 2013 by jelle
Post:
Just happened again. Got a computation error after 41 hours of crunching. Really bummed out by that.

Why does LHC need to maintain a connection while it is working on a job? If it does make connections, can it at least set a check point prior to doing so, and revert to that checkpoint if a connection fails?

I considered the suggestion to use "network activity suspended" but I don't think that would help. I also crunch Test-4-Theory and that requires a continuous connection; for reasons that I do understand. Moreover, I could still get a computation error if my ISP or my wireless gets the wobbles while I am actually online and using the machine.

I don't understand why network loss should lead to a computation error?
30) Message boards : Number crunching : Computation error on network loss (Message 25699)
Posted 24 Aug 2013 by jelle
Post:
Interesting. I didn't know Linux gave a speed gain. That is some consolation for the occasional loss. However, it may still be good to figure out how Einstein manages to avoid the same problems.
31) Message boards : Number crunching : Computation error on network loss (Message 25696)
Posted 24 Aug 2013 by jelle
Post:
Any suggestions on how to avoid computation errors as a result of network hiccups?

I just lost 28 hours of crunching on a valiantly working Intel Atom-powered netbook because our Telecom provider had some hiccups. I realize this is a known problem with LHC, but my Einstein tasks (which also run for a long time) are not bothered by a temporary loss of connection. Is there any way to solve this for LHC as well?
32) Message boards : Number crunching : Computation errors (Message 25586)
Posted 16 May 2013 by jelle
Post:
Thank you for your reaction. You have deduced correctly that I am running Linux. I use Xubuntu 12.04 and 12.10.

I am familiar with computation errors as a result of losing the network connection. That happens from time to time.

I was not familiar with the second reason you mention. Because of the sudden LHC workload I increased the number of CPUs that BOINC can use. With hyper-threading on Intel CPUs that means the percentage of logical CPUs for BOINC exceeded the physical CPUs (which I usually avoid). It sounds plausible that could increase the number of computation errors if BOINC is sensitive to that. I have now cranked down the CPU use percentage again, so let's see if the problem goes away.
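
The adjustment described here comes down to simple arithmetic: pick a CPU percentage that keeps the number of busy logical CPUs at or below the number of physical cores. A minimal sketch, assuming the BOINC CPU percentage setting is applied to the logical CPU count; the function name and the core counts are hypothetical.

```python
def max_cpu_percent(physical_cores, logical_cpus):
    """Largest CPU percentage that keeps the number of busy logical CPUs within the physical core count."""
    if physical_cores <= 0 or logical_cpus < physical_cores:
        raise ValueError("expected 0 < physical_cores <= logical_cpus")
    return 100.0 * physical_cores / logical_cpus
```

For example, a hyper-threaded machine with 4 physical cores and 8 logical CPUs gives max_cpu_percent(4, 8) = 50.0.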

Thanks for the advice.
33) Message boards : Number crunching : Computation errors (Message 25584)
Posted 14 May 2013 by jelle
Post:
I am getting an unusually high number of computation errors with recent tasks, on all 3 machines that I use for SixTrack. The message on the task page is:

Stderr output

<core_client_version>7.0.65</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>

</stderr_txt>
]]>


I have no clue what that means. Any suggestions?
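
For reference, signal 11 on Linux is SIGSEGV (a segmentation fault), meaning the process was terminated after an invalid memory access. A signal number can be translated into its name with Python's standard signal module; the helper name below is hypothetical, and the numbering is that of the platform the code runs on.

```python
import signal

def signal_name(number):
    """Translate a signal number into its symbolic name on the current platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "unknown"
```

This only names the signal; it does not say why the task crashed.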
34) Message boards : Number crunching : Long WU's (Message 24627)
Posted 17 Aug 2012 by jelle
Post:
.... If I'm better off aborting the task and starting on something else let me know. I don't like wasting CPU time, even if it's a slow CPU.

Because there are already two validated tasks for that workunit, the quorum is already complete and you should immediately abort your now unnecessary copy.

If it could be completed before deadline, it could also receive credit. Once the deadline passes, you will not receive credit so (from what you say) you should abort it immediately and stop wasting time.


You're right. I'm just seeing it listed as an Error result because it timed out. I'm not at home now, so it may still be crunching away. I'll abort it when I get home. Sad for the wasted week of computing time. Older, yes; wiser, maybe.
35) Message boards : Number crunching : Long WU's (Message 24613)
Posted 16 Aug 2012 by jelle
Post:
Thanks. I will let it run a bit more and see what happens. It's this task:
http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=5623859

It has the misfortune of running on my slowest machine, an Atom-powered netbook. I see 2 other people completed the WU for that task already, so I don't know if I will get any credit at all. Bummer if I don't, but fortunately that is not my motivation.

Looking just now, the task has run for almost 142 hours, reports that it has another 114 hours to go, and has only 18 hours until the deadline.

Previously I aborted several tasks that had extremely long running times, but then I noticed that after a very slow start they tended to accelerate and complete in a decent time anyway. That is why I let this one run on too, in the hope it would follow the same pattern. Unfortunately, it is accelerating, but only very, very slowly.

Let's see what happens. If I'm better off aborting the task and starting on something else let me know. I don't like wasting CPU time, even if it's a slow CPU.
36) Message boards : Number crunching : Long WU's (Message 24601)
Posted 15 Aug 2012 by jelle
Post:
So what happens when a task does run beyond its deadline? Apologies if this has been asked before.

I have a task that is only 45% completed and close to its deadline. Should I abort it now and stop wasting time or should I let it run?

I don't care about credits very much, although more credits are always better.
37) Message boards : Number crunching : Faulty Computers or Modified BOINC ?? Huge Credits (Message 23744)
Posted 24 Nov 2011 by jelle
Post:
The erroneous credit should be corrected as soon as possible but we should make sure malicious cheating was done before accusing anyone of cheating. (I'm from Canada where we believe in innocent until proven guilty.)


I agree that erroneous credit should be deleted as soon as possible.

I have little hesitation, however, in proposing drastic action against potential cheats. There is no need to comply with legal procedures or presumptions. I think project admins should be able to take any action they feel is in the best interest of the project; and they don't even have to be nice about it if they don't want to.

If that means an occasional innocent gets penalised then that is regrettable, but if said innocent cares about the science it should not matter. He/She can just create a new user-id and contribute again. There are no real-world consequences or harms inflicted.

I do think the potential presence of cheats fouls up the eco-system and could discourage non-cheats from contributing more. Any deterrent that prevents that, or reduces the risk, is a good thing.




©2024 CERN