Message boards : Number crunching : Computation error on WU

Harri Liljeroos
Joined: 28 Sep 04
Posts: 677
Credit: 43,745,870
RAC: 15,287
Message 3986 - Posted: 19 Oct 2004, 7:38:12 UTC

Hi,
here's an extract from my stderr.txt:

2004-10-18 14:47:22 [LHC@home] Unrecoverable error for result v64lhc1000pronine-41s12_14526.73_1_sixvf_2295_1 (There are no child processes to wait for. (0x80) - exit code 128 (0x80))

Does anybody have any idea what this means?

The result uploaded to the server OK, though. The WU calculation took 1h 15 min, whereas they usually take 1h 30 min. I'm using BOINC 4.13, set up with SETI at 66.6% and LHC at 33.3%.
ID: 3986
Toby
Joined: 1 Sep 04
Posts: 137
Credit: 1,691,526
RAC: 12
Message 3989 - Posted: 19 Oct 2004, 8:09:03 UTC

Your computers are hidden, so I can't see what OS you are running. Is it by any chance Linux? I stopped running LHC on my P3-500 running Knoppix because 90% of the work units errored out like that just as they were about to complete. I'm wondering if there is a problem with the Linux client. The same setup is doing SETI@home work units without incident. On the other hand, my other Linux box, which runs Gentoo, does not seem to have the same problems, so I'm not sure what to think. Maybe something about Knoppix gives bad mojo?


--------------------------------------
A member of The Knights Who Say Ni!
My BOINC stats site
ID: 3989
Harri Liljeroos
Joined: 28 Sep 04
Posts: 677
Credit: 43,745,870
RAC: 15,287
Message 3991 - Posted: 19 Oct 2004, 8:49:57 UTC - in response to Message 3989.  

Now my computer should be visible. It is running Windows 2000 SP4. I don't have any Linux systems available.
ID: 3991
Toby
Joined: 1 Sep 04
Posts: 137
Credit: 1,691,526
RAC: 12
Message 3992 - Posted: 19 Oct 2004, 8:58:18 UTC
Last modified: 19 Oct 2004, 8:59:08 UTC

Darn! I was hoping for a trend. Oh well :) I guess my exit codes are different from yours as well. I think I already posted about this in another thread... Ah yes, here it is.

2004-10-15 01:50:45 [LHC@home] Unrecoverable error for result v64lhc1000profour17s12_14518.18_1_sixvf_8194_9 (process exited with code 12 (0xc))

Most of the ones I see right now are code 12. A few are code 240 (0xf0).

Maybe the admins could post a list of error codes and what they mean? That could help track down the problem and show whether it is a configuration/library/hardware problem on the client machine or a bug in the software. Anyone? :)
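
To look for a trend like the one asked about above, the error lines can be tallied by exit code. The following is only a rough sketch: the regular expression is an assumption derived from the handful of log lines quoted in this thread, not from any documented BOINC log format, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed pattern, based only on the log lines quoted in this thread.
# It accepts both "exit code N (0x..)" and "exited with code N (0x..)".
ERROR_RE = re.compile(
    r"Unrecoverable error for result (\S+) \("
    r".*?(?:exit code|exited with code) (-?\d+)\s*\(0x[0-9a-fA-F]+\)"
)


def parse_error_line(line):
    """Return (result_name, exit_code), or None if this is not an error line."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def tally_exit_codes(lines):
    """Count how often each exit code appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_error_line(line)
        if parsed is not None:
            counts[parsed[1]] += 1
    return counts


# Log lines copied from posts in this thread.
SAMPLE_LINES = [
    "2004-10-18 14:47:22 [LHC@home] Unrecoverable error for result "
    "v64lhc1000pronine-41s12_14526.73_1_sixvf_2295_1 (There are no child "
    "processes to wait for. (0x80) - exit code 128 (0x80))",
    "2004-10-15 01:50:45 [LHC@home] Unrecoverable error for result "
    "v64lhc1000profour17s12_14518.18_1_sixvf_8194_9 (process exited with "
    "code 12 (0xc))",
    "LHC@home - 2004-10-29 11:12:58 - Restarting result "
    "v64lhc1000proeleven-55s14_16518.45_1_sixvf_27181_1 using sixtrack "
    "version 4.47",
    "LHC@home - 2004-10-29 12:11:50 - Unrecoverable error for result "
    "v64lhc1000proeleven-55s14_16518.45_1_sixvf_27181_1 ( - exit code -1 "
    "(0xffffffff))",
]

if __name__ == "__main__":
    for code, count in tally_exit_codes(SAMPLE_LINES).most_common():
        print(f"exit code {code}: {count} result(s)")
```

Running it over a full stderr.txt or message log from each machine would show whether one code dominates on a particular host, which is the kind of trend being looked for here.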


--------------------------------------
A member of The Knights Who Say Ni!
My BOINC stats site
ID: 3992
grumpy
Joined: 1 Sep 04
Posts: 57
Credit: 2,834,109
RAC: 57
Message 4616 - Posted: 29 Oct 2004, 16:24:41 UTC

LHC@home - 2004-10-29 11:12:58 - Restarting result v64lhc1000proeleven-55s14_16518.45_1_sixvf_27181_1 using sixtrack version 4.47
LHC@home - 2004-10-29 12:11:50 - Unrecoverable error for result v64lhc1000proeleven-55s14_16518.45_1_sixvf_27181_1 ( - exit code -1 (0xffffffff))
LHC@home - 2004-10-29 12:11:50 - Computation for result v64lhc1000proeleven-55s14_16518.45_1_sixvf_27181 finished

That was a Windows popup error reporting an invalid page fault in sixtrack (Win 98).
ID: 4616
grumpy
Joined: 1 Sep 04
Posts: 57
Credit: 2,834,109
RAC: 57
Message 4657 - Posted: 30 Oct 2004, 16:18:41 UTC

SIXTRACK_4 caused a stack fault in module
SIXTRACK_4.47_WINDOWS_INTELX86.EXE at 0177:00525f4f.
Registers:
EAX=00000008 CS=0177 EIP=00525f4f EFLGS=00010202
EBX=00000000 SS=017f ESP=04452000 EBP=0445201c
ECX=0065cba8 DS=017f ESI=0065d198 FS=38ff
EDX=01c4be98 ES=017f EDI=00000000 GS=0000
Bytes at CS:EIP:
56 57 8b 45 f8 89 65 e8 50 8b 45 fc c7 45 fc ff
Stack dump:
00000000 00000000 00000000 04452050 00525f88 00527f86
005f8808 0445202c 00528033 00000004 00000018 04452060
005256fd 00000004 00000000 0065d198
ID: 4657
Richard Cox
Joined: 23 Oct 04
Posts: 7
Credit: 58,953
RAC: 0
Message 4666 - Posted: 30 Oct 2004, 18:24:13 UTC - in response to Message 3986.  
Last modified: 30 Oct 2004, 18:28:19 UTC

> Hi,
> here's an extract from my stderr.txt:
>
> 2004-10-18 14:47:22 [LHC@home] Unrecoverable error for result
> v64lhc1000pronine-41s12_14526.73_1_sixvf_2295_1 (There are no child processes
> to wait for. (0x80) - exit code 128 (0x80))
>
> Does anybody have any idea what this means?
>
> The result uploaded to the server OK, though. The WU calculation took 1h 15
> min, whereas they usually take 1h 30 min. I'm using BOINC 4.13, set up with
> SETI at 66.6% and LHC at 33.3%.
>
>
Harri, the error codes are reported by the OS, although the application generates them for a variety of reasons, from hardware failure to software bugs. You might find lists on the Microsoft web site; they come in many flavors. Since it was near the end of the calculation, my guess is that it was some I/O error. You seem to have plenty of RAM; what mobo are you using, and what is the chipset?
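
As an illustration of looking such a code up: Windows system error 128 is ERROR_WAIT_NO_CHILD, whose standard message is the "There are no child processes to wait for." text in Harri's log. The sketch below asks the OS for the message when run on Windows and otherwise falls back to a tiny hand-written table. The function name and the table are assumptions for this example, and the table only covers the one code from this thread.

```python
import sys

# Hand-written fallback table (an assumption for this sketch); it lists
# only the Windows system error code that appears in this thread.
KNOWN_SYSTEM_ERRORS = {
    128: "ERROR_WAIT_NO_CHILD: There are no child processes to wait for.",
}


def describe_system_error(code):
    """Return a human-readable description of a Windows system error code."""
    if sys.platform == "win32":
        # ctypes.FormatError is only available on Windows; it returns the
        # OS message for the code in the current display language.
        import ctypes
        return ctypes.FormatError(code)
    return KNOWN_SYSTEM_ERRORS.get(code, "unknown (not in the local table)")


if __name__ == "__main__":
    print(describe_system_error(128))
```

This only explains what the OS message means. It does not say why sixtrack ended that way, which is still the open question in this thread.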
ID: 4666
Richard Cox
Joined: 23 Oct 04
Posts: 7
Credit: 58,953
RAC: 0
Message 4667 - Posted: 30 Oct 2004, 18:29:42 UTC - in response to Message 3992.  
Last modified: 30 Oct 2004, 18:30:37 UTC

duplicate message deleted.
ID: 4667
Richard Cox
Joined: 23 Oct 04
Posts: 7
Credit: 58,953
RAC: 0
Message 4668 - Posted: 30 Oct 2004, 18:29:43 UTC - in response to Message 3992.  
Last modified: 30 Oct 2004, 18:32:28 UTC

> Darn! I was hoping for a trend. Oh well :) I guess my exit codes are
> different from yours as well. I think I already posted about this in another
> thread... Ah yes, here it is.
>
> 2004-10-15 01:50:45 [LHC@home] Unrecoverable error for result
> v64lhc1000profour17s12_14518.18_1_sixvf_8194_9 (process exited with code 12
> (0xc))
>
> Most of the ones I see right now are code 12. A few are code 240 (0xf0).
>
> Maybe the admins could post a list of error codes and what they mean? That
> could help track down the problem and show whether it is a
> configuration/library/hardware problem on the client machine or a bug in
> the software. Anyone? :)
>

> --------------------------------------
> A member of The
> Knights Who Say Ni!

> My BOINC stats site
>
Toby, which of your five computers got this error? There may be a trend if it was the Pentium running Linux; the error may have come from the chip's math processor or from some I/O glitch that it couldn't cope with.
ID: 4668
Toby
Joined: 1 Sep 04
Posts: 137
Credit: 1,691,526
RAC: 12
Message 4669 - Posted: 30 Oct 2004, 18:54:55 UTC

It was the Pentium 3 Linux box running Knoppix. I had the BOINC directory mounted via an SMB share on my Windows box so as not to lose work if I had to reboot the Knoppix box. (It has no hard drive.) I believe I was seeing the same bugs that exist with NFS. See here (http://lhcathome.cern.ch/known_bugs.html) for details. I decided to just take LHC off that one for the time being and put it on SETI, which runs without any problems over a network mount.


--------------------------------------
A member of The Knights Who Say Ni!
My BOINC stats site
ID: 4669
FalconFly
Joined: 2 Sep 04
Posts: 121
Credit: 592,214
RAC: 0
Message 4676 - Posted: 31 Oct 2004, 0:28:30 UTC - in response to Message 4669.  

Ever since BOINC V4.13, I've been seeing an increased number of various "Computing Errors" on random Win32 and Linux boxes (SETI and LHC, but luckily not CPDN).

It seems there are some bugs that need to be fixed.
___________________________________________
Scientific Network : 36200 MHz · 8204 MB · 815.0 GB
ID: 4676
Jason
Joined: 18 Sep 04
Posts: 7
Credit: 13,292
RAC: 0
Message 4700 - Posted: 31 Oct 2004, 12:45:27 UTC - in response to Message 4669.  

I'm running
BOINC 4.13
GenuineIntel 997MHz Pentium
Microsoft Windows XP Professional Edition, Service Pack 2, (05.01.2600.00)

4 of my last 10 WUs came back with a computation error similar to this:

Unrecoverable error for result v64lhc1000prosix-27s10_12551.21_1_sixvf_30583_3 (exit code -1073741819 (0xc0000005))
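
The negative number and the hex value in that line are the same 32-bit quantity, shown once as a signed integer and once as unsigned hex. On Windows, 0xC0000005 is the NTSTATUS value STATUS_ACCESS_VIOLATION, meaning the program touched memory it was not allowed to access. A small sketch of the conversion follows; the helper names and the one-entry lookup table are made up for illustration and are not part of BOINC.

```python
def to_unsigned32(code):
    """Reinterpret a (possibly negative) exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


# Deliberately tiny table for this sketch: only the NTSTATUS value that
# appears in this thread is listed.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe_exit_code(code):
    """Format an exit code as decimal, hex, and a name if the table has one."""
    unsigned = to_unsigned32(code)
    name = KNOWN_NTSTATUS.get(unsigned, "not in this small table")
    return f"{code} = 0x{unsigned:08x} ({name})"


if __name__ == "__main__":
    for code in (-1073741819, -1, 128):
        print(describe_exit_code(code))
```

The same arithmetic explains the "exit code -1 (0xffffffff)" pair reported earlier in the thread: -1 reinterpreted as unsigned 32-bit is 0xffffffff.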

2 of my last 8 SETI workunits generated similar errors. I'm also running the classic version of Climate prediction, and I'm still attached to Predictor, even though they've been out of order for over a month now.

Am I doing more harm than good staying on this project? Should I detach?

Thanks,
Jason


> It was the Pentium 3 Linux box running Knoppix. I had the BOINC directory
> mounted via an SMB share on my Windows box so as not to lose work if I had to
> reboot the Knoppix box. (It has no hard drive.) I believe I was seeing the
> same bugs that exist with NFS. See here (http://lhcathome.cern.ch/known_bugs.html) for details. I
> decided to just take LHC off that one for the time being and put it on SETI,
> which runs without any problems over a network mount.
>

> --------------------------------------
> A member of The
> Knights Who Say Ni!

> My BOINC stats site
>
ID: 4700
grumpy
Joined: 1 Sep 04
Posts: 57
Credit: 2,834,109
RAC: 57
Message 4952 - Posted: 6 Nov 2004, 15:41:04 UTC


I am investigating this problem on my Win 98 machine.
I think I may have found the problem: regional settings, clock, date formats, etc.
(So far so good.)
ID: 4952


©2024 CERN