Message boards : Number crunching : Can anyone explain ......

Mike Dunn
Joined: 15 Jul 05
Posts: 9
Credit: 770,253
RAC: 0
Message 15515 - Posted: 18 Nov 2006, 14:56:54 UTC

...... why this system seems to be successfully processing units within one minute, while other people are taking several hours ?

I say "seems to be successful" as the claimed and awarded credit is very low; also, quite a few of the results have been declared invalid (and quite a few of the non-invalid ones seem to be invalid as well, or am I mis-reading things ?).

If it's working fine, I don't have any issues; but I don't think it is, and it's grabbing a hellova lot of units that it seems to be screwing up. If I'm wrong, please let me know where I'm mis-interpreting the results :)

Mike

PovAddict
Joined: 14 Jul 05
Posts: 275
Credit: 49,291
RAC: 0
Message 15520 - Posted: 18 Nov 2006, 15:46:44 UTC

The stderr output from those results has the debugger information from the *application crash*. Those results crashed shortly after starting.

Mike Dunn
Joined: 15 Jul 05
Posts: 9
Credit: 770,253
RAC: 0
Message 15522 - Posted: 18 Nov 2006, 17:03:26 UTC

Thought that was the case; thanks for confirming :)

Shame we can't get this system excluded, seeing as it's screwing up so badly; be more units for the rest of us as well ;)

PovAddict
Joined: 14 Jul 05
Posts: 275
Credit: 49,291
RAC: 0
Message 15523 - Posted: 18 Nov 2006, 17:39:10 UTC - in response to Message 15522.  

Thought that was the case; thanks for confirming :)

Shame we can't get this system excluded, seeing as it's screwing up so badly; be more units for the rest of us as well ;)

Strange that with those errors there, it's still not recognized as "Computing error" by the client. The validator would get rid of those right away.

uioped1
Joined: 26 Aug 05
Posts: 18
Credit: 37,965
RAC: 0
Message 15534 - Posted: 18 Nov 2006, 19:33:48 UTC - in response to Message 15523.  

Thought that was the case; thanks for confirming :)

Shame we can't get this system excluded, seeing as it's screwing up so badly; be more units for the rest of us as well ;)

Strange that with those errors there, it's still not recognized as "Computing error" by the client. The validator would get rid of those right away.


I think maybe this William person is using the anon. platform mechanism to force the lhc workunits to be run by an optimized seti app. Most likely they were trying to use an optimized app for seti, and just screwed up. Who knows.

PovAddict
Joined: 14 Jul 05
Posts: 275
Credit: 49,291
RAC: 0
Message 15535 - Posted: 18 Nov 2006, 19:41:37 UTC - in response to Message 15534.  

I think maybe this William person is using the anon. platform mechanism to force the lhc workunits to be run by an optimized seti app. Most likely they were trying to use an optimized app for seti, and just screwed up. Who knows.

Using a SETI optimized app to run LHC?! Oh my god... Using optimized apps should be more difficult, so people actually need a higher IQ and/or knowledge to do it.

uioped1
Joined: 26 Aug 05
Posts: 18
Credit: 37,965
RAC: 0
Message 15536 - Posted: 18 Nov 2006, 19:42:29 UTC - in response to Message 15534.  

Host was working in February; that's a long time to not notice the error.
One good thing is that, since they don't validate, William gets no credit for the results.

-a.

PovAddict
Joined: 14 Jul 05
Posts: 275
Credit: 49,291
RAC: 0
Message 15539 - Posted: 18 Nov 2006, 19:45:01 UTC - in response to Message 15536.  

Host was working in February; that's a long time to not notice the error.
one good thing is that since they don't validate, William gets no credit for the results.

-a.

Oh sorry, I thought they had been validated and granted credit. Found one finished, granted 0 credit (the rest are still pending to be rejected).

River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15543 - Posted: 18 Nov 2006, 20:09:56 UTC - in response to Message 15535.  

I think maybe this William person is using the anon. platform mechanism to force the lhc workunits to be run by an optimized seti app. Most likely they were trying to use an optimized app for seti, and just screwed up. Who knows.

Using a SETI optimized app to run LHC?! Oh my god... Using optimized apps should be more difficult, so people actually need a higher IQ and/or knowledge to do it.


>shudder<

But actually, on reflection I think it is more likely that they are using a non-standard client (as well as a non-standard app). It is the client that should report the work as "Client error" rather than success. I would hope that the SETI app is sufficiently well-written not to claim a success when the input files are missing/wrong format/etc.

Optimisation, when it goes wrong, can certainly produce misleading answers to calculations, but should not alter basic values like return codes. My best guess is that there is a re-coded client behind these oddities, not just an overzealous set of optimisation settings. I only claim this as a guess tho...

R~~

PovAddict
Joined: 14 Jul 05
Posts: 275
Credit: 49,291
RAC: 0
Message 15544 - Posted: 18 Nov 2006, 20:12:58 UTC - in response to Message 15543.  

I think maybe this William person is using the anon. platform mechanism to force the lhc workunits to be run by an optimized seti app. Most likely they were trying to use an optimized app for seti, and just screwed up. Who knows.

Using a SETI optimized app to run LHC?! Oh my god... Using optimized apps should be more difficult, so people actually need a higher IQ and/or knowledge to do it.


>shudder<

But actually, on reflection I think it is more likely that they are using a non-standard client (as well as a non-standard app). It is the client that should report the work as "Client error" rather than success. I would hope that the SETI app is sufficiently well-written not to claim a success when the input files are missing/wrong format/etc.

Optimisation, when it goes wrong, can certainly produce misleading answers to calculations, but should not alter basic values like return codes. My best guess is that there is a re-coded client behind these oddities, not just an overzealous set of optimisation settings. I only claim this as a guess tho...

R~~

Then it could be the client mixing up slots. I see setiathome named on stderr text from all results on that host.

Conan
Joined: 6 Jul 06
Posts: 108
Credit: 661,871
RAC: 196
Message 15558 - Posted: 19 Nov 2006, 0:49:17 UTC

> I was about to report this host when I found this thread.
One problem that comes out of this is that, since it is usually the first host to report its 'results', the lower claim of the next 2 results that get reported will be the credit that everyone gets.
It screws up the credit granted to a lot of other people/hosts.

Bird-Dog
Joined: 16 Dec 05
Posts: 18
Credit: 1,523,201
RAC: 0
Message 15572 - Posted: 19 Nov 2006, 12:56:52 UTC
Last modified: 19 Nov 2006, 12:57:40 UTC

Here's another one, user 2860753. His results aren't affecting anyone's.

River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15582 - Posted: 19 Nov 2006, 18:39:31 UTC - in response to Message 15558.  
Last modified: 19 Nov 2006, 18:43:02 UTC

> I was about to report this host when I found this thread.
One problem that comes out of this, is that since it is usually the first host to report its 'results' then the lower claim of the next 2 results that get reported will be the credit that everyone gets.
It screws up the credit granted to a lot of other people/hosts.


It would if the validator accepted the result. Happily it does not - look at any of these results and the validate state is either Initial (not looked at yet) or Invalid (looked at and rejected). If you see an example where the duff result shows validate status "Valid" please post a link to the result here. The one I looked at was "Invalid", which is what it should be.

You are right that the validator will run when it gets the first three back, but after it rejects this result, it waits for the fourth to come in and tries again. The granted credit should be the median (middle value) of the claim from the first three results in that are both "Success" and "Valid". If you see an example where this has not happened please post a link to that wu, but the ones I have looked at seem ok.
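
To make the rule in the paragraph above concrete, here is a minimal sketch of "median claim of the first three results that are both Success and Valid". It is only an illustration of the rule as described in this post, not the project's validator code: the function name, the dictionary keys and the credit numbers are all invented for the example.

```python
from statistics import median

def granted_credit(results, quorum=3):
    """Median claimed credit of the first `quorum` results that are both
    "Success" and "Valid"; None while too few such results are in.
    Field names are hypothetical, not a real BOINC schema."""
    usable = [r["claimed"] for r in results
              if r["outcome"] == "Success" and r["validate"] == "Valid"]
    if len(usable) < quorum:
        return None  # validator has to wait for another result
    return median(usable[:quorum])

# Made-up work unit: the rogue host reports first with a tiny claim,
# but its result is Invalid, so it never enters the median.
results = [
    {"outcome": "Success", "validate": "Invalid", "claimed": 0.02},
    {"outcome": "Success", "validate": "Valid", "claimed": 11.8},
    {"outcome": "Success", "validate": "Valid", "claimed": 14.1},
    {"outcome": "Success", "validate": "Valid", "claimed": 12.5},
]
print(granted_credit(results[:3]))  # None: only two valid results so far
print(granted_credit(results))      # 12.5
```

With only the first three results in, the sketch returns nothing because one of them is invalid; once the fourth arrives the low rogue claim is ignored and the middle of the three valid claims is granted.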

There is an issue in that the result should not be returned as a "Success", but the science and the credit are both correctly protected by the validator, at least in the cases I looked at.

Hope that reassures.

EDIT: What could be a problem is that the quota is not reduced for an invalid result, so these rogue machines can get through up to 500 WU every day, without being quota chopped the way they would be if the results were returned as "Error".

River~~

Philip Martin Kryder
Joined: 21 May 06
Posts: 73
Credit: 8,710
RAC: 0
Message 15586 - Posted: 20 Nov 2006, 0:00:09 UTC - in response to Message 15582.  

....

EDIT: What could be a problem is that the quota is not reduced for an invalid result, so these rogue machines can get through up to 500 WU every day, without being quota chopped the way they would be if the results were returned as "Error".

....


Is there not also a possible problem that the max errors per work unit could be reached, thereby invalidating all results for that WU?

I suppose that it is a very unlikely problem - given the high number of errors allowed relative to the quorum size.

max # of error/total/success results 10, 20, 10

uioped1
Joined: 26 Aug 05
Posts: 18
Credit: 37,965
RAC: 0
Message 15589 - Posted: 20 Nov 2006, 4:22:22 UTC - in response to Message 15582.  

If you see an example where the duff result shows validate status "Valid" please post a link to the result here.

There is an issue in that the result should not be returned as a "Success", but the science and the credit are both correctly protected by the validator, at least in the cases I looked at.

Hope that reassures.

River~~


This unit shows an invalid result marked valid and granted credit.
This is the same host that started this thread, ID 88058.

I suspect that what happened here still doesn't affect the science.
-a.

River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15592 - Posted: 20 Nov 2006, 12:37:31 UTC - in response to Message 15586.  


Is there not also a possible problem that the max errors per work unit could be reached, thereby invalidating all results for that WU.

I suppose that it is a very unlikely problem - given high number of errors allowed relative to the quorum size.

max # of error/total/success results 10, 20, 10


As you say, a possible problem, and it is a very good point.

I think that you are too confident about how likely or unlikely it is -- this may depend critically on the number of rogue machines.

With a dozen rogue machines we'd be seeing a double infection (ie two different rogue machines) in about 1/8th of all the infected WU(*). We don't see that at present - I have seen a lot of WU that have been affected, but so far none that has been affected by two or more of these rogues.

I conclude that we have fewer than 12 of these rogues at present.

Notice that a double infection would not be enough to cause a problem for us - we are allowed 10 success results, and need only 3; so that it would actually take 8 invalid successes to kill a WU.
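
The arithmetic behind that figure of 8 can be written out as a one-line sketch. It assumes, as the paragraph above does, that a WU is lost once the cap on success results no longer leaves room for a full quorum of valid ones; the function name is made up, and this has not been checked against the BOINC server code.

```python
def invalid_successes_to_kill(max_success, quorum):
    """Smallest number of invalid "Success" results after which the
    remaining success slots can no longer hold `quorum` valid results."""
    # After k invalid successes there are max_success - k slots left;
    # the WU is dead once that drops below quorum.
    return max_success - quorum + 1

print(invalid_successes_to_kill(max_success=10, quorum=3))  # 8
```

With 7 invalid successes there are still 3 slots left, which is exactly enough; the 8th one leaves only 2.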

With exactly 8 rogue machines, all of them are needed to kill a WU, and it is unlikely that they will all come together.

So how many rogues are needed before they become dangerous?

My guess is 1.5x to 2.5x the bare minimum, i.e. 12 to 20 machines would be the danger point (+)

So I think we are safe at the moment, but I also think it would only need a few more rogue machines to really cause trouble.

River~~
______________
(*) based on the observation that the rogue boxes seem to take around 100-200 results when there is work, and that 12 would therefore take 1200-2400 results.
In the last release of work we had about 8000 WU (40,000 results). Say 1 in 8 wu contains an affected result; of these around 1 in 8 would contain more than one affected result (affected by different machines, of course).
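
A rough way to check the "1 in 8" figures is a simple binomial sketch. It assumes 5 results per WU (40,000 / 8000) and treats every result as independently having the same chance of landing on a rogue box, which is a simplification: it ignores how the scheduler really hands out work and does not tell one rogue machine from another. The function name is invented for the illustration.

```python
def infection_shares(rogue_results, total_results=40_000, per_wu=5):
    """Return (share of WUs with at least one rogue result,
    share of those affected WUs that have two or more)."""
    p = rogue_results / total_results           # chance one result is rogue
    none = (1 - p) ** per_wu                    # WU untouched
    exactly_one = per_wu * p * (1 - p) ** (per_wu - 1)
    at_least_one = 1 - none
    at_least_two = 1 - none - exactly_one
    return at_least_one, at_least_two / at_least_one

# 12 rogue boxes taking 100-200 results each
for rogue_results in (1200, 2400):
    affected, doubled = infection_shares(rogue_results)
    print(f"{rogue_results} rogue results: {affected:.1%} of WUs affected, "
          f"{doubled:.1%} of those more than once")
```

Under these assumptions somewhere around a seventh to a quarter of WUs are affected, and roughly 6% to 12% of those are hit more than once, which is the same ballpark as the estimates above.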
______________
(+) guess, 12 to 16 rogue machines.

With 12 rogue machines and 8 of them needed to kill the WU, we have 495 ways of choosing which 8 machines will conspire together to do the deed.

With 16 rogue machines and only 8 needed, we have 12,870 ways to kill the WU.

With 20 rogue machines and still only 8 needed to kill a WU, we have 125,970 (roughly 126,000) ways of choosing the 8 machines.

It is this so-called "combinatorial explosion" that makes a small difference in the number of machines into a vastly increased chance to do the damage.
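
The counts above are plain binomial coefficients ("n choose 8"), so they are easy to reproduce with Python's standard library. This only counts the ways of picking 8 machines out of n; it says nothing about how likely any particular set is to land on the same WU.

```python
from math import comb  # available in Python 3.8 and later

NEEDED = 8  # invalid successes needed to kill a WU, per the post above

# Number of distinct sets of 8 machines for each pool of rogue machines
ways = {n: comb(n, NEEDED) for n in (12, 16, 20)}
for n, count in ways.items():
    print(f"{n} rogue machines: {count:,} ways to pick {NEEDED}")
```

The exact count for 20 machines is 125,970, which rounds to the 126,000 quoted above.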

My stats are too rusty to do the exact sums, and we do not have the exact probabilities anyway, but my intuition is that somewhere between 1.5x and 2x the minimum is the danger point, so I am guessing that we are only safe with fewer than 12 rogue machines.

River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15593 - Posted: 20 Nov 2006, 15:26:42 UTC - in response to Message 15589.  

If you see an example where the duff result shows validate status "Valid" please post a link to the result here.

There is an issue in that the result should not be returned as a "Success", but the science and the credit are both correctly protected by the validator, at least in the cases I looked at.

Hope that reassures.

River~~


This unit shows an invalid result marked valid and granted credit.
This is the same host that started this thread, ID 88058.

I suspect that what happened here still doesn't affect the science.
-a.

Thanks for this. I agree, as it validated it is hard to see how the science can be affected. The science results from this task must match those of the other results by LHC's extremely tight standards.

I take it you mean this result from that unit, which certainly has some debugging info in it, some of it similar to what we saw in the other rogue result - for example the same references to SETI!

On the other hand, the runtime and credit claim here are plausible, where they were not on the previous rogue result.

So my assumption at the moment is that the validator (which, by the way, is more fussy here than on other projects) is probably right to accept this result, and that the client sent back the right science files. If so, it must have crunched the work.

At the same time it is certainly worrying that the client has sent back the wrong stderr.txt with these files. This supports PovAddict's suggestion that the client is getting its slots muddled up.

Do please post other possible bad but Valid results - especially if the run time of a "Valid" result is very much lower than the other results in the same WU.

River~~

Henry Nebrensky
Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 15595 - Posted: 20 Nov 2006, 17:59:01 UTC - in response to Message 15520.  

The stderr output from those results have the debugger information from the *application crash*.


Umm... interesting:

*** Foreground Window Data ***
                 Window Name      : Flight Following - Microsoft Internet Explorer
                 Window Class     : IEFrame


So now we know what website the owner was looking at on 15/11?

Henry

PovAddict
Joined: 14 Jul 05
Posts: 275
Credit: 49,291
RAC: 0
Message 15596 - Posted: 20 Nov 2006, 18:07:41 UTC - in response to Message 15595.  

The stderr output from those results have the debugger information from the *application crash*.


Umm... interesting:

*** Foreground Window Data ***
                 Window Name      : Flight Following - Microsoft Internet Explorer
                 Window Class     : IEFrame


So now we know what website the owner was looking at on 15/11?

Henry

Yes we do. See if you find anything interesting, like somebody watching pr0n at the moment sixtrack crashed (!). Why the debugger needs to get Foreground Window Data, I don't know.

River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15600 - Posted: 20 Nov 2006, 20:00:13 UTC - in response to Message 15596.  

... Why the debugger needs to get Foreground Window Data, I don't know.


To notice if many crashes are connected with one particular piece of software, perhaps?

But that seems like a long shot to me.

River~~


©2024 CERN