Message boards :
Number crunching :
Can anyone explain ......
Author | Message |
---|---|
Send message Joined: 15 Jul 05 Posts: 9 Credit: 770,253 RAC: 0 |
...... why this system seems to be successfully processing units within one minute, while other people are taking several hours? I say "seems to be successful" as the claimed and awarded credit is very low; also, quite a few of the results have been declared invalid (and quite a few of the non-invalid ones seem to be invalid as well, or am I mis-reading things?). If it's working fine, I don't have any issues; but I don't think it is, and it's grabbing a hellova lot of units that it seems to be screwing up. If I'm wrong, please let me know where I'm mis-interpreting the results :) Mike |
Send message Joined: 14 Jul 05 Posts: 275 Credit: 49,291 RAC: 0 |
The stderr output from those results has the debugger information from the *application crash*. Those results crashed shortly after starting. |
Send message Joined: 15 Jul 05 Posts: 9 Credit: 770,253 RAC: 0 |
Thought that was the case; thanks for confirming :) Shame we can't get this system excluded, seeing as it's screwing up so badly; it'd mean more units for the rest of us as well ;) |
Send message Joined: 14 Jul 05 Posts: 275 Credit: 49,291 RAC: 0 |
> Thought that was the case; thanks for confirming :) Strange that with those errors there, it's still not recognized as "Computing error" by the client. The validator would get rid of those right away. |
Send message Joined: 26 Aug 05 Posts: 18 Credit: 37,965 RAC: 0 |
> Thought that was the case; thanks for confirming :) I think maybe this William person is using the anon. platform mechanism to force the lhc workunits to be run by an optimized seti app. Most likely they were trying to use an optimized app for seti, and just screwed up. Who knows. |
Send message Joined: 14 Jul 05 Posts: 275 Credit: 49,291 RAC: 0 |
> I think maybe this William person is using the anon. platform mechanism to force the lhc workunits to be run by an optimized seti app. Most likely they were trying to use an optimized app for seti, and just screwed up. Who knows. Using a SETI optimized app to run LHC?! Oh my god... Using optimized apps should be more difficult, so people actually need a higher IQ and/or knowledge to do it. |
Send message Joined: 26 Aug 05 Posts: 18 Credit: 37,965 RAC: 0 |
Host was working in February; that's a long time to not notice the error. One good thing is that since they don't validate, William gets no credit for the results. -a. |
Send message Joined: 14 Jul 05 Posts: 275 Credit: 49,291 RAC: 0 |
> Host was working in February; that's a long time to not notice the error. Oh sorry, I thought they had been validated and granted credit. Found one finished, granted 0 credit (the rest are still pending to be rejected). |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
> I think maybe this William person is using the anon. platform mechanism to force the lhc workunits to be run by an optimized seti app. Most likely they were trying to use an optimized app for seti, and just screwed up. Who knows. >shudder< But actually, on reflection I think it is more likely that they are using a non-standard client (as well as a non-standard app). It is the client that should report the work as "Client error" rather than success. I would hope that the SETI app is sufficiently well-written not to claim a success when the input files are missing/wrong format/etc. Optimisation, when it goes wrong, can certainly produce misleading answers to calculations, but should not alter basic values like return codes. My best guess is that there is a re-coded client behind these oddities, not just an overzealous set of optimisation settings. I only claim this as a guess tho... R~~ |
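The point about return codes is the crux of River~~'s argument: however badly an app is optimised, a crash still produces a nonzero exit status, and it is the client's job to report that. A toy sketch of that classification step (hypothetical code, not the actual BOINC client) could look like this:

```python
import subprocess
import sys

def report_outcome(cmd):
    """Toy sketch (not actual BOINC client code): a task whose
    process exits nonzero should be reported as a client/compute
    error, never as "Success", regardless of how the app was built
    or optimised."""
    proc = subprocess.run(cmd)
    return "Success" if proc.returncode == 0 else "Client error"

# Hypothetical example: a "science app" that crashes on startup
# must come back as an error, not as a reportable result.
print(report_outcome([sys.executable, "-c", "raise SystemExit(1)"]))
```

A re-coded client that skips this check (or reads the wrong slot's status) would explain exactly the symptom in this thread: crashed tasks reported as "Success".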
Send message Joined: 14 Jul 05 Posts: 275 Credit: 49,291 RAC: 0 |
> I think maybe this William person is using the anon. platform mechanism to force the lhc workunits to be run by an optimized seti app. Most likely they were trying to use an optimized app for seti, and just screwed up. Who knows. Then it could be the client mixing up slots. I see setiathome named in the stderr text from all results on that host. |
Send message Joined: 6 Jul 06 Posts: 108 Credit: 663,175 RAC: 0 |
I was about to report this host when I found this thread. One problem that comes out of this is that, since it is usually the first host to report its 'results', the lower claim of the next 2 results that get reported will be the credit that everyone gets. It screws up the credit granted to a lot of other people/hosts. |
Send message Joined: 16 Dec 05 Posts: 18 Credit: 1,525,497 RAC: 1 |
Here's another one, user 2860753. His results aren't affecting anyone's. |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
> I was about to report this host when I found this thread. It would if the validator accepted the result. Happily it does not - look at any of these results and the validate state is either Initial (not looked at yet) or Invalid (looked at and rejected). If you see an example where the duff result shows validate status "Valid" please post a link to the result here. The one I looked at was "Invalid", which is what it should be. You are right that the validator will run when it gets the first three back, but after it rejects this result, it waits for the fourth to come in and tries again. The granted credit should be the median (middle value) of the claims from the first three results that are both "Success" and "Valid". If you see an example where this has not happened please post a link to that wu, but the ones I have looked at seem ok. There is an issue in that the result should not be returned as a "Success", but the science and the credit are both correctly protected by the validator, at least in the cases I looked at. Hope that reassures. EDIT: What could be a problem is that the quota is not reduced for an invalid result, so these rogue machines can get through up to 500 WU every day, without being quota chopped the way they would be if the results were returned as "Error". River~~ |
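River~~ describes the grant as the median of the claims from the valid "Success" results in the quorum. A minimal sketch of that rule (a toy illustration, not BOINC's actual validator code; the claim values are hypothetical):

```python
from statistics import median

def granted_credit(valid_claims):
    """Toy sketch of the rule described above: once the quorum of
    valid "Success" results is in, every host in the quorum is
    granted the median of the valid claims, so a single absurd
    claim cannot set the grant.  A quorum of 3 is assumed here."""
    if len(valid_claims) < 3:
        raise ValueError("quorum not yet reached")
    return median(valid_claims)

# Hypothetical claims from three valid results; a rogue result
# claiming ~0 credit was already rejected by the validator, so it
# never enters this list:
print(granted_credit([41.2, 39.8, 43.5]))   # middle value: 41.2
```

This is why the rogue host's near-zero claims only hurt other people's credit if the validator wrongly marks the rogue result "Valid", which is the scenario the thread is probing for.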
Send message Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0 |
Is there not also a possible problem that the max errors per work unit could be reached, thereby invalidating all results for that WU? I suppose that it is a very unlikely problem, given the high number of errors allowed relative to the quorum size: max # of error/total/success results = 10, 20, 10. |
Send message Joined: 26 Aug 05 Posts: 18 Credit: 37,965 RAC: 0 |
> If you see an example where the duff result shows validate status "Valid" please post a link to the result here. This unit shows an invalid result marked valid and granted credit. This is the same host that started this thread, ID 88058. I suspect that what happened here still doesn't affect the science. -a. |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
As you say, a possible problem, and it is a very good point. I think that you are too confident about how likely or unlikely it is -- this may depend critically on the number of rogue machines. With a dozen rogue machines we'd be seeing a double infection (ie two different rogue machines) in about 1/8th of all the infected WU(*). We don't see that at present - I have seen a lot of WU that have been affected, but so far none that has been affected by two or more of these rogues. I conclude that we have less than 12 of these rogues at present. Notice that a double infection would not be enough to cause a problem for us - we are allowed 10 success results, and need only 3; so it would actually take 8 invalid successes to kill a WU. With exactly 8 rogue machines, all of them are needed to kill a WU, and it is unlikely that they will all come together. So how many rogues are needed before they become dangerous? My guess is 1.5x to 2.5x the bare minimum, ie 12 to 20 machines would be the danger point (+) So I think we are safe at the moment, but I also think it would only need a few more rogue machines to really cause trouble. River~~ ______________ (*) based on the observation that the rogue boxes seem to take around 100-200 results when there is work, and that 12 would therefore take 1200-2400 results. In the last release of work we had about 8000 WU (40,000 results). Say 1 in 8 wu contains an affected result; of these around 1 in 8 would contain more than one affected result (affected by different machines, of course). ______________ (+) guess, 12 to 16 rogue machines. With 12 rogue machines and 8 of them needed to kill the WU, we have 495 ways of choosing which 8 machines will conspire together to do the deed. With 16 rogue machines and only 8 needed, we have 12,870 ways to kill the WU. With 20 rogue machines and still only 8 needed to kill a WU, we have 125,970 ways of choosing the 8 machines.
It is this so-called "combinatorial explosion" that turns a small difference in the number of machines into a vastly increased chance of doing the damage. My stats are too rusty to do the exact sums, and we do not have the exact probabilities anyway, but my intuition is that somewhere between 1.5x and 2x the minimum is the danger point, so I am guessing that we are only safe with less than 12 rogue machines. |
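The binomial counts in River~~'s footnote can be checked with a short script (a quick sanity check, not part of the thread's argument): the number of ways to pick the 8 rogue hosts, out of n rogues in total, whose invalid "successes" would together exhaust a workunit's 10 allowed success results.

```python
from math import comb

# C(n, 8): ways to choose the 8 conspiring rogue hosts out of n.
# The explosion from 495 to 125,970 as n goes from 12 to 20 is the
# "combinatorial explosion" referred to above.
for n in (12, 16, 20):
    print(f"{n} rogues: {comb(n, 8)} ways to kill a WU")
```

The exact figures are 495, 12,870, and 125,970, confirming the numbers quoted in the footnote (the last is often rounded to 126,000).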
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
> If you see an example where the duff result shows validate status "Valid" please post a link to the result here. Thanks for this. I agree, as it validated it is hard to see how the science can be affected. The science results from this task must match those of the other results by LHC's extremely tight standards. I take it you mean that this result from that unit, which certainly has some debugging info in it, and some of it is similar to what we saw in the other rogue result - for example the same references to SETI! On the other hand, the runtime and credit claim here are plausible, where they were not on the previous rogue result. So my assumption at the moment is that the validator (which, by the way, is more fussy here than on other projects) is probably right to accept this result, and that the client sent back the right science files. If so, it must have crunched the work. At the same time it is certainly worrying that the client has sent back the wrong stderr.txt with these files. This supports PovAddict's suggestion that the client is getting its slots muddled up. Do please post other possible bad but Valid results - especially if the run time of a "Valid" result is very much lower than the other results in the same WU. River~~ |
Send message Joined: 13 Jul 05 Posts: 169 Credit: 15,000,737 RAC: 0 |
> The stderr output from those results has the debugger information from the *application crash*. Umm... interesting: *** Foreground Window Data *** Window Name : Flight Following - Microsoft Internet Explorer Window Class : IEFrame So now we know what website the owner was looking at on 15/11? Henry |
Send message Joined: 14 Jul 05 Posts: 275 Credit: 49,291 RAC: 0 |
> The stderr output from those results has the debugger information from the *application crash*. Yes we do. See if you find anything interesting, like somebody watching pr0n at the moment sixtrack crashed (!). Why the debugger needs to get Foreground Window Data, I don't know. |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
> ... Why the debugger needs to get Foreground Window Data, I don't know. To notice if many crashes are connected with one particular piece of software, perhaps? But that seems like a long shot to me. River~~ |
©2025 CERN