Message boards : Number crunching : Machine with a 40 Day Cache that doesn't Timeout?????
Gary Roberts
Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 16236 - Posted: 6 Feb 2007, 6:06:02 UTC
Last modified: 6 Feb 2007, 7:00:57 UTC

If you take a look at the results list for this machine, you will see something startling. It's a dual processor Opteron 240 and is part of a group of machines owned by this user.

There are a total of 1091 results listed for the machine, of which around 500 were issued between about 1:00 and 2:00 UTC on 29-Dec-2006. According to the results list, each result was taking close to 14,000 secs (around 3.8 hours) to complete. Taking into account the 2 processors, 500 results represents about 40 days of work per processor (250 × 3.8 / 24 = 39.58 days).

That in itself represents quite a feat. How do you convince a server to send you 40 days of work when the deadline is usually around 6-7 days? I guess the answer is to convince the server that you can do the work 10 times faster than you actually can, and then get it to send you 4 days of (estimated) work in one big hit before a result gets crunched and BOINC finds out you were lying :).
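
To put numbers on that, here is a rough sketch (a hypothetical helper, not actual BOINC code) of how a 10-times-too-fast estimate inflates a 4-day cache request:

```python
# Rough sketch (not actual BOINC code): the client asks for enough
# results to fill its cache, based on the *estimated* runtime of each.
def results_requested(cache_days: float, est_runtime_secs: float) -> int:
    """Results needed per processor to fill the cache."""
    return int(cache_days * 86_400 / est_runtime_secs)

true_runtime = 14_000                           # secs, from the results list
print(results_requested(4, true_runtime))       # honest estimate: 24
print(results_requested(4, true_runtime / 10))  # 10x-too-fast estimate: 246
# 246 per CPU x 2 CPUs is roughly the ~500 results issued on 29-Dec, and
# 246 x 14,000 real secs / 86,400 is roughly 40 days of real work per CPU.
```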

Normally this sort of behaviour is quite futile, as the server will cut off all the outstanding work after the deadline expires, and all you will succeed in achieving is 33 days of wasted work that might have to be sent out again if a quorum hasn't been formed. That's correct, isn't it?? So how come this machine was still returning work and being granted credit a full 4 weeks after the work was issued??

Or am I missing something here???


Cheers,
Gary.
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 16239 - Posted: 6 Feb 2007, 7:03:35 UTC - in response to Message 16236.  

...
Normally this sort of behaviour is quite futile, as the server will cut off all the outstanding work after the deadline expires, and all you will succeed in achieving is 33 days of wasted work that might have to be sent out again if a quorum hasn't been formed. That's correct, isn't it??



[quibble]
You might in theory get credit if you get the work back before the replacement work is returned. For this to happen after a month, 3 or 4 other machines would need to have been issued the work in succession without returning it. The reason this is allowed is that the user gets credit for a late result if it made a practical difference to the project.
[/quibble]

However, that is not what is happening here. Looking at just this one WU as an example: four results went back by 1st Jan, and SoulFly's result went back on the 25th.

It seems to me that they have switched off the test for a result being past the deadline.

I think I know why - there was a bug in that test, and around Oct? Nov? 2006 people were complaining that work sent back after quorum was being scored zero for allegedly being late even when it was not. Looks like they either made a mistake fixing that bug or, more likely, just turned the test off entirely, not having time to look at it properly. If I am right, then chalk up another low priority task for our new admins.
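
To make that concrete, here is a minimal sketch (hypothetical names, not the actual server source) of such a test, including the exception from my quibble above; the behaviour in this thread is what you would get if the late branch never fired:

```python
# Hedged sketch, not the actual server code: a report-time deadline
# test, with the exception that a late result still earns credit if
# the workunit is still open (no replacement has validated yet).
from dataclasses import dataclass

@dataclass
class Result:
    deadline: float              # epoch secs

@dataclass
class Workunit:
    has_validated_quorum: bool

def grant_credit(result: Result, wu: Workunit, now: float) -> bool:
    if now <= result.deadline:
        return True              # on time: normal validation path
    if not wu.has_validated_quorum:
        return True              # late, but made a practical difference
    return False                 # late and redundant: score zero

# The results here behave as if grant_credit() always returned True
# for late results, i.e. the deadline branch has been switched off.
```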

Good call, Gary - and it's good to be seeing your posts again.

River~~
(you may remember me as Gravywavy on Einstein)
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 16241 - Posted: 6 Feb 2007, 8:44:54 UTC

ps (too late to edit)

It is also worth saying that, in my opinion, this was almost certainly not a deliberate exploit. For one thing, only one of this user's boxes seems to have been affected, and some of his other boxes have got and returned reasonable amounts of LHC work since.

The run length of work on this project is only loosely connected with the estimates from the server - my boxes tend to have DCFs (duration correction factors) of around 1.7 to 1.9 most of the time, but sometimes there is a run of short-running work and every so often one box will slip to a figure well under 1.

This means that it will pick up more than my intended cache size next time it fills up, as the chances are the next work will put it back to a DCF of 1.7.
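
As a rough model of that drift (the raise-fast, ease-down-slowly rule is my assumption about the client, not code lifted from the BOINC source):

```python
# Hedged model: assume the client raises the DCF immediately when a job
# overruns its estimate, and eases it down slowly when jobs finish early.
def update_dcf(dcf: float, actual_secs: float, estimated_secs: float) -> float:
    ratio = actual_secs / estimated_secs
    if ratio > dcf:
        return ratio                # overrun: correct upwards at once
    return 0.9 * dcf + 0.1 * ratio  # short job: drift down gradually

dcf = 1.8
for _ in range(30):                 # a run of very short-running work...
    dcf = update_dcf(dcf, actual_secs=60, estimated_secs=10_000)
print(round(dcf, 2))                # ~0.08: well under 1, so the next
                                    # cache fill grabs far too much work
```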

R~~
Gary Roberts
Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 16244 - Posted: 6 Feb 2007, 10:07:14 UTC - in response to Message 16239.  


It seems to me that they have switched off the test for a result being past the deadline.


I think this is precisely the case. I've examined some work that one of my machines received on 10-Jan, which had been resent because a quorum was not completed for some 03-Jan work. Three of the machines in the original 03-Jan quorum did not return the work by the deadline, so three followup results were sent out on 10-Jan. Those three extra results were returned almost immediately, and then on 12-Jan one of the original three defaulters returned its result and was awarded credit. I've seen this sort of thing several times now, and I haven't been able to find a single example of late-returned work being denied credit.

This needs to be given a high priority for fixing as it's otherwise going to be exploited mercilessly and will make the orderly distribution of new work a total shambles.

If I am right, then chalk up another low priority task for our new admins.


I reckon you are right but make that high priority rather than low.

(you may remember me as Gravywavy on Einstein)


Of course I know exactly who you are :). I lurk on these boards occasionally and will sometimes post if I get sufficiently stirred up about something :). I certainly read everything you post if I happen to be around at the time because I know it will be worth reading :).



Cheers,
Gary.
Gary Roberts
Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 16246 - Posted: 6 Feb 2007, 11:34:57 UTC - in response to Message 16241.  

It is also worth saying that, in my opinion, this was almost certainly not a deliberate exploit.


I don't really care if it was deliberate or not and I'm certainly not making any accusations on that score. My interest is really more in identifying deficiencies in the work distribution process that need to be rectified.

For one thing, only one of this user's boxes seems to have been affected...


Actually, that's not correct if you look closely. I've found examples of post-deadline work being granted credit on each of the five most recently communicating boxes in the list. This one is particularly interesting. It received new work on 31-Dec, 03-Jan, 04-Jan and 08-Jan. All of the January work, with the exception of two WUs, was completed and returned between 07-Jan and 11-Jan. The two exceptions, both 08-Jan WUs, were returned on 15-Jan. The interesting bit is that all the 31-Dec WUs (34 in total) were returned between 11-Jan and 14-Jan, i.e. after most of the January work and way after the deadline!! Surely the machine should have gone into EDF mode and completed all the older work first?? On 07-Jan, when the deadline had passed for the 31-Dec work, how was the box able to get further new work on 08-Jan with all the old expired work still hanging around??

Maybe there is a rational explanation, but as I understand it BOINC just shouldn't allow this to happen. It should have refused to download more work on 08-Jan.
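
For what it's worth, here is a minimal sketch of the behaviour I'd expect (simplified, with hypothetical names; the real BOINC scheduler is more involved):

```python
# Hedged sketch: in EDF mode the earliest deadline runs first, and no
# new work should be fetched while the queue cannot meet its deadlines.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    deadline: float              # secs from now
    remaining: float             # estimated CPU secs left

def edf_order(tasks):
    """Earliest-deadline-first: the oldest deadline runs first."""
    return sorted(tasks, key=lambda t: t.deadline)

def ok_to_fetch_work(tasks, ncpus: int) -> bool:
    """Refuse downloads while any queued task would miss its deadline."""
    finish = 0.0
    for t in edf_order(tasks):
        finish += t.remaining / ncpus   # crude model of parallel progress
        if finish > t.deadline:
            return False                # overcommitted: no new work
    return True

# A box still holding expired 31-Dec work on 08-Jan should fail this
# test and therefore not have been sent anything new.
```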

The run length of work on this project is only loosely connected ....


I'm fully aware of all this, and whilst you do get variations of the sort you describe, I've never seen those variations be significant enough to create a 40-day cache!! :). In fact, the normal sprinkling of short-running results which don't run the full distance means that work is often completed in a significantly shorter time than the cache size would suggest.

Hopefully, these sorts of oddities will be eliminated when the admins get a more recent version of the BOINC server software up and running and properly debugged.


Cheers,
Gary.
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 16253 - Posted: 6 Feb 2007, 16:45:43 UTC - in response to Message 16244.  


If I am right, then chalk up another low priority task for our new admins.


I reckon you are right but make that high priority rather than low.


Well, I reckon it's lower than getting the code working on the new Debian servers (i.e. this bug waits until the code is working at least as well as it did on the CERN servers),

lower than getting the ghost-host issue sorted, as that is creating problems with the size of the database,

lower than getting some kind of issue limit working, since if people could only get a limited number of WUs, the damage done by all kinds of abuse would be reduced, rather than just a single kind being fixed,

lower than export of stats, because so many people *really* want that,

and arguably even lower than getting a second app going here (as then there will be plenty of work for everyone and people will not be tempted to exploit loopholes).

I would change my mind if there was evidence of folk exploiting it en masse and not exploiting other loopholes.

btw, I feel the need for a quick disclaimer:

I am only expressing a view, of course, and so is Gary. Neither of us is going to be offended if N & A take a different view on the priorities. I've made my points here (and perhaps Gary will refute them all) in the hope that my thoughts will be helpful to the new admins, not to tell them their job.

R~~
Neasan
Volunteer moderator
Volunteer tester
Joined: 30 Nov 06
Posts: 234
Credit: 11,078
RAC: 0
Message 16257 - Posted: 6 Feb 2007, 16:58:48 UTC

I disagree with every point made by everyone ever.

Actually, I don't know where this rates priority-wise, BUT it is good to be told that it is an issue. Sadly we can't be everywhere all the time, and it is great that the community here is on the watch for these types of things.

In the wider view, tackling fairness and equality among users is an issue we are very aware of. No one likes getting passed over, especially when it appears the users sucking up the jobs can't actually crunch them all within the deadline. That benefits no one, not even the user who is sucking them all in. How to solve this isn't a simple problem either, as you may know, and hopefully it will be tackled very quickly once we have the service fully under control.
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 16262 - Posted: 6 Feb 2007, 18:54:43 UTC - in response to Message 16257.  

How to solve this isn't a simple problem either as you may know and hopefully it will be tackled very quickly once we have the service fully under control.


I'd point you to the suggestions made in the fairer distribuition [sic] of work thread, including some of mine and some of John Keck's. Either set of suggestions would (in my partisan opinion) do more to even things out than sorting out this particular bug, because they would spread the work out more fairly among the users asking for it. Somebody deliberately trying to get a week's work in one go would then not be able to, let alone 40 days' worth.

But I am sure you will consider all the suggestions, implement some of them, and implement some even better ideas of your own... Good luck, and you have my full support even if you don't accept any of my points.

R~~
Henry Nebrensky
Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 209
Message 16263 - Posted: 6 Feb 2007, 19:03:20 UTC - in response to Message 16244.  
Last modified: 6 Feb 2007, 19:09:26 UTC

This needs to be given a high priority for fixing as it's otherwise going to be exploited mercilessly and will make the orderly distribution of new work a total shambles.


It might not be that bad since, normally, quorate WUs are purged from the database soon after the quorum completes. Any results coming in after that fail to validate, so they get no credit anyway.

Of course that assumes that the DB starts getting cleared again...
(and it would be nice if that only happened AFTER the deadline has expired! On at least one occasion I've seen the WU purged soon after quorum but before the deadline had passed, so that when my slow machine finished in time, a week after everybody else, it couldn't get any credit :()
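
A minimal sketch of the purge rule I'm asking for (hypothetical names, not the actual db_purge logic):

```python
# Hedged sketch: retire a workunit only when it is quorate AND every
# outstanding deadline has passed, so a slow-but-in-time host can
# still have its result validated.
from dataclasses import dataclass

@dataclass
class Result:
    deadline: float              # epoch secs

@dataclass
class Workunit:
    has_validated_quorum: bool
    outstanding: list            # results issued but not yet reported

def safe_to_purge(wu: Workunit, now: float) -> bool:
    return wu.has_validated_quorum and all(
        r.deadline < now for r in wu.outstanding
    )
```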

Henry
(Must be a slower typist than River!)
Gary Roberts
Joined: 22 Jul 05
Posts: 72
Credit: 3,962,626
RAC: 0
Message 16270 - Posted: 7 Feb 2007, 3:16:18 UTC - in response to Message 16253.  


I am only expressing a view, of course, and so is Gary. Neither of us is going to be offended if N & A take a different view on the priorities.


Exactly!!

I've made my points here (and perhaps Gary will refute them all) in the hope that my thoughts will be helpful to the new admins, not to tell them their job.

R~~


Nope.... No refutation from me. Just general agreement.

I work on the principle that most people interpret "low priority" to mean "never", or perhaps a month this side of eternity in the overall scheme of things. On the other hand, "high priority" means "we'll get to it when we can", and that will probably be sometime before the project has completely finished :).

Another way that you could look at it is that "high priority" means that we have actually recorded it on the "ToDo List" - somewhere :). You can figure out for yourself where "low priority" stuff gets put :).

Joking aside, I'm just happy to have the problem somewhere "on the Radar".


Cheers,
Gary.
Daxa
Joined: 29 Dec 06
Posts: 100
Credit: 184,937
RAC: 0
Message 16271 - Posted: 7 Feb 2007, 4:19:11 UTC
Last modified: 7 Feb 2007, 5:08:13 UTC

This is a bit of an anachronism (and a blatant redundancy), but in terms of "fairer distribution of work", one should seriously consider participating in a BOINC project called XtremLab.

It actually studies WU allotment and overall grid efficiency. It has the potential to fix a LOT of the work distribution problems in most, if not all, BOINC-based projects.

Check out this link for more info. Account creation is not as straightforward as with other projects, but it's not difficult to get started (easier than World Community Grid). I truly believe this is a valuable project for the future of Distributed Computing.

There may be a for-profit element hidden in this project, but I researched it for about 2 hours and didn't find anything suspicious. I encourage others to do their own research and POST if they find anything.



_______

"Three quarks for Muster Mark!"
    - James Joyce, Finnegans Wake

The Gas Giant
Joined: 2 Sep 04
Posts: 309
Credit: 715,258
RAC: 0
Message 16277 - Posted: 8 Feb 2007, 10:24:28 UTC

It is very easy for someone to alter the DCF in the client_state.xml file so that BOINC thinks each WU will take only minutes rather than hours, which results in many, many WUs being downloaded. The DCF will be reset to the correct figure after the first WU is completed.
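
For illustration only (the tag name and file layout are assumed from common BOINC installs), the value can at least be inspected with a few lines of Python:

```python
# The per-project DCF lives in client_state.xml in the BOINC data
# directory, in a <duration_correction_factor> element (assumed layout).
import xml.etree.ElementTree as ET

tree = ET.parse("client_state.xml")          # run from the BOINC data dir
for proj in tree.getroot().iter("project"):
    print(proj.findtext("project_name"),
          proj.findtext("duration_correction_factor"))
```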

Regards to all (even if I'm not crunching LHC anymore)

Paul.
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 16305 - Posted: 12 Feb 2007, 18:32:25 UTC - in response to Message 16277.  

It is very easy for someone to alter the DCF in the client_state.xml file so that BOINC thinks each WU will take only minutes rather than hours, which results in many, many WUs being downloaded. The DCF will be reset to the correct figure after the first WU is completed.

Regards to all (even if I'm not crunching LHC anymore)

Paul.


I wish you hadn't said that.

In general, if someone is going to cheat like mad at least let them do the intellectual bit for themselves...

The same effect can happen naturally if you get a run of very short jobs - which I always think is delayed karma for the disappointment of downloading 8 hours' work and seeing it all complete in a few seconds. I agree with Gary that getting 40 days' worth would be a bit over the top for that mechanism; I've had 2x or 3x my cache through the workings of this karma, but I don't think I have seen more than that.

R~~