Message boards : Number crunching : Fairer distribution of work (Flame Fest 2007)
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 14917 - Posted: 1 Oct 2006, 8:23:21 UTC

Hi,

Clearly we are all here because we want to support the science that LHC will be doing in the future by supporting the engineering that is being done right now to build the LHC.

Equally clearly, from that point of view it does not matter if the work is spread fairly or not. If the sums are correct then that is all the engineers need.

However, many of us choose to add to that noble motive a secondary one: competing against others (teams etc). I have not got into that, but I still like to compete against myself, improving on my past scores. For any of us who care at all about scores, there is some importance in having the work spread out fairly.

This is sometimes suggested as being about the project's needs in a logical, scientific, or engineering sense; I don't think that is right, as clearly the work is being done OK, and anyway I would not presume to tell the project how to do any of those things.

It is about keeping participants happy, our "needs" in a social-science sense being an important part of any volunteer programme.

The easiest thing that could be done is to lower the quota while work is scarce: a small number, in single figures, on the day that work becomes available, increasing maybe 36 hours later if there is any work left. Or with runs of only 75,000 results like we had recently, you might find that they all got done leaving the quota small.

There were around 6000 hosts granted credit last time we had a substantial block of work. The initial replication on this project is 5, I think. So an easy calculation to set the quota size would be (where W is the number of workunits):

R = W * 5
Q = R / 6000

if Q < 4 then Q = 4 # set some lower limit
if Q > 100 then Q = 100 # and don't go above current quota

With the recent run of 15000 WU, this gives 75000 results and a quota of 12 or 13.

Notice that the quota only stops someone grabbing an unfair share on the first day - they can come back for more the next day if there are any left.

I'd ask you to consider trying this: set the quotas manually at the start of a batch of work and monitor the outcome to see if it has any impact on the time taken to get all the work back. In principle it could go either way - smaller quotas take longer to distribute, but smaller quotas also mean that individual machines turn work around faster.

If it were found to work, you could then automate it by writing code that did the same calculation once a day, on the basis of the number of results available in the system that day. Something along these lines is sketched below.
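
In rough code, the daily calculation might look like this (a minimal sketch only; the 6000-host figure and the replication of 5 are the assumptions above, and how the project actually counts its unsent workunits is up to the admins):

```python
# Rough sketch of the quota calculation suggested above, run once a day.
# Assumptions: ~6000 active hosts and an initial replication of 5;
# obtaining the unsent-workunit count is left to the project.

ACTIVE_HOSTS = 6000
INITIAL_REPLICATION = 5
MIN_QUOTA = 4     # set some lower limit
MAX_QUOTA = 100   # don't go above the current quota

def daily_quota(workunits_available: int) -> int:
    results = workunits_available * INITIAL_REPLICATION
    quota = results // ACTIVE_HOSTS
    return max(MIN_QUOTA, min(quota, MAX_QUOTA))

print(daily_quota(15000))   # recent run: 75000 results -> quota of 12
print(daily_quota(200000))  # plenty of work -> capped at 100
```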

Just a suggestion: it *is* your project. And of course it comes lower down the list than getting things working on the new server and getting Garfield settled in. But once you have time to look at it, it might make a lot of participants feel happier.

It might also upset some - anyone feel this would be a bad idea? Constructive criticism always welcome...

River~~

Mattia Verga
Joined: 27 Sep 04
Posts: 20
Credit: 23,880
RAC: 0
Message 14918 - Posted: 1 Oct 2006, 8:33:14 UTC

I agree with River. I think limiting the cache would also help speed up crunching. If a single user can download 50 or 100 WUs, where's the power of distributed computing? We all have to wait for that user to complete his queue before getting another batch of WUs.

Toby
Joined: 1 Sep 04
Posts: 137
Credit: 1,711,225
RAC: 1,293
Message 14923 - Posted: 1 Oct 2006, 9:03:14 UTC

The project-wide maximum work unit/day setting is not intended to be used like this. It is stored in the project configuration file. Changing it requires editing the file and then restarting at least the validator, and possibly other parts of the project as well. Even then, a given host record in the database will not pick up the change until it has a result go through the validator, unless manual steps are taken.

However, this is a bad idea, because the daily quota is not intended to be used as a fair-distribution method. It is there to minimize the impact of faulty hosts on the project. Without it, a bad host could download thousands of work units per day and upload garbage - or nothing, for that matter. With the quota system, a host has its quota reduced every time it sends back a bad result, eventually limiting it to 1/day until the host is fixed. If you start mucking with the quota manually, bad hosts could become a problem again. This could possibly be avoided, but it would take a fair amount of thought and effort. With the current state of the project I don't see that happening.
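
In rough terms the mechanism behaves something like this (an illustrative sketch only, not the real BOINC server code; the exact adjustment rule has varied between server versions, so the halving/doubling is just my approximation of "bad results cut the allowance, good results restore it"):

```python
# Illustrative sketch of the per-host daily quota behaviour described above.
# NOT the actual BOINC server code; treat the numbers as an approximation.

DAILY_QUOTA_MAX = 100  # project-wide maximum results/day per host

class Host:
    def __init__(self) -> None:
        self.daily_quota = DAILY_QUOTA_MAX

    def result_returned(self, valid: bool) -> None:
        if valid:
            # a good result gradually restores the allowance
            self.daily_quota = min(DAILY_QUOTA_MAX, self.daily_quota * 2)
        else:
            # a bad (or missing) result cuts it, down to 1/day
            self.daily_quota = max(1, self.daily_quota // 2)

faulty = Host()
for _ in range(10):
    faulty.result_returned(valid=False)
print(faulty.daily_quota)  # 1 -- a persistently bad host ends up at one result/day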

My opinion is "leave well enough alone". They are getting their work done, we are getting work units. As long as you are attached to multiple projects you will always have work to do.
- A member of The Knights Who Say NI!
My BOINC stats site

FalconFly
Joined: 2 Sep 04
Posts: 121
Credit: 592,214
RAC: 0
Message 14925 - Posted: 1 Oct 2006, 9:19:08 UTC - in response to Message 14918.  

If a single user can download 50 or 100 WUs, where's the power of distributed computing? We all have to wait for that user to complete his queue before getting another batch of WUs.


A point for consideration: some users can finish a batch of 100 WUs within a matter of hours, not days or even weeks. Technically, that's the power of distributed computing as well.

IMHO, the system is 'fair' enough as it is, as it doesn't matter who does the job.

Once the project admins decide it becomes a problem, they can easily reduce the deadline, but so far that's not the case.
Scientific Network : 45000 MHz - 77824 MB - 1970 GB

KWSN - A Shrubbery
Joined: 3 Jan 06
Posts: 14
Credit: 32,201
RAC: 0
Message 14935 - Posted: 1 Oct 2006, 17:06:51 UTC

I'll have to side with the leave it alone crowd.

Of the last five rounds of work, my computers missed out on the first four. Due to the nature of randomness, all four of them checked for work within 5 minutes of each other and all got work this time around.

As everyone is operating under the same rule set, it is inherently fair. If a computer is capable of downloading a large amount of work and completing it, I see no issue. If the work is intentionally distributed to the largest user base possible, then you will inevitably be dealing with some of the slowest systems out there, both in processing time and in internet connections.

So that leaves possible advantages to your proposal and probable disadvantages. I don't care for those odds.

In any case, this whole argument will be moot as soon as Garfield comes on line.

Gaspode the UnDressed
Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 14951 - Posted: 2 Oct 2006, 17:11:33 UTC
Last modified: 2 Oct 2006, 17:13:23 UTC

Not this again...

There have been endless arguments about fair distribution of work, and while I generally side with the 'keep your cache small' camp, there will always be the diehards who, rightly or wrongly, take large amounts of work just to leave it sitting in a cache for days. I can't be bothered to argue the point any more.

It is CERN's problem if the work takes too long to complete, and they are the people to decide if that is so, and to take action if it is needed. So, some time ago Chrulle implemented a dynamic deadline system which aims to optimise the return of work.

If this still doesn't fit your ideology, crunch other projects while there's no LHC work, and look forward to Garfield.


Gaspode the UnDressed
http://www.littlevale.co.uk

Trane Francks
Joined: 18 Sep 04
Posts: 71
Credit: 28,399
RAC: 0
Message 15027 - Posted: 9 Oct 2006, 12:10:59 UTC

Our good friend Gaspode, who I thought had already left the project for personal reasons, really summed it up well here. I, too, dislike the "huge cache" group who inevitably steal WUs away from other users. On the other hand, I also have to say that I side with the "don't mess with it" bunch. The system works. I don't get work as often as I'd like and, because I'm crunching for a CPDN-related project (SAP) that needs its results returned practically immediately to be of use to the first-tier research, I actually have all my CPU power diverted, so I missed this last round entirely.

Oh, well. That's life when you're dealing in 600-hour work units and the project is winding down.

Gaspode got it right: crunch other projects so that you're not idling and forcing BOINC to connect just to grab whatever work is there. Each of my BOINC clients runs between 5 and 7 projects at any given time, unless project requirements/deadlines interfere - and CPDN/SAP have excelled at interfering. Heh. ;-)

Anyway.

uioped1
Joined: 26 Aug 05
Posts: 18
Credit: 37,965
RAC: 0
Message 15035 - Posted: 10 Oct 2006, 3:30:53 UTC

I have ended up with a long-ish post, which might disguise the fact that I share your sentiment. I felt the need to clear up a perceived misconception.

It is important to note the distinction between having the work finished, and having all the results returned.

If you looked at the graphs shortly after the end of this run, you noticed that shortly after the 7th (I think) the remaining results plummeted as the last workunits missed their deadline. Because these results were not re-issued (the graph stayed down at 0), and also because of the small number of them (~1000, from a batch of results significantly greater than 5000), we can deduce that none of those results were needed to complete a quorum.

If we assume that most of the quorums are reached shortly after the project falls below 4/5 of the initial batch size, this batch was actually done after about 3 days. (This is all from memory, as the relevant parts of the curve have all fallen off the graphs by now. Forgive my approximations, please.)

From the rapidity with which the slope decreased, you can tell that the ideal return for the project would have been somewhere between 1 and 1.5 days shorter (that 4/5 target).

My point is that CERN doesn't have a lot to gain from decreasing the quotas. In fact, any quota system where very fast hosts are throttled will lengthen the initial part of the curve, even if it does provide some benefit at the tail end of the curve from choking off greedy hosts. Also, since we know that the trailing results are unnecessary, we cannot blame them for the delay in creating new work.

Furthermore, if Garfield does come on line, and if that app provides more steady work, any benefit that would have been provided by the quota would be rendered irrelevant, while the theoretical harm would remain.

BOINC has another method more suited to this problem, which is the use of deadlines to prevent waiting for slow caches. I believe that the scheduler does a calculation to make sure that work can be completed before the deadline before it gives you any. However, this would cause other people annoyance, as it throws the host's scheduler for a loop. (And everyone who runs this project really ought to be running other projects. I strongly recommend Rosetta@home.)

A third possibility would be to simply increase the initial replication; however, I don't think that anyone would like this, as it would mean much less value for the work you do return.

Personally, I would love to see tighter deadlines. I hate sitting around watching the outstanding results trickle down when I haven't had work for days.

Full disclosure: One of my hosts had a configuration error and downloaded far too much work this round. The last 1/4 were put to no good use, being the last results returned for their respective WUs.

River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15045 - Posted: 11 Oct 2006, 5:05:22 UTC
Last modified: 11 Oct 2006, 5:46:51 UTC

Well, not a lot of support for my idea. And now we have another release of work - another micro-release, of 8000 results, with about 6000 of us chasing them.

From another thread, what do I see:

43 new units here.

and later, after someone notices those 43 tasks are on a slow computer, the owner says:

Oh, I'd be happier if the 43 units were on a dual 2 GHz Xeon computer.

The dice rolled differently.


I've got 11 machines; just one got work, and just one WU - having got nothing in the last three rounds of work. And someone boasts of getting 43... OK, it is not his fault he got lucky in the lottery, but do you not see that there is a human issue here?

Those of you who make postings that cajole users not to use large cache settings, can you not see how an experience like this makes me want to set a bigger cache next time round, to make sure I get something that will reclaim my lost position in the stats?

With 10000 results split between about 6000 of us trying to get them, I still reckon it would be much fairer to have a very small quota on these very small runs. Just one result in every third release is not going to be enough to catch up the position I have lost.

With 6000 boxes asking for work, each of them several times in each 24 hours, and only 10000 tasks in the new release, there is no technical advantage to the project in giving anyone more than one or two tasks at a time, much less in giving 43 tasks to a 600MHz box that will take many days to crunch them. The only technical advantage is the inertia of leaving the code as it is.

I do not blame the guy for getting so many WU; if the project hands them out like a lottery, then the only way to keep one's position intact is to grab as many as we can when we get the chance. The solution lies with the project, not with throwing flak at those who win the lottery.

The current setup actively encourages people to set odd cache settings to make sure they get multiple WUs when any are on offer.

Several people have made the point that the current system is working for the project, so it should not be tampered with. I disagree. The current system is working only on a technical level; it is failing on a human level, in that the credit system, which is supposed to be an incentive, is becoming a discouragement to many participants.

Yes, I understand that the project does not have the resources to make changes at present, but I still hope that some technical fix to this human problem can be found at a later date, and I feel that this is something that could usefully be put on the project's to-do list.

In short, it makes sense not to try to run a project that has sporadic small bursts of work on the same settings that were appropriate when it had large bursts of work. It makes sense at least to timetable a slot to look at the issue, when someone is in post and when the fire fighting is all sorted.

In my opinion anyway.

And finally, in another thread someone refers to this thread as if it were a complaint. It is not meant to be. I am not complaining about the project, nor about the lucky lottery winner, though I am envious.

I love this project and want it to continue to do well. I see this as an issue that in the long run will undermine the project from within its donor community, and I raise the issue here in a friendly way (I hope!) as part of my support for this project. The envy that the lottery effect produces is a bug in the human aspects of this project, just as much as a glitch in the code is a bug in the technical performance. Just as I would report a technical bug without complaining about it, I am again drawing attention to the way the current software settings are invoking a liveware bug.

R~~

Keck_Komputers
Joined: 1 Sep 04
Posts: 275
Credit: 2,652,452
RAC: 0
Message 15053 - Posted: 11 Oct 2006, 7:11:05 UTC

I tend to agree that the project's settings need tweaking. I would think that adding/editing the following tags in the config.xml file would improve things for all concerned. This advice may need reconsidering if/when there is a steady supply of work.

<max_wus_to_send>5</max_wus_to_send>
This would allow the server to send no more than 5 results per scheduler RPC.
<min_sendwork_interval>600</min_sendwork_interval>
This would make the host wait 10 minutes before getting more work.

Since the daily quota is unchanged, these settings would still allow fast hosts or hosts with large queues to fill up. However, it would delay them, so if there is only a small supply of work it will be spread out more. For example, it would take about 90 minutes for a host to get 43 workunits, as the rough calculation below shows.
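
(The arithmetic behind that figure, for what it's worth - just the number of scheduler contacts times the enforced wait, ignoring download and crunch time:)

```python
import math

def minutes_to_fetch(results_wanted: int,
                     max_wus_to_send: int = 5,
                     min_sendwork_interval_s: int = 600) -> float:
    """Minimum minutes for one host to accumulate `results_wanted` results,
    ignoring download and crunching time."""
    rpcs = math.ceil(results_wanted / max_wus_to_send)
    # the enforced wait happens between scheduler contacts
    return (rpcs - 1) * min_sendwork_interval_s / 60

print(minutes_to_fetch(43))  # 80.0 -- roughly the "about 90 minutes" above
```
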
BOINC WIKI

BOINCing since 2002/12/8

AnRM
Joined: 14 Jul 05
Posts: 13
Credit: 424,554
RAC: 0
Message 15059 - Posted: 11 Oct 2006, 11:39:46 UTC
Last modified: 11 Oct 2006, 11:45:03 UTC

This lottery business is very frustrating... I hope 'Garfield' lives up to its fat-cat namesake and provides a sustained workload for all. John's proposal seems very practical and fair, but I won't hold my breath... Cheers, Rog.

River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15060 - Posted: 11 Oct 2006, 11:50:45 UTC - in response to Message 15053.  
Last modified: 11 Oct 2006, 11:51:39 UTC

I tend to agree that the project's settings need tweaking. I would think that adding/editing the following tags in the config.xml file would improve things for all concerned. ...

<max_wus_to_send>5</max_wus_to_send>
This would allow the server to send no more than 5 results per scheduler RPC.
<min_sendwork_interval>600</min_sendwork_interval>
This would make the host wait 10 minutes before getting more work.


This seems a good idea, and probably better than my quota adjustment idea due to the drawbacks claimed by others.

One drawback of these settings, however, might be that encouraging hosts to come back for more after ten minutes could increase the network congestion. I wonder how far the interval could be extended?

Is this figure of 5 per CPU or per host?

If it is per CPU, then it would seem to me that a 30-minute sendwork interval would make sense. If it is per host, then even 20 minutes might be too long for the dual-dual boxes out there (i.e. two chips, each with two cores).

The reason for thinking about a 30-minute interval is that some boxes will be on a 4-hour standoff. The boxes that get lucky will come back in 10 minutes, repeatedly, and in 90 minutes will have emptied the queue, while those still in the long standoff will still miss out.

With a 30-minute interval it would take over 4 hours for a box to grab 43 WU, and the other guy will likely have checked in at least once in that time.
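
(Back-of-envelope again, assuming the 5-per-RPC limit suggested above:)

```python
import math

rpcs = math.ceil(43 / 5)       # scheduler contacts needed at 5 results each
hours = (rpcs - 1) * 30 / 60   # eight 30-minute waits between those contacts
print(rpcs, hours)             # 9 contacts, 4.0 hours before crunching time is added
```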

If 30 minutes would be too long, then the 10-minute interval would still be better than the current settings, because at least some of the other hosts would get a look in.

R~~

sysfried
Joined: 27 Sep 04
Posts: 282
Credit: 1,415,417
RAC: 0
Message 15062 - Posted: 11 Oct 2006, 12:12:32 UTC - in response to Message 15060.  

Would a truckload of WUs solve your problem / end the discussion?

My wild guess is that the 8000 WUs that were out were a test run after the system was moved to the new hardware... Don't worry, I'm sure that there will be work...

Hey, I just turned 32 today. That gives me at least another 32 years of time to wait for LHC or CERN WUs... ;-)

*opens a bottle of champagne (the French one) for all the members with > 1 total credit*

;-) *PARTY*

watnou
Joined: 1 Sep 04
Posts: 101
Credit: 1,395,204
RAC: 0
Message 15063 - Posted: 11 Oct 2006, 12:27:56 UTC - in response to Message 15062.  

I think this discussion will never end, because there is always someone who thinks it's unfair, whatever the solution is.

Personally, I think it's fair if I get all the WUs and the rest of you get the leftovers, but I suspect that I'm in the minority on this. :)


Would a truckload of WUs solve your problem / end the discussion?

My wild guess is that the 8000 WUs that were out were a test run after the system was moved to the new hardware... Don't worry, I'm sure that there will be work...

Hey, I just turned 32 today. That gives me at least another 32 years of time to wait for LHC or CERN WUs... ;-)

*opens a bottle of champagne (the French one) for all the members with > 1 total credit*

;-) *PARTY*



Conan
Joined: 6 Jul 06
Posts: 108
Credit: 663,175
RAC: 0
Message 15064 - Posted: 11 Oct 2006, 13:05:36 UTC

I believe that a fairer distribution of work may or may not solve the distribution problems, simply because not everybody has the same settings on their computers. You would need the same settings on all computers for them to have an equal chance of getting work, and in any case not all computers contact the server at the same time, so some will miss out when only small amounts of work are available.
I don't tamper a great deal with settings, but have on occasion needed to adjust a number of projects. In the case of LHC@home I basically set and forget, in the hope I will get something from the server one day (my computer has contacted the LHC servers over 3600 times without success over 4 months).
Why? Well, since joining on 6/7/06 (I think), today, the eleventh day of October 2006, is the first day that I have finally gotten some work and now have credits to my name.
I know some grab more WUs than others; I got 11 this time round (don't forget these are my first WUs), and I have seen a couple with 39 and 33.
With a default setting of 500 WUs a day allowed per CPU, when a lot of work is available I believe a lot of computers will be swamped and won't complete the work on time.
If what I have said is off the mark, you can let me know, as it will increase my understanding of how things work.
All I can add is I am very happy to get my very first WU's and they have not failed.
Where is the party? I have now joined everyone with >1 Cobblestone.

>> Keep smiling for it makes others wonder what you have been up to.

uioped1
Joined: 26 Aug 05
Posts: 18
Credit: 37,965
RAC: 0
Message 15067 - Posted: 11 Oct 2006, 17:52:06 UTC

This time, I will try for succinctness, I promise. :)

With the current batch of WUs (it looked like there were about 10000), my estimate for the ideal time to completion is 32 hours, with statistical completion in 24. Of course, if everyone had gotten at least 1 WU, that would have been cut down to about 6... We'll see how far off reality comes. My suspicion is that there will be units in progress until the deadline.

If they had set the deadline to a conservative 2 days, we never would have seen the egregious abuses River pointed out.

He's also right that keeping the majority of users happy should be a priority, although I think that this project generally has a brick wall for ears.

I really want to see tighter deadlines for this project, and we don't even need to go as far as chrulle's adjustable ones.



Keck_Komputers
Joined: 1 Sep 04
Posts: 275
Credit: 2,652,452
RAC: 0
Message 15068 - Posted: 11 Oct 2006, 22:51:20 UTC - in response to Message 15060.  

I tend to agree that the project's settings need tweaking. I would think that adding/editing the following tags in the config.xml file would improve things for all concerned. ...

<max_wus_to_send>5</max_wus_to_send>
This would allow the server to send no more than 5 results per scheduler RPC.
<min_sendwork_interval>600</min_sendwork_interval>
This would make the host wait 10 minutes before getting more work.


This seems a good idea, and probably better than my quota adjustment idea due to the drawbacks claimed by others.

One drawback of these settings, however, might be that encouraging hosts to come back for more after ten minutes could increase the network congestion. I wonder how far the interval could be extended?

Is this figure of 5 per CPU or per host?
This is per RPC or scheduler connection.

If it is per CPU, then it would seem to me that a 30-minute sendwork interval would make sense. If it is per host, then even 20 minutes might be too long for the dual-dual boxes out there (i.e. two chips, each with two cores).

The reason for thinking about a 30-minute interval is that some boxes will be on a 4-hour standoff. The boxes that get lucky will come back in 10 minutes, repeatedly, and in 90 minutes will have emptied the queue, while those still in the long standoff will still miss out.

With a 30-minute interval it would take over 4 hours for a box to grab 43 WU, and the other guy will likely have checked in at least once in that time.

If 30 minutes would be too long, then the 10-minute interval would still be better than the current settings, because at least some of the other hosts would get a look in.

R~~

You want the settings loose enough that a host can almost process all of the work received in one RPC before the next RPC. With the new Core 2 chips, and the way most of the work lately has ended early, these settings may be too restrictive.

As mentioned in another post, changing to very short deadlines may be a better way to adjust things. For this to work, though, the server has to be set not to send work if it cannot be completed in time. It would cut out participants with large queues: any queue longer than the deadline (or half the deadline) would never get work.
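
The check amounts to something like this (a sketch of the idea only, not the actual scheduler code; whether half or all of the deadline is used depends on the server version):

```python
def willing_to_send(estimated_runtime_s: float,
                    queued_work_s: float,
                    deadline_s: float,
                    fraction: float = 0.5) -> bool:
    """Sketch of the feasibility test described above: refuse to send a task
    if the host's existing queue plus the task's estimated runtime would not
    fit inside (a fraction of) the deadline."""
    return queued_work_s + estimated_runtime_s <= deadline_s * fraction

# A host with a 3-day queue never qualifies for a task with a 2-day deadline:
print(willing_to_send(6 * 3600, 3 * 86400, 2 * 86400))  # False
```
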
BOINC WIKI

BOINCing since 2002/12/8

River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15077 - Posted: 12 Oct 2006, 6:33:15 UTC - in response to Message 15068.  
Last modified: 12 Oct 2006, 6:41:41 UTC

As mentioned in another post changing to very short deadlines may be a better way to adjust things. [...] Any queue longer than ... half the deadline would never get work.


some would see this as an added bonus ;-)

But I am not sure it is correct. If you have two projects and queue = half the deadline, then it will try to get more work from LHC at times. If the other project has a long deadline, then it will allow the fetch on the grounds that the LHC work can be snuck in in front of the already long queue. If it were not so, surely there would be problems running CPDN with any other project?

I have seen this happen, i.e. new work snuck in in front of a running CPDN task, but I am not sure on how recent a client.
R~~

Keck_Komputers
Joined: 1 Sep 04
Posts: 275
Credit: 2,652,452
RAC: 0
Message 15080 - Posted: 12 Oct 2006, 9:52:45 UTC - in response to Message 15077.  

As mentioned in another post changing to very short deadlines may be a better way to adjust things. [...] Any queue longer than ... half the deadline would never get work.


some would see this as an added bonus ;-)

But I am not sure it is correct. If you have two projects and queue = half the deadline, then it will try to get more work from LHC at times. If the other project has a long deadline, then it will allow the fetch on the grounds that the LHC work can be snuck in in front of the already long queue. If it were not so, surely there would be problems running CPDN with any other project?

I have seen this happen, i.e. new work snuck in in front of a running CPDN task, but I am not sure on how recent a client.
R~~

The problem is that the server thinks the host will not be connecting again in time to report the task; it does not matter whether the task could be processed in that time. Whether it is half or all of the deadline depends on what version of the server software is in use. The client will process the task first if the server sends it, though.
BOINC WIKI

BOINCing since 2002/12/8

Philip Martin Kryder
Joined: 21 May 06
Posts: 73
Credit: 8,710
RAC: 0
Message 15087 - Posted: 13 Oct 2006, 3:45:28 UTC

Given two distribution methods, what is the metric that shows one is "fairer" than the other?

Should the distribution consider the speed of the machines receiving the work?

Put another way, is it fair to give a 3 GHz machine the same number of workunits as a 1 GHz machine?
Or should the 3 GHz machine receive three times the workunits of the 1 GHz machine?

Should someone who invests in broadband be given the same number of work units as someone who only pays for dialup?

Should a machine that completes a group of tasks in 3 days receive the same number of work units as a machine that completes the same number in one day?
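
To make the speed-proportional reading concrete (purely illustrative numbers, not project policy):

```python
# One possible answer to the questions above: split a batch in proportion to
# host speed, so the 3 GHz machine gets three times the work of the 1 GHz one.

hosts = {"1 GHz box": 1.0, "3 GHz box": 3.0}   # relative speeds
batch = 40                                      # results to hand out

total_speed = sum(hosts.values())
shares = {name: round(batch * speed / total_speed) for name, speed in hosts.items()}
print(shares)  # {'1 GHz box': 10, '3 GHz box': 30}
```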



