Message boards : Number crunching : Did everyone get work 02 Nov UTC?
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15351 - Posted: 4 Nov 2006, 8:03:54 UTC - in response to Message 15348.  
Last modified: 4 Nov 2006, 8:04:26 UTC

This posting is quite complex. If it looks like more than you want to bother with, then I strongly recommend Mike's two rules in a nearby post. They will work for people with reasonable settings for other parameters and have the huge advantage of being simple to apply.

Philip asked:

Is it correct that Resource share is not considered at all in your formulas?


Resource share is not in those formulae, but is also something to think about. And, being me, I do of course have some algebra to offer. But first the background.

If a project has a resource share so low that a cacheful cannot be completed within its deadline, the client will respond by periodically banning just that one project from fetching work, without preventing the other projects from getting work. We call this NLTD (negative long term debt).

In contrast, the situation we mentioned before is where the client temporarily bans *all* projects from downloading, called NWF (no work fetch).

NWF occurs when filling every project's cache at once would mean that deadlines could not be met.

NLTD occurs when the resource share given to a single project is insufficient to crunch a cacheful of that project's work in time.

Example of NLTD

You give Rosetta 0.1 of the overall resources, and have a connect interval of 3 days. This means a full cache would take 30 days elapsed, but you have given the client a challenge as the work needs to be back in only 10 days. The client responds by running the Rosetta work on time, then banning just Rosetta for around 20 days to let the other projects catch up with their appointed shares.

During those 20-day periods of NLTD, clicking Update on Rosetta will not get any work, but other projects will be downloading work fine. It looks spooky, but the client is in fact doing what you asked, just on a longer timescale.
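
To make the arithmetic in that example explicit, here is a tiny sketch (Python); the numbers are just the ones above, and the variable names are mine:

share = 0.1            # Rosetta's fraction of the overall resource share
connect_days = 3.0     # connect interval C, in days
deadline_days = 10.0   # deadline D, in days

# At a 10% share, a 3-day cacheful takes 3 / 0.1 = 30 days of elapsed time.
elapsed_days = connect_days / share
if elapsed_days > deadline_days:
    # The client crunches Rosetta early to meet the deadline, then bans just
    # Rosetta for roughly the difference so the other projects catch up.
    print("NLTD for about", elapsed_days - deadline_days, "days")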

All the odd behaviour of the client in choosing what to download, and what not to, is due to the interaction of these two different issues, NWF and NLTD. When the client is behaving oddly it is because it has noticed something you haven't.

To avoid NLTD, then, for each project we need

(SP / ST) > C / D

where SP is the resource share for that project, ST is the total resource share across all projects on that machine, and C and D are as before. I will leave you to figure out how it works from the example given.

As with my other rule of thumb, in the real world if you get close to the boundary you will find periods of a few hours where you do drop into NLTD. As before, this is because of variation in the run lengths of tasks and because the client grabs slightly more than a cacheful.

There is another issue to consider on resource share. We don't mind if LHC is pushed into NLTD after a period of work, as we got our cacheful and crunched it. So the formula above applies when regular work is available, but not for LHC in the current situation.

However, we *do* mind if LHC is still in NLTD when the next work is available - if we miss it there might not be any for weeks!

John's advice is to give LHC a double resource share compared to your typical other projects. LHC will run, pay off any debt quickly, and be ready to run again next time there is work. Because this project has a double share, there are N+1 shares to allocate, so

(SP / ST) > 2 / (N+1)

There is no need to share out the other projects equally if you don't want to. If one or more of them go into NLTD that will not harm LHC.

Don't give LHC too big a resource share either, or other projects will refuse to give your client work until the box is totally empty - this is where you get the "Won't finish in time" message. My suggestion is not more than a triple share.

So we end up with four formulae

C < D / (N+2); # avoid NWF

(SP / ST) < 3 / (N+2); # let other projects finish

and when work is fairly continuous I suggest

(SP / ST) > C / D; # avoid NLTD

else when work is intermittent I adopt John's suggestion

(SP / ST) > 2 / (N+1); # recover fast from NLTD
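
If you would rather let a machine do the checking, here is a rough sketch (Python) of those four rules. The numbers are made up purely for illustration, my reading is that N counts the attached projects with LHC included, and none of this is anything the client itself computes:

N  = 5            # attached projects in total, LHC included (my reading of N)
C  = 0.5          # connect interval (cache), in days
D  = 7.0          # shortest deadline among the projects, in days (assumed)
SP = 2.5          # resource share given to LHC
ST = SP + (N - 1) * 1.0   # other projects left on single shares

print("avoid NWF:                  ", C < D / (N + 2))
print("let other projects finish:  ", SP / ST < 3 / (N + 2))
print("avoid NLTD (continuous):    ", SP / ST > C / D)
print("recover fast (intermittent):", SP / ST > 2 / (N + 1))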


Hope that helps. If not then I again recommend Mike's advice instead.

R~~
ID: 15351
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15352 - Posted: 4 Nov 2006, 8:52:41 UTC - in response to Message 15348.  
Last modified: 4 Nov 2006, 9:44:03 UTC

I have several projects.
One of them, SZTAKI seems not well behaved.
Sometimes the WUs don't complete in anywhere near the estimated time.
Sometimes they seem to not complete at all.
So I suspend it "a lot" and run it every few weeks to see if it is doing any better.


My first suggestion then is to use the formulae with N set to include SZTAKI.

My second suggestion is to set "No Work" from the manager's Projects tab, let the test work run out (or abort it), and then leave the project active but unable to fetch work. The client will be happier with that than with suspending. Then, at the appropriate time, simply click "Allow new Work" instead of Resume.

Only use suspend in the unlikely event that there is a reason to stop a project running that has work on board at the time. Whenever you feel like suspending an empty project, No Work is a better option, imo.

Finally, you mention erratic run lengths, and this is an issue in itself.

WUs completing with a runtime very different from the estimate cause problems later on, particularly if that affects the majority of tasks.

The client "learns" that the project runs short, and will ask for more work to compensate next time around. If the next batch's estimates are more accurate, this can overload your cache and cause NLTD and/or NWF. So, perversely, if SZTAKI solve that issue it might look to you as though something got worse!

WUs ending with an error are not so bad, as the client does not try to learn from them.

So my third suggestion, to prevent issues from the last test disrupting the current test, is to consider resetting the project just after the work runs out each time. You are asking the client to forget, and it is a matter of human judgment whether the good outweighs the bad in the client's recent experience of SZTAKI: if the bad outweighs the good, then reset; if not, don't.

I have sometimes used reset on LHC for this very reason; see this thread.
On SZTAKI the erratic run lengths may be due to a bug, or may be inherent in the maths.

R~~
ID: 15352
Philip Martin Kryder

Joined: 21 May 06
Posts: 73
Credit: 8,710
RAC: 0
Message 15355 - Posted: 4 Nov 2006, 15:24:38 UTC - in response to Message 15351.  

....
John's advice is to give LHC a double resource share compared to your typical other projects....



So resource share is "catered for" in the assumptions underlying the formula.
And that also explains where the N+1 in the denominator comes from.

ID: 15355
m.mitch

Joined: 4 Sep 05
Posts: 112
Credit: 2,068,660
RAC: 379
Message 15357 - Posted: 5 Nov 2006, 4:30:47 UTC - in response to Message 15349.  
Last modified: 5 Nov 2006, 4:32:00 UTC

....
(1) set the cache to 2 hours. (1 day/12 hours = 0.0833333333333333333333333333333333334) It fits.
....


Why 2 hours instead of say 70 minutes? (.05)


The cache could be set to 3 hours (3 hours = 1/8 of a day = 0.125).

Between 0.083... and 0.125 lies 0.1, the default setting. It does just fine. ;-)
I was shocked that so many decimal places showed up on the other platforms; I thought it would have been truncated at about four.
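
For anyone puzzling over those fractions, they are just hours divided by the 24 hours in a day - a trivial Python sketch:

# Cache sizes are entered as a fraction of a day.
for hours in (2, 3):
    print(hours, "hours =", hours / 24, "of a day")   # 0.0833..., 0.125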



Click here to join the #1 Aussie Alliance on LHC.
ID: 15357
PovAddict
Joined: 14 Jul 05
Posts: 275
Credit: 49,291
RAC: 0
Message 15367 - Posted: 6 Nov 2006, 21:23:59 UTC

There are users with 10-day caches trying to grab all they can. What about limiting the cache server-side? On my project, I limited users to have at most 3 workunits in progress (per CPU), and I could get all work done much faster, plus all users got something. On LHC, I can see many users don't get anything (because a few get all) and also I can see "Workunits in progress ~10000" for quite a long time. Scientists have to wait weeks to get work finished, while there are lots of computers "idle" in LHC.
ID: 15367
NJMHoffmann

Joined: 26 Nov 05
Posts: 16
Credit: 14,707
RAC: 0
Message 15371 - Posted: 6 Nov 2006, 23:15:20 UTC - in response to Message 15351.  

River~~ wrote:
You give Rosetta 0.1 of the overall resources, and have a connect interval of 3 days. This means a full cache would take 30 days elapsed, ...

Perhaps you should mention that this has changed (or will change) with BOINC versions > 5.4.x. The current development scheduler will in this case keep a queue of 0.3 days of CPU time for Rosetta, which will happily be crunched in 3 days.

Norbert

PS: You'll have to rework all your formulas ;-)
ID: 15371
Keck_Komputers
Joined: 1 Sep 04
Posts: 275
Credit: 2,652,452
RAC: 0
Message 15372 - Posted: 6 Nov 2006, 23:58:06 UTC - in response to Message 15371.  


PS: You'll have to rework all your formulas ;-)

No reworking; they all get dumped. We can then go back to simple advice: set the queue to the maximum time you expect your local ISP to be out at any one time, or less if you are attached to CPDN or many projects.
BOINC WIKI

BOINCing since 2002/12/8
ID: 15372
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15381 - Posted: 7 Nov 2006, 15:37:06 UTC - in response to Message 15371.  
Last modified: 7 Nov 2006, 15:38:19 UTC

... that this has changed (or will change) with BOINC versions > 5.4.x. The current development scheduler will in this case keep a queue of 0.3 days of CPU time for Rosetta, which will happily be crunched in 3 days.
... PS: You'll have to rework all your formulas ;-)


Doh!!!!

I welcome the change as it makes a lot of sense - for one thing the amount of work held in large caches will vary from project to project.

But it is not quite correct that it will always keep a queue that small. On 5.4.9 (linux), when the host is unable to connect to other projects it still downloads a full cache from the project it can get to. So new formulae are needed for those who want to guard in case that happens.

The 5.4.9 approach does not protect against an N-day outage; it protects only against an outage within N days of the previous connection, which is not the same thing at all. It allows a box to run almost empty in some cases before re-filling to a full cache. That means we now have no way of protecting against an N-day outage, except for scheduled outages, if we can manually fill up immediately beforehand.

It also means that John's advice (unusually for him) is actually rather misleading. Set the interval to the typical downtimes of your ISP and you will typically run out of work half way through an outage, having worked off half the cache before the ISP went down. Setting double the max expected outage will, on average, cover you for three-quarters of ISP downtime, because it covers you completely half the time and partially the other half of the time. There is no "safe" setting under 5.4.9.

There is a long history of trying different things with cache sizes. I still say we need to go back to a two-setting system where we set a low tide and a high tide, or a low tide and a connection interval, but that is an argument for another time and place ;-)

R~~
ID: 15381
NJMHoffmann

Joined: 26 Nov 05
Posts: 16
Credit: 14,707
RAC: 0
Message 15382 - Posted: 7 Nov 2006, 18:41:02 UTC - in response to Message 15381.  

It also means that John's advice (unusually for him) is actually rather misleading. Set the interval to the typical downtimes of your ISP and you will typically run out of work half way through an outage, having worked off half the cache before the ISP went down.

I think John is right, if you're "always on". The download scheduler downloads the moment the queue for a project falls below your "connect interval". So per project your queue is always between <connect interval> and <connect interval + 1 WU>.

Norbert
ID: 15382
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15384 - Posted: 7 Nov 2006, 20:48:33 UTC - in response to Message 15382.  

It also means that John's advice (unusually for him) is actually rather misleading. Set the interval to the typical downtimes of your ISP and you will typically run out of work half way through an outage, having worked off half the cache before the ISP went down.

I think John is right, if you're "always on". The download scheduler downloads the moment the queue for a project falls below your "connect interval". So per project your queue is always between <connect interval> and <connect interval + 1 WU>.


Then I was totally confused by a combination of one project refusing to send work, and another having a large -ve LTD :( The interactions between the different cases are quite complex.

But presumably you mean it is between <(connect intvl) * (resource share)> and <ditto + 1 WU> ?

That certainly makes more sense.

I am not sure what it does when a project refuses work, as Rosetta is doing to me sporadically with complaints that my atticware does not have enough memory. Should the shares be recalculated to ignore that project, so that the other projects get more work?

As you can tell I am still confused by what the 5.4 clients are doing...
R~~

ID: 15384
NJMHoffmann

Joined: 26 Nov 05
Posts: 16
Credit: 14,707
RAC: 0
Message 15385 - Posted: 7 Nov 2006, 21:08:57 UTC - in response to Message 15384.  
Last modified: 7 Nov 2006, 21:09:20 UTC

But presumably you mean it is between <(connect intvl) * (resource share)> and <ditto + 1 WU> ?
It's even a bit more complicated, because "on fraction", "run fraction" and "CPU efficiency" must be (and are) used.
As you can tell I am still confused by what the 5.4 clients are doing...
What confuses me more is that I don't remember whether the 5.4 clients already had this download scheduler or whether it was introduced some way into 5.5 :-)

Norbert
ID: 15385
NJMHoffmann

Joined: 26 Nov 05
Posts: 16
Credit: 14,707
RAC: 0
Message 15386 - Posted: 7 Nov 2006, 21:13:30 UTC - in response to Message 15384.  

Should the shares be recalculated to ignore that project, so that the other projects get more work?
The projects are ignored as soon as they no longer have any WU on the client. Not quite right, but better than working with the original shares.

Norbert

ID: 15386
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15387 - Posted: 8 Nov 2006, 1:45:34 UTC - in response to Message 15385.  

But presumably you mean it is between <(connect intvl) * (resource share)> and <ditto + 1 WU> ?
It's even a bit more complicated, because "on fraction", "run fraction" and "CPU efficiency" must be (and are) used.

aha!

On Fraction - that explains my boxes' reluctance to pick up work.

I run most of my boxes in winter only (they heat my lounge nicely), and the boxes that have been turned on recently are reluctant to ask for work. Apart from editing the client state file by hand, is there a way of resetting On Fraction?

R~~
ID: 15387
PovAddict
Joined: 14 Jul 05
Posts: 275
Credit: 49,291
RAC: 0
Message 15388 - Posted: 8 Nov 2006, 1:47:50 UTC - in response to Message 15387.  

aha!

On Fraction - that explains my boxes' reluctance to pick up work.

I run most of my boxes in winter only (they heat my lounge nicely), and the boxes that have been turned on recently are reluctant to ask for work. Apart from editing the client state file by hand, is there a way of resetting On Fraction?

R~~

Get/Make a program that does the editing for you...? Or maybe reinstall BOINC.
:D
ID: 15388
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15389 - Posted: 8 Nov 2006, 2:00:52 UTC - in response to Message 15372.  
Last modified: 8 Nov 2006, 2:24:03 UTC


PS: You'll have to rework all your formulas ;-)

No reworking; they all get dumped. We can then go back to simple advice: set the queue to the maximum time you expect your local ISP to be out at any one time, or less if you are attached to CPDN or many projects.


Not so. The formulae are designed to stop you going into NWF, or to make sure you only dip in for brief periods (specifically less than 4hrs at a time, so you don't miss work on LHC).

One formula that immediately comes to mind is something like

C << 0.8 * D

In other words, if you set your cache longer than the shortest deadline for any project, then whenever that project got a cacheful it would always go into NWF. If it gets into the 10% guard band it goes into NWF, so use a 20% guard band to try to avoid NWF.

The formula states "much less" because it does not take account of the fact that the NWF tests also include the cache. The real formula may turn out to be something like

C < 0.3 * D

once that is taken into account.
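
As a toy illustration of those two guesses (the 0.8 and 0.3 factors are only the rough estimates above, not confirmed client behaviour), in the same Python style:

def cache_looks_safe(C, D, guard_factor):
    # True if a cache of C days stays clear of a D-day shortest deadline.
    return C < guard_factor * D

D = 7.0   # shortest deadline, in days (assumed)
for guard in (0.8, 0.3):
    safe = [c for c in (0.5, 1, 2, 3, 5) if cache_looks_safe(c, D, guard)]
    print("guard factor", guard, "-> safe cache sizes:", safe)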

Have the tests for EDF and NWF changed with this change in cache policy? They should have done, as the amount of work expected to be downloaded at the next fetch is rather less (one WU at a time, not a cacheful).

The sting in the tail is that when LHC is empty, the other projects will fill up more. This means that if LHC gets a few WU, suddenly the box thinks it is overcommitted, as LHC's share of the cache is now taken by other projects. I have just seen this happen on one of my clients. It has less than one day's work, but because it has a 2-day cache and the whole of the one day's work held locally is from a project with a 3-day deadline, it won't get any more work from anywhere.

That might mean that we need a more complicated formula, re-introducing the resource share of the intermittent project. But in the meantime, sticking with the default 0.1-day cache looks like a winner.

R~~
ID: 15389
NJMHoffmann

Joined: 26 Nov 05
Posts: 16
Credit: 14,707
RAC: 0
Message 15391 - Posted: 8 Nov 2006, 8:22:59 UTC - in response to Message 15389.  

C << 0.8 * D
Make that C < 0.8 * (D - (1day + switch_int)), because that's the time Boinc tries to send the result back.
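
For example, with a 7-day deadline and the default one-hour switch interval (both values assumed here, just to show the arithmetic), a quick Python check gives:

D = 7.0                      # shortest deadline, in days (assumed)
switch_int = 60 / (24 * 60)  # 60-minute switch interval, expressed in days
print("cache should stay under about",
      round(0.8 * (D - (1.0 + switch_int)), 1), "days")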

Norbert
ID: 15391
Keck_Komputers
Joined: 1 Sep 04
Posts: 275
Credit: 2,652,452
RAC: 0
Message 15398 - Posted: 8 Nov 2006, 23:39:50 UTC - in response to Message 15391.  

C << 0.8 * D
Make that C < 0.8 * (D - (1day + switch_int)), because that's the time Boinc tries to send the result back.

Norbert

Damn. You and River are right; we will still have to deal with formulas to prevent NWF. This should be the most accurate one. EDF is mostly gone though: a better CPU scheduler spreads the extra time needed throughout the time the task is on the host, i.e. if a task needs 10% of the CPU to finish on time it will get that fraction even if its resource share is much lower, rather than getting 100% when the deadline is near.

Also beyond a certain number of projects the queue will not matter in most cases. The min queue will always be 1 task per project until NWF kicks in.
BOINC WIKI

BOINCing since 2002/12/8
ID: 15398
River~~

Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15408 - Posted: 9 Nov 2006, 20:12:58 UTC - in response to Message 15398.  

Also beyond a certain number of projects the queue will not matter in most cases. The min queue will always be 1 task per project until NWF kicks in.


If it works the way the old system worked (but adjusted to hold only Share * Queue), then the requested queue size will still determine when the client asks for more work.

Say you have a 0.5 day cache, a Rosetta task that runs for 1 day, and five projects with equal resources.

On the old system it would ask for a new Rosetta task when it looked like Rosetta had less than 0.5 days to run.

On the new system I am anticipating that it would wait until the Rosetta task had 0.1 days to run (i.e. Share * Queue), though I have not seen this in practice yet.
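
A quick sketch of that comparison in Python (the "new system" figure is only my expectation, as I say):

cache_days = 0.5
n_projects = 5
share = 1.0 / n_projects           # equal resource shares

old_trigger = cache_days           # old: fetch when Rosetta has < 0.5 days left
new_trigger = cache_days * share   # new (expected): fetch at < 0.1 days left
print("old trigger:", old_trigger, "days   new trigger:", new_trigger, "days")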

Either way, of course, this is based on the illusion that the app correctly reports % complete (on Rosetta it sticks for a while, then jumps to the right figure, then sticks there for a while; other projects have their own idiosyncratic ways of reporting it).

R~~
ID: 15408
Morgan the Gold
Joined: 18 Sep 04
Posts: 38
Credit: 173,867
RAC: 0
Message 15409 - Posted: 10 Nov 2006, 0:04:18 UTC - in response to Message 15367.  
Last modified: 10 Nov 2006, 0:08:15 UTC

a sensible idea imho

PovAddict wrote:
There are users with 10-day caches trying to grab all they can. What about limiting the cache server-side? On my project, I limited users to have at most 3 workunits in progress (per CPU), and I could get all work done much faster, plus all users got something. On LHC, I can see many users don't get anything (because a few get all) and also I can see "Workunits in progress ~10000" for quite a long time. Scientists have to wait weeks to get work finished, while there are lots of computers "idle" in LHC.


Two of my five connected and highly indebted machines got work; of the rest:
    * one was set to no new work because of SZTAKI
    * two had no excuse for not getting work

My 'backoff' is < 4 hours, LHC is set the same as Pirates and SIMAP, and I get WUs on all boxes when they 'burst'.


ID: 15409
Conan
Joined: 6 Jul 06
Posts: 108
Credit: 663,175
RAC: 0
Message 15410 - Posted: 10 Nov 2006, 0:59:12 UTC

No work since the 2/11. Have been trying lots but keep missing.
10,000 WUs in the last lot means 2,000 workunits times 5 replications, so with only 2,000 out there, no wonder I missed out.
ID: 15410