Did everyone get work 02 Nov UTC?
Author | Message |
---|---|
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
This posting is quite complex. If it looks like more than you want to bother with, then I strongly recommend Mike's two rules in a nearby post. They will work for people with reasonable settings for other parameters and have the huge advantage of being simple to apply.
Philip asked: Is it correct that Resource share is not considered at all in your formulas?
Resource share is not in those formulae, but is also something to think about. And, being me, I do of course have some algebra to offer. But first the background.
If a project has a resource share so low that a cacheful cannot complete within its deadline, the client will respond by periodically banning just that one project, without preventing other projects from getting work. We call this NLTD (negative long term debt). In contrast, the situation we were mentioning before is where the client temporarily bans *all* projects from downloading, called NWF (no work fetch).
NWF is caused when having all the client's caches full at once would mean that deadlines cannot be met. NLTD is caused when the resource share specified for a single project is insufficient to crunch a cacheful in time.
Example of NLTD: You give Rosetta 0.1 of the overall resources, and have a connect interval of 3 days. This means a full cache would take 30 days elapsed, but you have given the client a challenge as the work needs to be back in only 10 days. The client responds by running the Rosetta work on time, then banning just Rosetta for around 20 days to let the other projects catch up with their appointed shares. During those 20 days of NLTD, clicking update on Rosetta will not get any work, but other projects will be downloading work fine. It looks spooky, but the client is in fact doing what you asked, just on a longer timescale.
All the odd behaviour of the client in choosing what to download, and what not to, is due to the interaction of these two different issues, NWF and NLTD.
When the client is behaving oddly it is because it has noticed something you haven't.
To avoid NLTD, then, for each project
(SP / ST) > C / D
where SP is the resource share for that project, ST is the total resource share across all projects on that machine, and C and D are as before. I will leave you to figure out how it works from the example given.
As with my other rule of thumb, in the real world if you get close to the boundary on my rule you will find periods of a few hours where you do drop into NLTD. As before, this is because of variation in the run lengths of tasks and because the client grabs slightly more than a cacheful.
There is another issue to consider on resource share. We don't mind if LHC is pushed into NLTD after a period of work, as we got our cacheful and crunched it. So the formula we just got applies when regular work is available, but not for LHC in the current situation. However, we *do* mind if LHC is still in NLTD when the next work is available. If we miss it there might not be any for weeks!
John's advice is to give LHC a double resource share compared to your typical other projects. LHC will run, pay off any debt quickly, and be ready to run again next time there is work. Because this project has a double share, there are N+1 shares to allocate, so
(SP / ST) > 2 / (N+1)
There is no need to share out the other projects equally if you don't want to. If one or more of them go into NLTD that will not harm LHC.
Don't give LHC too big a resource share either, or other projects will refuse to give your client work until the box is totally empty. This is where you get the "Won't finish in time" message. My suggestion is not more than a triple share.
So we end up with four formulae
C < D / (N+2); # avoid NWF
(SP / ST) < 3 / (N+2); # let other projects finish
and when work is fairly continuous I suggest
(SP / ST) > C / D; # avoid NLTD
else when work is intermittent I adopt John's suggestion
(SP / ST) > 2 / (N+1); # recover fast from NLTD
Hope that helps.
If not then I again recommend Mike's advice instead. R~~ |
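As a rough illustration of the four rules of thumb in the post above, here is a minimal Python sketch. It is not taken from the BOINC client. The function and parameter names are made up for this example, C is assumed to be the cache (connect interval) in days, D the project's deadline in days, and N the number of attached projects. The value N = 5 in the usage lines is an arbitrary choice, since the post does not give one.

```python
def full_cache_elapsed_days(cache_days, share, total_share):
    """Elapsed days needed to crunch one cacheful at this resource share."""
    return cache_days * total_share / share


def check_resource_share(share, total_share, cache_days, deadline_days,
                         n_projects, intermittent=False):
    """Evaluate the rules of thumb; every name here is illustrative only."""
    fraction = share / total_share  # SP / ST
    if intermittent:
        # John's suggestion for projects with intermittent work.
        lower_bound = 2 / (n_projects + 1)
    else:
        # Rule for projects whose work is fairly continuous.
        lower_bound = cache_days / deadline_days
    return {
        "avoids_nwf": cache_days < deadline_days / (n_projects + 2),
        "lets_others_finish": fraction < 3 / (n_projects + 2),
        "avoids_or_recovers_from_nltd": fraction > lower_bound,
    }


# Rosetta example from the post: 0.1 of the resources, 3-day connect
# interval, 10-day deadline. n_projects=5 is an assumed value.
rosetta = check_resource_share(share=10, total_share=100, cache_days=3,
                               deadline_days=10, n_projects=5)
rosetta_elapsed = full_cache_elapsed_days(3, 10, 100)
```

With these inputs `rosetta_elapsed` comes out at 30 days against a 10-day deadline, and the NLTD check fails because 0.1 is not greater than 3 / 10, which matches the behaviour described in the example.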
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
I have several projects.
My first suggestion then is to use the formulae with N set to include SZTAKI.
My second suggestion is to set "No Work" from the manager project tab, let the test work run out (or abort it), and then leave the project active but unable to fetch work. The client will be happier than with suspending. Instead of resume, at the appropriate time simply click "Allow new Work". Only use suspend in the unlikely event that there is a reason to prevent a project running that has work on board at the time. Whenever you feel like suspending an empty project then No Work is a better option, imo.
Finally, you mention erratic run lengths, and this is an issue in itself. WUs completing with a runtime that is very different from the estimate cause problems later on, particularly if it affects the majority of tasks. The client "learns" that the project runs short, and will ask for more work to compensate next time around. If the next set of work is more accurate this can overload your cache and cause NLTD and/or NWF next time around. So perversely, if SZTAKI solve that issue it might look to you like something got worse! WUs ending with an error are not so bad, as the client does not try to learn from them.
So my third suggestion, to prevent issues from the last test disrupting the current test, is to consider resetting the project just after the work runs out each time. You are asking the client to forget, and it is a matter of human judgment whether the good outweighs the bad in the client's recent experience of SZTAKI. If the bad outweighs the good, then reset. If the good outweighs the bad then don't. I have sometimes used reset on LHC for this very reason, see this thread. On SZTAKI the erratic run lengths may be buggy, or may be inherent in the maths. R~~ |
Send message Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0 |
.... So, resource share is "catered for" in the assumptions underlying the formula. And, also explains where the N+1 comes from in the denominator.... |
Send message Joined: 4 Sep 05 Posts: 112 Credit: 2,068,660 RAC: 66 |
.... The cache could be set to 3 hours (1 day / 8 = 0.125 days). Between 0.083.... and 0.125 is 0.1, the default setting. It does just fine. ;-) I was shocked that so many decimal places showed up on the other platforms, I thought it would have been truncated at about four. Click here to join the #1 Aussie Alliance on LHC. |
Send message Joined: 14 Jul 05 Posts: 275 Credit: 49,291 RAC: 0 |
There are users with 10-day caches trying to grab all they can. What about limiting the cache server-side? On my project, I limited users to have at most 3 workunits in progress (per CPU), and I could get all work done much faster, plus all users got something. On LHC, I can see many users don't get anything (because a few get all) and also I can see "Workunits in progress ~10000" for quite a long time. Scientists have to wait weeks to get work finished, while there are lots of computers "idle" in LHC. |
Send message Joined: 26 Nov 05 Posts: 16 Credit: 14,707 RAC: 0 |
River~~ wrote: You give Rosetta 0.1 of the overall resources, and have a connect interval of 3 days. This means a full cache would take 30 days elapsed, ...
Perhaps you should mention that this changed (will change) with Boinc versions > 5.4.x. The current development scheduler will in this case keep a queue of 0.3 days of CPU time for Rosetta, which will happily be crunched in 3 days. Norbert
PS: You'll have to rework all your formulas ;-) |
Send message Joined: 1 Sep 04 Posts: 275 Credit: 2,652,452 RAC: 0 |
No reworking, they all get dumped. We can then go back to simple advice, set the queue to the max you expect your local ISP to be out at any time. Or less if you are attached to CPDN or many projects. BOINC WIKI BOINCing since 2002/12/8 |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
... that this changed (will change) with Boinc versions > 5.4.x. The current development scheduler will in this case keep a queue of 0.3 days of CPU time for Rosetta, which will happily be crunched in 3 days.
Doh!!!! I welcome the change as it makes a lot of sense. For one thing, the amount of work held in large caches will vary from project to project. But it is not quite correct that it will always keep a queue that small. On 5.4.9 (Linux), when the host is unable to connect to other projects it still downloads a full cache from the project it can get to. So new formulae are needed for those who want to guard in case that happens.
The 5.4.9 approach does not protect against an N day outage; it protects only against an outage within N days of the previous connection, which is not the same thing at all. It allows a box to run almost empty in some cases before re-filling to a full cache. That means we now have no way of protecting against an N day outage, except for scheduled outages if we can manually fill up immediately beforehand.
It also means that John's advice (unusually for him) is actually rather misleading. Set the interval to the typical downtimes of your ISP and you will typically run out of work half way through an outage, having worked off half the cache before the ISP went down. Setting double the max expected outage will, on average, cover you for three-quarters of ISP down time, because it covers you completely half the time and covers you partially the other half of the time. There is no "safe" setting under 5.4.9.
There is a long history of trying different things to do with cache sizes. I still say we need to go back to a two-setting system where we set a low tide and a high tide, or low tide and connection interval, but that is an argument for another time and place ;-) R~~ |
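The "half way" and "three-quarters" figures in the post above can be reproduced with a small Python sketch. It rests on one simplifying assumption that is not part of BOINC: at the moment the outage starts, the work on hand is equally likely to be anywhere between empty and a full cache. The function name is invented for this example.

```python
def expected_outage_coverage(cache_days, outage_days):
    """Average fraction of an outage covered by work already on the host.

    Assumption for this sketch: the work on hand when the outage starts
    is uniformly distributed between 0 and cache_days.
    """
    if cache_days >= outage_days:
        # Fully covered when work on hand >= outage length, and half
        # covered on average in the remaining cases.
        return 1 - outage_days / (2 * cache_days)
    # The cache can never span the outage; on average it is half full.
    return cache_days / (2 * outage_days)


same_as_outage = expected_outage_coverage(cache_days=1, outage_days=1)
double_outage = expected_outage_coverage(cache_days=2, outage_days=1)
```

Under that assumption a cache equal to the outage length covers half the outage on average, and a cache of twice the outage length covers three-quarters, in line with the reasoning in the post.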
Send message Joined: 26 Nov 05 Posts: 16 Credit: 14,707 RAC: 0 |
It also means that John's advice (unusually for him) is actually rather misleading. Set the interval to the typical downtimes of your ISP and you will typically run out of work half way through an outage, having worked off half the cache before the ISP went down. I think John is right, if you're "always on". The download scheduler downloads the moment the queue for a project falls below your "connect interval". So per project your queue is always between <connect interval> and <connect interval + 1 WU>. Norbert |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
It also means that John's advice (unusually for him) is actually rather misleading. Set the interval to the typical downtimes of your ISP and you will typically run out of work half way through an outage, having worked off half the cache before the ISP went down.
Then I was totally confused by a combination of one project refusing to send work, and another having a large -ve LTD :( The interactions between the different cases are quite complex. But presumably you mean it is between <(connect intvl) * (resource share)> and <ditto + 1 WU> ? That certainly makes more sense.
I am not sure what it does when a project refuses work, as Rosetta is doing to me sporadically with complaints that my atticware does not have enough memory. Should the shares be recalculated to ignore that project, so that the other projects get more work? As you can tell I am still confused by what the 5.4 clients are doing... R~~ |
Send message Joined: 26 Nov 05 Posts: 16 Credit: 14,707 RAC: 0 |
But presumably you mean it is between <(connect intvl) * (resource share)> and <ditto + 1 WU> ?
It's even a bit more complicated, because "on fraction", "run fraction" and "CPU efficiency" must be (and are) used.
As you can tell I am still confused by what the 5.4 clients are doing...
What confuses me more is that I don't remember if the 5.4 clients already had this download scheduler or if it was introduced some way into 5.5 :-) Norbert |
Send message Joined: 26 Nov 05 Posts: 16 Credit: 14,707 RAC: 0 |
Should the shares be recalculated to ignore that project, so that the other projects get more work?
The projects are ignored as soon as they no longer have any WU on the client. Not quite right, but better than working with the original shares. Norbert |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
But presumably you mean it is between <(connect intvl) * (resource share)> and <ditto + 1 WU> ?
It's even a bit more complicated, because "on fraction", "run fraction" and "CPU efficiency" must be (and are) used.
aha! On Fraction - that explains my boxes' reluctance to pick up work. I run most of my boxes winter only (they heat my lounge nicely), and the boxes that have been turned on recently are reluctant to ask for work. Apart from editing the client state file by hand, is there a way of resetting On Fraction? R~~ |
Send message Joined: 14 Jul 05 Posts: 275 Credit: 49,291 RAC: 0 |
aha! Get/Make a program that does the editing for you...? Or maybe reinstall BOINC. :D |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
Not so. The formulae are designed to stop you going into NWF, or to make sure you only dip in for brief periods (specifically less than 4 hrs at a time, so you don't miss work on LHC). One formula that immediately comes to mind is something like
C << 0.8 * D
In other words, if you set your cache longer than the shortest deadline for any project, then in the event that that project got a cacheful it would always go into NWF. If it gets into the 10% guard band it goes into NWF. So use a 20% guard band to try to avoid NWF. The formula states "much less" because it does not take account of the fact that the NWF tests also include the cache. The real formula may turn out to be something like
C < 0.3 * D
once that is taken into account.
Have the tests for EDF and NWF changed with this change in cache policy? They should have done, as the amount of work expected to be downloaded at the next fetch is rather less (one WU at a time, not a cacheful).
The sting in the tail is that when LHC is empty, the other projects will fill up more. This means that if LHC gets a few WU, suddenly the box thinks it is overcommitted, as LHC's share of the cache is now taken by other projects. I have just seen this happen on one of my clients. It has less than one day's work, but because it has a 2 day cache and the whole of the 1 day held locally is from a project with a 3 day deadline, it won't get any more work from anywhere.
That might mean that we need a more complicated formula re-introducing the project share of the intermittent project. But in the meantime, sticking with the default 0.1 day cache looks like a winner. R~~ |
Send message Joined: 26 Nov 05 Posts: 16 Credit: 14,707 RAC: 0 |
C << 0.8 * D
Make that C < 0.8 * (D - (1 day + switch_int)), because that's the time Boinc tries to send the result back. Norbert |
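Norbert's refinement can be turned into a quick calculation. The following Python sketch simply evaluates the right-hand side of his inequality. The function name is invented, the 0.8 factor is the guard band from River's post, and treating switch_int as a value in minutes with a default of 60 is an assumption made for this example.

```python
def max_cache_days(deadline_days, switch_interval_minutes=60.0, guard=0.8):
    """Upper bound on the cache C from C < guard * (D - (1 day + switch_int)).

    switch_interval_minutes stands in for switch_int; its unit and the
    default of 60 are assumptions made for this sketch.
    """
    switch_interval_days = switch_interval_minutes / (24 * 60)
    return guard * (deadline_days - (1.0 + switch_interval_days))


# Shortest deadline of 3 days across the attached projects.
limit_for_3_day_deadline = max_cache_days(3)
```

For a 3 day deadline this gives a limit of roughly 1.57 days, noticeably tighter than the 2.4 days that the simpler 0.8 * D would suggest.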
Send message Joined: 1 Sep 04 Posts: 275 Credit: 2,652,452 RAC: 0 |
C << 0.8 * D
Make that C < 0.8 * (D - (1 day + switch_int)), because that's the time Boinc tries to send the result back.
Damn. You and River are right, we will still have to deal with formulas to prevent NWF. This should be the most accurate one. EDF is mostly gone though; a better CPU scheduler spreads the extra time needed throughout the time the task is on the host, i.e. if a task needs 10% of the CPU to finish on time it will get that fraction even if its resource share is much lower, rather than getting 100% when the deadline is near.
Also beyond a certain number of projects the queue will not matter in most cases. The min queue will always be 1 task per project until NWF kicks in. BOINC WIKI BOINCing since 2002/12/8 |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
Also beyond a certain number of projects the queue will not matter in most cases. The min queue will always be 1 task per project until NWF kicks in.
If it works how the old system worked (but adjusted to hold only Share * Queue), then the requested queue size will still determine when the client asks for more work. Say you have a 0.5 day cache, a Rosetta task that runs for 1 day, and five projects with equal resources. On the old system it would ask for a new Rosetta task when it looked like Rosetta had less than 0.5 days to run. On the new system I am anticipating that it would wait till the Rosetta task had 0.1 days to run (i.e. Share * Queue), though I have not seen this in practice yet.
Either way, of course, this is based on the illusion that the app correctly reports %complete (on Rosetta it sticks for a while, then jumps to the right figure, then sticks there for a while; other projects have their own idiosyncratic ways of reporting it). R~~ |
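The Share * Queue idea discussed in the last few posts can be sketched in Python as follows. This is only an illustration of the arithmetic, not the client's actual work-fetch logic: the function names are invented, and the sketch ignores "on fraction", "run fraction" and "CPU efficiency", which Norbert notes the real scheduler also uses.

```python
def project_queue_days(queue_days, share, total_share):
    """Per-project slice of the queue: Share * Queue."""
    return queue_days * share / total_share


def would_ask_for_work(remaining_days, queue_days, share, total_share):
    """True once the work left for a project drops below its slice."""
    return remaining_days < project_queue_days(queue_days, share, total_share)


# River's example: 0.5 day cache, five projects with equal resources.
rosetta_slice = project_queue_days(queue_days=0.5, share=1, total_share=5)

# Norbert's earlier example: 3 day connect interval, Rosetta at 0.1 share.
norbert_slice = project_queue_days(queue_days=3, share=10, total_share=100)

# A 1 day Rosetta task with 0.3 days left would not yet trigger a fetch.
fetch_now = would_ask_for_work(0.3, queue_days=0.5, share=1, total_share=5)
```

The two slices come out at about 0.1 and 0.3 days, matching the figures quoted in the posts.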
Send message Joined: 18 Sep 04 Posts: 38 Credit: 173,867 RAC: 0 |
a sensible idea imho PovAddict wrote: There are users with 10-day caches trying to grab all they can. What about limiting the cache server-side? On my project, I limited users to have at most 3 workunits in progress (per CPU), and I could get all work done much faster, plus all users got something. On LHC, I can see many users don't get anything (because a few get all) and also I can see "Workunits in progress ~10000" for quite a long time. Scientists have to wait weeks to get work finished, while there are lots of computers "idle" in LHC. two of five machines connected and indebted highly got work;
*two had no excuse for not getting work
|
Send message Joined: 6 Jul 06 Posts: 108 Credit: 663,175 RAC: 0 |
> No work since the 2/11. Have been trying lots but keep missing. 10000 wu's in the last lot means 2000 times 5 replications, so with only 2000 out there no wonder I missed out. |
©2025 CERN