Message boards :
Number crunching :
can't download
Message board moderation
Previous · 1 · 2 · 3
Author | Message |
---|---|
Send message Joined: 2 Sep 04 Posts: 165 Credit: 146,925 RAC: 0 |
But it is always looking at the result with the longest run time left. EDF will switch away from a running result if needed. BOINC WIKI |
Send message Joined: 2 Sep 04 Posts: 545 Credit: 148,912 RAC: 0 |
The obvious problem with this is that things change. If every unit was a CPDN unit then an execution plan might make sense. Failures in computation, completion of computations, and variable length results (which project does that, I wonder) all change the circumstances. The execution plan would have to be recalculated at each termination and at each switch. It seems pretty much like we have already. No, not necessarily... If the work is laid out ... and things are added/changed they yes, the plan needs to be REVIEWED. But, right now we always start with a clean sheet of paper. I am simply saying that if we stop throwing away the prior plan we might be better off. When new work is downloaded, try to add it to the end of the plan, if that does not work, then begin to perturb it. But, preference should still remain with work already started, and to minimize the total change to what you started with before ... Well, not a big issue to me either way. When I get the change to stop rescheduing all CPUs are the drop of a hat, I will likely go to a 6-24 hour switch time so that work is run to completion to the greatest extent possible and only then will the minimal change be made ... I hate to have 75 results in various stages of completion at all times ... |
Send message Joined: 2 Sep 04 Posts: 165 Credit: 146,925 RAC: 0 |
The obvious problem with this is that things change. If every unit was a CPDN unit then an execution plan might make sense. Failures in computation, completion of computations, and variable length results (which project does that, I wonder) all change the circumstances. The execution plan would have to be recalculated at each termination and at each switch. It seems pretty much like we have already. Things that would need to change the schedule you are proposing. The obvious ones first. 1) New work downloaded. 2) Work completed. Next a few that are not quite as obvious. 3) Work that is taking longer than it was supposed to based on the initial estimates (there was one project that for a while handed out work that took 800 times as long as the initial estimate). 4) Work that is apparently finishing early. 5) Computer off for a while. 6) BOINC off for a while. Now for a not so obvious one. 7) Changing processor load. I have probably missed a few. All in all, we have to re-calculate frequently. The plans that were valid when the work was downloaded quickly become obsolete. BOINC WIKI |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
One of the problems, as revealed by the spectacular failure of my algorithm, is that the plans change in response to small balances of advantage, which may well swap back the other way next time. We need to respond to changing circumstances but we also need to build in some hysteresis as well. Thinking "out loud" here, but I wonder if something like the following would work? When we calculate a new plan we try for the best possible plan (whatever we finally think that means) and keep it. Firstly: we check the saved plan, and if it still looks like it will get everything in by the deadlines-10%, we don't switch. If not, give it a penalty score = number of deadlines missed * estimated cpu time of those wu when complete (this is a lower = better score that reflects the lost work if the wu is rejected for being late). Give the running plan a 'discount of maybe 25% or 50% of its penalty score. Calculate a new plan and score that, but without the discount - this new plan is adopted only if the (undiscounted) new penalty is less than the (discounted) old penalty. The old-timers advantage is done this way so that in future a range of young-upstart plans could be scored, and the plan with the lowest score is adopted. The discount builds in a certain amount of hysteresis, as required. When first going into panic mode, round robin is the 'old-timer', and we would not abandon that unless one of the alternatives gave significantly less penalty score than just carrying on with RR. Further, round robin includes knowledge of what was running at the time, so two RR schedules can be created and scored each time a reschedule is needed, one (with a discount) keeps all previous running tasks in cpu and only applying true RR once those have finished, the other (without discount) allows tasks to be pre-empted. This would tend to give CPDN long runs at the cpu, and then long rests, where "long" would depend on the deadlines of the other projects. If you had a 4-cpu box and gave CPDN a 25% share, it might be stable enough to keep CPDN in just one cpu but full time. Or it might once again do something silly I haven't thought of... River~~ |
Send message Joined: 2 Sep 04 Posts: 165 Credit: 146,925 RAC: 0 |
The basic problem is choosing the "correct" thing to do at any time. You or I could see at a glance what needs to be done, but the computer does not have that much smarts. Unfortunately on a single CPU system there are n! possible ways of ordering the results for the processor, and the problem gets worse for larger numbers of CPUs. For fairly small numbers of results, picking an ordering out of this set is computationally impossible. (for example with 35 results there are about 1 * 10^40 different orderings - a 1 with 40 0's behind it for those not familiar with scientific notation). Therefore it is required that the programmers pick a very small number of well specified algorythms for the selection of a result to run. And a well specified algorythm for determining which to use. Currently those are RR and EDF. The method for determining which to use is to simulate RR and see if it fails. BOINC WIKI |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
Agreed. This is a sensible start to the problem. Agreed that we cannot reasonably expect to assess every ordering - but to sample a few likely methods and choose the best is still well within reach. No solution will ever be perfect - if it was it would take longer to calculate than the wu themselves! So I agree the process of refinement has to stop at some point. My own feeling tho is still that it is sensible to try to refine it a little more than it is at present. The next steps, as I see them, are firstly to work out some kind of scoring to make a choice between RR and EDF when they both fail. At present the algorithm switches to EDF even in those cases where RR fails less badly than EDF! Secondly, once a scoring algorithm is adopted, a discount for continuing the status quo will prevent the kind of 'flapping' that we currently see - the essentiual point Paul has been making for a while. All flexible algorithms come up abainst this issue, there is 'anti'flapping' code in internet routers for example to prevent them changing routes for trivial or transient advantages while allowing them still to re-route for a large or long-lived advantage. On a more prosaic level, we have springs on car axles to keep the wheels on the ground when the car goes over a bump. However without shock absorbers the springs cause chaos. Both responsivity and inertia are needed to get past the very simplest designs in any technology. River~~ |
Send message Joined: 2 Sep 04 Posts: 165 Credit: 146,925 RAC: 0 |
Well, the anti-thrashing code is just to let the results run until the next potential task switch. Then choose which result(s) to run next. BOINC WIKI |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
No, it runs till the next *actual* switch which may be much closer than the switch interval, and on a multi-cpu box often is. My point, and Paul's if I understand him correctly, is that that is still thrashing and on a timescale of much less than 1 hour. On a multi cpu box you get several switches per hour - all of them switch when anything happens to any of them). Even when you stick with RR you very often get lots of short fragments which take up swap space unnecessarily. Once you run a short fragment, all the other cpus reschedule at the end of that fragment. This means that if you have fragments of 10min, 20min, 30min left lying around you will almost inevitably be rethinking the strategy every ten minutes or so, whether the switch interval is 60min or 1000min. By the time you get to complete the 30min fragment some more short fragments will have been left incomplete. This is just as bad in terms of swap space etc as my spectacular failure was, and for exactly the same reasons |
Send message Joined: 2 Sep 04 Posts: 165 Credit: 146,925 RAC: 0 |
FWIW, I have submitted some new scheduler code for consideration. I will have to see what happens. BOINC WIKI |
Send message Joined: 13 Jul 05 Posts: 23 Credit: 22,567 RAC: 0 |
FWIW, I have submitted some new scheduler code for consideration. I will have to see what happens. what kind of changes were made? (for those, like myself, who are interested in such matters) |
Send message Joined: 2 Sep 04 Posts: 165 Credit: 146,925 RAC: 0 |
FWIW, I have submitted some new scheduler code for consideration. I will have to see what happens. A few. 1) an attempt to distinguish between always on connections and sometimes disconnected connections. 2) Removal of the 2 * queue size. Replaced with a computational deadline which is: report deadline - (1 day + queue size + project switch time) for sometimes disconnected clients, and report deadline - (1 day + project switch time) for always connected clients. 3) More emphasis on attempting to keep the queue full for sometimes disconnected clients. 4) For the global work needed the code was changed from queue size * number of cpus - work on hand to an EDF simulation for all CPUs. i.e. start placing the results in individual lists for each CPU in EDF order (only for accounting, not for CPU scheduling). The global work need is the sum of all CPUs that do not have queue size amount of work on hand. This prevents a single CPDN result from showing all of the CPUs on the system as satisfied. 5) Modification of EDF. Instead of EDF, it picks the project with the result of the earliest deadline where the project has a result that is close to or over the computational deadline. This is to allow a CPDN result to get one CPU on a dual CPU system if the CPDN result will take the remaining time to deadline to process even though there are 2 shorter deadline results (say LHC) that are not in danger of going over the deadline. One of these will get the other CPU. 6) In some cases, a single CPU will switch projects leaving the other CPUs running their current work. For example, one CPU is running CPDN result and should run it for an hour. The other CPU is running a Pirates result and will only take 5 minutes. When the Pirates result is done, that CPU alone will be rescheduled, and the CPDN result will continue on for the remainder of the hour. This is also preparation for allowing processes to wait for checkpoints. I have a feeling that I am forgetting something that I did, but for now, that is what I can remember. We shall have to wait and see what actually gets adopted. BOINC WIKI |
Send message Joined: 13 Jul 05 Posts: 23 Credit: 22,567 RAC: 0 |
FWIW, I have submitted some new scheduler code for consideration. I will have to see what happens. sounds great, all in-demand things :) |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
Thanks John, I especially appreciate this one
and Paul can't have seen this post yet, as I'm sure I will hear his shouts of joy from the other side of the Pond when he reads
Let's hope all your work meets approval - as Lee says each one is in demand from someone edits : speeling |
Send message Joined: 2 Sep 04 Posts: 165 Credit: 146,925 RAC: 0 |
Thanks John, I especially appreciate this one Paul helped test. I don't have any multiple CPU systems that are fast enough to see this problem. BOINC WIKI |
Send message Joined: 2 Sep 04 Posts: 545 Credit: 148,912 RAC: 0 |
Yes, so far so good. THough I still have it only running on one 4 CPU system. The code that John did seems to be fine. Though I am having some other issues: 1) Attach wizard may have corrupted my Predictor@Home account to the point where my e-mail address does not exist and there is no way to recover the account ... 171,000 cobblestones down the tubes ... I have sent a note to PPAH, but, without an account I cannot post on the boards, and they have not acknowledged the post ... I did post a note about this to the dev forum. 2) There is a crash that seems to be consistent after benchmarks are run. Has happened on two different systems so far. I am waiting to see if it happens again on the Xeon ... and the benchmark numbers are wrong. Of course this is a compile with debug turned on so that may not be unexpected. Well, it did not crash on a forced run of the benchmarks ... but it sure does hammer the numbers! I can fix that though, I saved an old client state file and copied back the old numbers ... |
Send message Joined: 2 Sep 04 Posts: 165 Credit: 146,925 RAC: 0 |
Yes, so far so good. THough I still have it only running on one 4 CPU system. The code that John did seems to be fine. Though I am having some other issues: Is it only after the benchmarks run? Will it work if the benchmarks do not run? BOINC WIKI |
Send message Joined: 2 Sep 04 Posts: 545 Credit: 148,912 RAC: 0 |
Yes, still runs fine otherwise. And it does not seem to be everytime the benchmarks run. So, I am still looking to find the exact sequence. But, at the moment, I hesitate to risk more than one system as I am not much into thrill rides... :) Since we don't seem to be making any progress to getting the code checked in (or at least it was not checked in last I looked this afternoon ...), I don't suppose there is any hurry about anything. If you have the time, maybe a compile without the debug code since it seems stable otherwise... I think they also have a problem in the attach wizard as I now have a hammered account. What is worse it is with PPAH and they are about the most unresponsive project going. :( Well, what is 171,000 cobblestones ... ==== edit There is some GOOD news though ... Frys had a sale on ... and Nancy said yes ... $1300 later I changed out a P4 2.8 GHz for a dual core AMD 4400 with 1M L2 per CPU ... the benchmarks I got the first time stink a little, so I will have to try them again when I get done with all the patches ... Oh, not the best MB, Asus A8N32-SLI, 1G Corsair 4200 2-2-2-5 memory ... so, not the best, but not too bad ... at the moment BOINC View says i went from 9 CS/hr to 25 CS/hr ... (i had hoped to see 30 based on an extrapolation of my single core AMD 3500 w/ 1G cache). Too early to tell for sure how well it will do with work though ... by way of comparison, for what it is worth, the projected speed is roughly the same as my dual Xeon 3.4 GHz |
Send message Joined: 2 Sep 04 Posts: 165 Credit: 146,925 RAC: 0 |
Yes, still runs fine otherwise. And it does not seem to be everytime the benchmarks run. So, I am still looking to find the exact sequence. But, at the moment, I hesitate to risk more than one system as I am not much into thrill rides... :) The code has been checked in. However, it includes a major bug submitted by someone else as a modification to the LTD calculations. Based on the code checked in, I generated a pair of scanarios where a work fetch from one project and a work complete from a second (requires a third project to make it interesting) were done in a different order, and the LTDs were thousands of seconds different. BOINC WIKI |
Send message Joined: 2 Sep 04 Posts: 545 Credit: 148,912 RAC: 0 |
The code has been checked in. Oh, my ... well, that is not good ... |
©2025 CERN