Message boards : Number crunching : can't download
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3

AuthorMessage
John McLeod VII
Avatar

Send message
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 11903 - Posted: 13 Jan 2006, 14:09:05 UTC - in response to Message 11901.  


Giving an excess share to the project in trouble is one approach, but it still does not handle all cases properly. It fails spectacularly when ...

Have you seen my suggestion about ...

I actually implemented it, and it failed spectacularly all of the time...

doh!

yes of course it would do that - its effectively looking to run the longest first and that is inherently destabilising... must think some more about this

Maybe I am dense, but, IN THIS CASE, should not the "tie-breaker" be the fact that the one work unit has been started?

==== Edit

The other question is why can't the system derive a "execution plan" and SAVE it, rather than recalculating everything from scratch each switch time. I mean, if we had an execution plan, the work units would be "slotted" and when time came to switch back, the work in progress would be continued.

But it is always looking at the result with the longest run time left. EDF will switch away from a running result if needed.


BOINC WIKI
ID: 11903 · Report as offensive     Reply Quote
Profile Paul D. Buck

Send message
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 11904 - Posted: 13 Jan 2006, 14:53:44 UTC - in response to Message 11902.  
Last modified: 13 Jan 2006, 14:54:10 UTC

The obvious problem with this is that things change. If every unit was a CPDN unit then an execution plan might make sense. Failures in computation, completion of computations, and variable length results (which project does that, I wonder) all change the circumstances. The execution plan would have to be recalculated at each termination and at each switch. It seems pretty much like we have already.

No, not necessarily...

If the work is laid out ... and things are added/changed they yes, the plan needs to be REVIEWED. But, right now we always start with a clean sheet of paper. I am simply saying that if we stop throwing away the prior plan we might be better off.

When new work is downloaded, try to add it to the end of the plan, if that does not work, then begin to perturb it. But, preference should still remain with work already started, and to minimize the total change to what you started with before ...

Well, not a big issue to me either way. When I get the change to stop rescheduing all CPUs are the drop of a hat, I will likely go to a 6-24 hour switch time so that work is run to completion to the greatest extent possible and only then will the minimal change be made ...

I hate to have 75 results in various stages of completion at all times ...
ID: 11904 · Report as offensive     Reply Quote
John McLeod VII
Avatar

Send message
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 11905 - Posted: 13 Jan 2006, 15:12:54 UTC - in response to Message 11904.  

The obvious problem with this is that things change. If every unit was a CPDN unit then an execution plan might make sense. Failures in computation, completion of computations, and variable length results (which project does that, I wonder) all change the circumstances. The execution plan would have to be recalculated at each termination and at each switch. It seems pretty much like we have already.

No, not necessarily...

If the work is laid out ... and things are added/changed they yes, the plan needs to be REVIEWED. But, right now we always start with a clean sheet of paper. I am simply saying that if we stop throwing away the prior plan we might be better off.

When new work is downloaded, try to add it to the end of the plan, if that does not work, then begin to perturb it. But, preference should still remain with work already started, and to minimize the total change to what you started with before ...

Well, not a big issue to me either way. When I get the change to stop rescheduing all CPUs are the drop of a hat, I will likely go to a 6-24 hour switch time so that work is run to completion to the greatest extent possible and only then will the minimal change be made ...

I hate to have 75 results in various stages of completion at all times ...

Things that would need to change the schedule you are proposing.
The obvious ones first.
1) New work downloaded.
2) Work completed.

Next a few that are not quite as obvious.
3) Work that is taking longer than it was supposed to based on the initial estimates (there was one project that for a while handed out work that took 800 times as long as the initial estimate).
4) Work that is apparently finishing early.
5) Computer off for a while.
6) BOINC off for a while.

Now for a not so obvious one.
7) Changing processor load.

I have probably missed a few.

All in all, we have to re-calculate frequently. The plans that were valid when the work was downloaded quickly become obsolete.


BOINC WIKI
ID: 11905 · Report as offensive     Reply Quote
River~~

Send message
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 11907 - Posted: 13 Jan 2006, 17:53:05 UTC - in response to Message 11905.  


All in all, we have to re-calculate frequently. The plans that were valid when the work was downloaded quickly become obsolete.


One of the problems, as revealed by the spectacular failure of my algorithm, is that the plans change in response to small balances of advantage, which may well swap back the other way next time. We need to respond to changing circumstances but we also need to build in some hysteresis as well.

Thinking "out loud" here, but I wonder if something like the following would work?


When we calculate a new plan we try for the best possible plan (whatever we finally think that means) and keep it.

Firstly: we check the saved plan, and if it still looks like it will get everything in by the deadlines-10%, we don't switch.

If not, give it a penalty score = number of deadlines missed * estimated cpu time of those wu when complete (this is a lower = better score that reflects the lost work if the wu is rejected for being late). Give the running plan a 'discount of maybe 25% or 50% of its penalty score.

Calculate a new plan and score that, but without the discount - this new plan is adopted only if the (undiscounted) new penalty is less than the (discounted) old penalty.

The old-timers advantage is done this way so that in future a range of young-upstart plans could be scored, and the plan with the lowest score is adopted. The discount builds in a certain amount of hysteresis, as required.


When first going into panic mode, round robin is the 'old-timer', and we would not abandon that unless one of the alternatives gave significantly less penalty score than just carrying on with RR.

Further, round robin includes knowledge of what was running at the time, so two RR schedules can be created and scored each time a reschedule is needed, one (with a discount) keeps all previous running tasks in cpu and only applying true RR once those have finished, the other (without discount) allows tasks to be pre-empted.

This would tend to give CPDN long runs at the cpu, and then long rests, where "long" would depend on the deadlines of the other projects. If you had a 4-cpu box and gave CPDN a 25% share, it might be stable enough to keep CPDN in just one cpu but full time. Or it might once again do something silly I haven't thought of...

River~~
ID: 11907 · Report as offensive     Reply Quote
John McLeod VII
Avatar

Send message
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 11910 - Posted: 13 Jan 2006, 19:58:48 UTC

The basic problem is choosing the "correct" thing to do at any time. You or I could see at a glance what needs to be done, but the computer does not have that much smarts. Unfortunately on a single CPU system there are n! possible ways of ordering the results for the processor, and the problem gets worse for larger numbers of CPUs. For fairly small numbers of results, picking an ordering out of this set is computationally impossible. (for example with 35 results there are about 1 * 10^40 different orderings - a 1 with 40 0's behind it for those not familiar with scientific notation). Therefore it is required that the programmers pick a very small number of well specified algorythms for the selection of a result to run. And a well specified algorythm for determining which to use.

Currently those are RR and EDF. The method for determining which to use is to simulate RR and see if it fails.



BOINC WIKI
ID: 11910 · Report as offensive     Reply Quote
River~~

Send message
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 11947 - Posted: 14 Jan 2006, 16:11:47 UTC - in response to Message 11910.  


Currently those are RR and EDF. The method for determining which to use is to simulate RR and see if it fails.


Agreed. This is a sensible start to the problem.

Agreed that we cannot reasonably expect to assess every ordering - but to sample a few likely methods and choose the best is still well within reach.

No solution will ever be perfect - if it was it would take longer to calculate than the wu themselves! So I agree the process of refinement has to stop at some point. My own feeling tho is still that it is sensible to try to refine it a little more than it is at present.

The next steps, as I see them, are firstly to work out some kind of scoring to make a choice between RR and EDF when they both fail. At present the algorithm switches to EDF even in those cases where RR fails less badly than EDF!

Secondly, once a scoring algorithm is adopted, a discount for continuing the status quo will prevent the kind of 'flapping' that we currently see - the essentiual point Paul has been making for a while. All flexible algorithms come up abainst this issue, there is 'anti'flapping' code in internet routers for example to prevent them changing routes for trivial or transient advantages while allowing them still to re-route for a large or long-lived advantage.

On a more prosaic level, we have springs on car axles to keep the wheels on the ground when the car goes over a bump. However without shock absorbers the springs cause chaos. Both responsivity and inertia are needed to get past the very simplest designs in any technology.

River~~
ID: 11947 · Report as offensive     Reply Quote
John McLeod VII
Avatar

Send message
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 12001 - Posted: 15 Jan 2006, 2:08:35 UTC - in response to Message 11947.  


Currently those are RR and EDF. The method for determining which to use is to simulate RR and see if it fails.


Agreed. This is a sensible start to the problem.

Agreed that we cannot reasonably expect to assess every ordering - but to sample a few likely methods and choose the best is still well within reach.

No solution will ever be perfect - if it was it would take longer to calculate than the wu themselves! So I agree the process of refinement has to stop at some point. My own feeling tho is still that it is sensible to try to refine it a little more than it is at present.

The next steps, as I see them, are firstly to work out some kind of scoring to make a choice between RR and EDF when they both fail. At present the algorithm switches to EDF even in those cases where RR fails less badly than EDF!

Secondly, once a scoring algorithm is adopted, a discount for continuing the status quo will prevent the kind of 'flapping' that we currently see - the essentiual point Paul has been making for a while. All flexible algorithms come up abainst this issue, there is 'anti'flapping' code in internet routers for example to prevent them changing routes for trivial or transient advantages while allowing them still to re-route for a large or long-lived advantage.

On a more prosaic level, we have springs on car axles to keep the wheels on the ground when the car goes over a bump. However without shock absorbers the springs cause chaos. Both responsivity and inertia are needed to get past the very simplest designs in any technology.

River~~

Well, the anti-thrashing code is just to let the results run until the next potential task switch. Then choose which result(s) to run next.


BOINC WIKI
ID: 12001 · Report as offensive     Reply Quote
River~~

Send message
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 12032 - Posted: 15 Jan 2006, 13:59:57 UTC - in response to Message 12001.  


Well, the anti-thrashing code is just to let the results run until the next potential task switch. Then choose which result(s) to run next.


No, it runs till the next *actual* switch which may be much closer than the switch interval, and on a multi-cpu box often is.

My point, and Paul's if I understand him correctly, is that that is still thrashing and on a timescale of much less than 1 hour. On a multi cpu box you get several switches per hour - all of them switch when anything happens to any of them).

Even when you stick with RR you very often get lots of short fragments which take up swap space unnecessarily.

Once you run a short fragment, all the other cpus reschedule at the end of that fragment. This means that if you have fragments of 10min, 20min, 30min left lying around you will almost inevitably be rethinking the strategy every ten minutes or so, whether the switch interval is 60min or 1000min.

By the time you get to complete the 30min fragment some more short fragments will have been left incomplete.

This is just as bad in terms of swap space etc as my spectacular failure was, and for exactly the same reasons
ID: 12032 · Report as offensive     Reply Quote
John McLeod VII
Avatar

Send message
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 12427 - Posted: 26 Jan 2006, 1:40:35 UTC

FWIW, I have submitted some new scheduler code for consideration. I will have to see what happens.


BOINC WIKI
ID: 12427 · Report as offensive     Reply Quote
Profile Lee Carre

Send message
Joined: 13 Jul 05
Posts: 23
Credit: 22,567
RAC: 0
Message 12479 - Posted: 27 Jan 2006, 4:20:02 UTC - in response to Message 12427.  

FWIW, I have submitted some new scheduler code for consideration. I will have to see what happens.

what kind of changes were made? (for those, like myself, who are interested in such matters)
ID: 12479 · Report as offensive     Reply Quote
John McLeod VII
Avatar

Send message
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 12482 - Posted: 27 Jan 2006, 4:47:23 UTC - in response to Message 12479.  

FWIW, I have submitted some new scheduler code for consideration. I will have to see what happens.

what kind of changes were made? (for those, like myself, who are interested in such matters)

A few.

1) an attempt to distinguish between always on connections and sometimes disconnected connections.

2) Removal of the 2 * queue size. Replaced with a computational deadline which is: report deadline - (1 day + queue size + project switch time) for sometimes disconnected clients, and report deadline - (1 day + project switch time) for always connected clients.

3) More emphasis on attempting to keep the queue full for sometimes disconnected clients.

4) For the global work needed the code was changed from queue size * number of cpus - work on hand to an EDF simulation for all CPUs. i.e. start placing the results in individual lists for each CPU in EDF order (only for accounting, not for CPU scheduling). The global work need is the sum of all CPUs that do not have queue size amount of work on hand. This prevents a single CPDN result from showing all of the CPUs on the system as satisfied.

5) Modification of EDF. Instead of EDF, it picks the project with the result of the earliest deadline where the project has a result that is close to or over the computational deadline. This is to allow a CPDN result to get one CPU on a dual CPU system if the CPDN result will take the remaining time to deadline to process even though there are 2 shorter deadline results (say LHC) that are not in danger of going over the deadline. One of these will get the other CPU.

6) In some cases, a single CPU will switch projects leaving the other CPUs running their current work. For example, one CPU is running CPDN result and should run it for an hour. The other CPU is running a Pirates result and will only take 5 minutes. When the Pirates result is done, that CPU alone will be rescheduled, and the CPDN result will continue on for the remainder of the hour. This is also preparation for allowing processes to wait for checkpoints.

I have a feeling that I am forgetting something that I did, but for now, that is what I can remember.

We shall have to wait and see what actually gets adopted.


BOINC WIKI
ID: 12482 · Report as offensive     Reply Quote
Profile Lee Carre

Send message
Joined: 13 Jul 05
Posts: 23
Credit: 22,567
RAC: 0
Message 12496 - Posted: 27 Jan 2006, 11:42:30 UTC - in response to Message 12482.  

FWIW, I have submitted some new scheduler code for consideration. I will have to see what happens.

what kind of changes were made? (for those, like myself, who are interested in such matters)

A few.

[snip]

I have a feeling that I am forgetting something that I did, but for now, that is what I can remember.

We shall have to wait and see what actually gets adopted.

sounds great, all in-demand things :)
ID: 12496 · Report as offensive     Reply Quote
River~~

Send message
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 12499 - Posted: 27 Jan 2006, 13:41:57 UTC - in response to Message 12482.  
Last modified: 27 Jan 2006, 13:44:14 UTC

Thanks John, I especially appreciate this one


5) Modification of EDF. Instead of EDF, it picks the project with the result of the earliest deadline where the project has a result that is close to or over the computational deadline...


and Paul can't have seen this post yet, as I'm sure I will hear his shouts of joy from the other side of the Pond when he reads


6) In some cases, a single CPU will switch projects leaving the other CPUs running their current work...


Let's hope all your work meets approval - as Lee says each one is in demand from someone

edits : speeling

ID: 12499 · Report as offensive     Reply Quote
John McLeod VII
Avatar

Send message
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 12505 - Posted: 27 Jan 2006, 15:19:23 UTC - in response to Message 12499.  

Thanks John, I especially appreciate this one

and Paul can't have seen this post yet, as I'm sure I will hear his shouts of joy from the other side of the Pond when he reads


6) In some cases, a single CPU will switch projects leaving the other CPUs running their current work...


Let's hope all your work meets approval - as Lee says each one is in demand from someone

edits : speeling

Paul helped test. I don't have any multiple CPU systems that are fast enough to see this problem.


BOINC WIKI
ID: 12505 · Report as offensive     Reply Quote
Profile Paul D. Buck

Send message
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 12520 - Posted: 27 Jan 2006, 22:35:37 UTC

Yes, so far so good. THough I still have it only running on one 4 CPU system. The code that John did seems to be fine. Though I am having some other issues:

1) Attach wizard may have corrupted my Predictor@Home account to the point where my e-mail address does not exist and there is no way to recover the account ... 171,000 cobblestones down the tubes ... I have sent a note to PPAH, but, without an account I cannot post on the boards, and they have not acknowledged the post ... I did post a note about this to the dev forum.

2) There is a crash that seems to be consistent after benchmarks are run. Has happened on two different systems so far. I am waiting to see if it happens again on the Xeon ... and the benchmark numbers are wrong. Of course this is a compile with debug turned on so that may not be unexpected. Well, it did not crash on a forced run of the benchmarks ... but it sure does hammer the numbers! I can fix that though, I saved an old client state file and copied back the old numbers ...


ID: 12520 · Report as offensive     Reply Quote
John McLeod VII
Avatar

Send message
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 12521 - Posted: 27 Jan 2006, 22:44:09 UTC - in response to Message 12520.  

Yes, so far so good. THough I still have it only running on one 4 CPU system. The code that John did seems to be fine. Though I am having some other issues:

1) Attach wizard may have corrupted my Predictor@Home account to the point where my e-mail address does not exist and there is no way to recover the account ... 171,000 cobblestones down the tubes ... I have sent a note to PPAH, but, without an account I cannot post on the boards, and they have not acknowledged the post ... I did post a note about this to the dev forum.

2) There is a crash that seems to be consistent after benchmarks are run. Has happened on two different systems so far. I am waiting to see if it happens again on the Xeon ... and the benchmark numbers are wrong. Of course this is a compile with debug turned on so that may not be unexpected. Well, it did not crash on a forced run of the benchmarks ... but it sure does hammer the numbers! I can fix that though, I saved an old client state file and copied back the old numbers ...


Is it only after the benchmarks run? Will it work if the benchmarks do not run?


BOINC WIKI
ID: 12521 · Report as offensive     Reply Quote
Profile Paul D. Buck

Send message
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 12534 - Posted: 28 Jan 2006, 5:31:54 UTC
Last modified: 28 Jan 2006, 5:38:59 UTC

Yes, still runs fine otherwise. And it does not seem to be everytime the benchmarks run. So, I am still looking to find the exact sequence. But, at the moment, I hesitate to risk more than one system as I am not much into thrill rides... :)

Since we don't seem to be making any progress to getting the code checked in (or at least it was not checked in last I looked this afternoon ...), I don't suppose there is any hurry about anything.

If you have the time, maybe a compile without the debug code since it seems stable otherwise...

I think they also have a problem in the attach wizard as I now have a hammered account. What is worse it is with PPAH and they are about the most unresponsive project going. :(

Well, what is 171,000 cobblestones ...

==== edit
There is some GOOD news though ... Frys had a sale on ... and Nancy said yes ... $1300 later I changed out a P4 2.8 GHz for a dual core AMD 4400 with 1M L2 per CPU ... the benchmarks I got the first time stink a little, so I will have to try them again when I get done with all the patches ...

Oh, not the best MB, Asus A8N32-SLI, 1G Corsair 4200 2-2-2-5 memory ... so, not the best, but not too bad ... at the moment BOINC View says i went from 9 CS/hr to 25 CS/hr ... (i had hoped to see 30 based on an extrapolation of my single core AMD 3500 w/ 1G cache).

Too early to tell for sure how well it will do with work though ... by way of comparison, for what it is worth, the projected speed is roughly the same as my dual Xeon 3.4 GHz
ID: 12534 · Report as offensive     Reply Quote
John McLeod VII
Avatar

Send message
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 12535 - Posted: 28 Jan 2006, 5:39:30 UTC - in response to Message 12534.  

Yes, still runs fine otherwise. And it does not seem to be everytime the benchmarks run. So, I am still looking to find the exact sequence. But, at the moment, I hesitate to risk more than one system as I am not much into thrill rides... :)

Since we don't seem to be making any progress to getting the code checked in (or at least it was not checked in last I looked this afternoon ...), I don't suppose there is any hurry about anything.

If you have the time, maybe a compile without the debug code since it seems stable otherwise...

I think they also have a problem in the attach wizard as I now have a hammered account. What is worse it is with PPAH and they are about the most unresponsive project going. :(

Well, what is 171,000 cobblestones ...

The code has been checked in.

However, it includes a major bug submitted by someone else as a modification to the LTD calculations. Based on the code checked in, I generated a pair of scanarios where a work fetch from one project and a work complete from a second (requires a third project to make it interesting) were done in a different order, and the LTDs were thousands of seconds different.


BOINC WIKI
ID: 12535 · Report as offensive     Reply Quote
Profile Paul D. Buck

Send message
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 12540 - Posted: 28 Jan 2006, 6:22:52 UTC - in response to Message 12535.  

The code has been checked in.

However, it includes a major bug submitted by someone else as a modification to the LTD calculations. Based on the code checked in, I generated a pair of scanarios where a work fetch from one project and a work complete from a second (requires a third project to make it interesting) were done in a different order, and the LTDs were thousands of seconds different.

Oh, my ... well, that is not good ...
ID: 12540 · Report as offensive     Reply Quote
Previous · 1 · 2 · 3

Message boards : Number crunching : can't download


©2025 CERN