21) Message boards : Number crunching : can't download (Message 12505)
Posted 27 Jan 2006 by John McLeod VII
Post:
Thanks John, I especially appreciate this one

and Paul can't have seen this post yet, as I'm sure I will hear his shouts of joy from the other side of the Pond when he reads


6) In some cases, a single CPU will switch projects leaving the other CPUs running their current work...


Let's hope all your work meets approval - as Lee says each one is in demand from someone

edits : speeling

Paul helped test. I don't have any multiple CPU systems that are fast enough to see this problem.
22) Message boards : Number crunching : can't download (Message 12482)
Posted 27 Jan 2006 by John McLeod VII
Post:
FWIW, I have submitted some new scheduler code for consideration. I will have to see what happens.

what kind of changes were made? (for those, like myself, who are interested in such matters)

A few.

1) an attempt to distinguish between always on connections and sometimes disconnected connections.

2) Removal of the 2 * queue size. Replaced with a computational deadline which is: report deadline - (1 day + queue size + project switch time) for sometimes disconnected clients, and report deadline - (1 day + project switch time) for always connected clients.

3) More emphasis on attempting to keep the queue full for sometimes disconnected clients.

4) For the global work needed the code was changed from queue size * number of cpus - work on hand to an EDF simulation for all CPUs. i.e. start placing the results in individual lists for each CPU in EDF order (only for accounting, not for CPU scheduling). The global work need is the sum of all CPUs that do not have queue size amount of work on hand. This prevents a single CPDN result from showing all of the CPUs on the system as satisfied.

5) Modification of EDF. Instead of EDF, it picks the project with the result of the earliest deadline where the project has a result that is close to or over the computational deadline. This is to allow a CPDN result to get one CPU on a dual CPU system if the CPDN result will take the remaining time to deadline to process even though there are 2 shorter deadline results (say LHC) that are not in danger of going over the deadline. One of these will get the other CPU.

6) In some cases, a single CPU will switch projects leaving the other CPUs running their current work. For example, one CPU is running a CPDN result and should run it for an hour. The other CPU is running a Pirates result that will only take 5 minutes. When the Pirates result is done, that CPU alone will be rescheduled, and the CPDN result will continue on for the remainder of the hour. This is also preparation for allowing processes to wait for checkpoints.
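
As a rough illustration of item 2, the computational deadline could be worked out along these lines. This is only a Python sketch of the formula as described, not the code that was submitted; the function and parameter names are made up for the example.

```python
from datetime import datetime, timedelta


def computational_deadline(report_deadline, queue_size, switch_time, always_connected):
    """Return the time by which crunching should be finished.

    Follows the formula in item 2:
      always connected:       report deadline - (1 day + project switch time)
      sometimes disconnected: report deadline - (1 day + queue size + project switch time)
    All names here are hypothetical, chosen for illustration.
    """
    margin = timedelta(days=1) + switch_time
    if not always_connected:
        # A sometimes-disconnected host may not be able to report for a
        # whole queue's worth of time, so it gets a larger safety margin.
        margin += queue_size
    return report_deadline - margin


report = datetime(2006, 2, 10)
queue = timedelta(days=2)
switch = timedelta(hours=1)

connected = computational_deadline(report, queue, switch, always_connected=True)
disconnected = computational_deadline(report, queue, switch, always_connected=False)
print(connected, disconnected)
```

With a 2 day queue and a 1 hour switch time, the sometimes disconnected host has to finish 2 days earlier than the always connected one.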

I have a feeling that I am forgetting something that I did, but for now, that is what I can remember.

We shall have to wait and see what actually gets adopted.
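
The difference between the old and new global work need in item 4 can be seen in a small Python sketch. It assumes that each result, taken in deadline order, is put on the list of the CPU with the least work so far; that placement rule and all the names are assumptions for illustration, not taken from the submitted code.

```python
DAY = 86400  # seconds


def old_work_need(results, n_cpus, queue_seconds):
    """Old rule: queue size * number of CPUs - work on hand."""
    on_hand = sum(remaining for _deadline, remaining in results)
    return max(0.0, queue_seconds * n_cpus - on_hand)


def global_work_need(results, n_cpus, queue_seconds):
    """New rule: accounting-only EDF simulation across all CPUs.

    results is a list of (deadline_seconds, cpu_seconds_remaining).
    Each result, in deadline order, goes on the list of the least
    loaded CPU (an assumed placement rule). The need is the total
    shortfall of every CPU that has less than a queue's worth of work.
    """
    per_cpu = [0.0] * n_cpus
    for _deadline, remaining in sorted(results):
        idx = per_cpu.index(min(per_cpu))
        per_cpu[idx] += remaining
    return sum(max(0.0, queue_seconds - load) for load in per_cpu)


# One long CPDN-style result on a dual CPU host with a 1 day queue.
results = [(90 * DAY, 80 * DAY)]
old_need = old_work_need(results, n_cpus=2, queue_seconds=DAY)
new_need = global_work_need(results, n_cpus=2, queue_seconds=DAY)
print(old_need, new_need)
```

Under the old rule the single long result makes both CPUs look satisfied; under the per-CPU accounting the second CPU still shows a full day of work needed.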
23) Message boards : Number crunching : can't download (Message 12427)
Posted 26 Jan 2006 by John McLeod VII
Post:
FWIW, I have submitted some new scheduler code for consideration. I will have to see what happens.
24) Message boards : Number crunching : Work almost done? (Message 12386)
Posted 25 Jan 2006 by John McLeod VII
Post:
I guess what John Keck is seeing is that his computers download enough LHC units to go into EDF and build up a negative debt while there is work, and that the gaps in LHC work allow debt to "rise" back to zero.

Or just maybe he is using an old client - there was at least one release where the debt went on rising past zero when there was no work - a behaviour I preferred but apparently others shouted it down so it got changed (but that is a different argument)

Wrong explanation. I have the latest alpha client (not GR or CPDNBBC). What happens is there are enough other projects that they cause EDF mode from time to time when there is no work here. If this project is not in an active deferral at that time (or the deferral expires before EDF) the LTD will still rise.

Basically, if the host is not asking for more work, it doesn't know that there is none...
25) Message boards : Number crunching : LHC still requires Boinc 4.45? (Message 12344)
Posted 24 Jan 2006 by John McLeod VII
Post:
I am still using 4.68 and have no problems with it. I know that I should upgrade sometime, but when I tried before it lost over 1000 hours on a Climate Sulphur model. Had to restore a backup of 4.68 to get all those hours back.

first computer

second computer

Ray

You really ought to consider upgrading to something a bit newer.

The sequence was 4.0 -> 4.19, 4.50 -> 4.72, 4.20 -> 4.45, 5.1.x.

-----------------------------------------------------------------------

I got a computational error 35% of the way on a CPDN workunit. How do I restore that workunit? I am running Boinc 4.45 on XP. 2.26GHz processor


Basically, you don't. If you had backed up the folder, and unplugged the internet, then you could have restored and possibly retrieved the WU to keep on crunching. Otherwise, not.
26) Message boards : Number crunching : Work almost done? (Message 12332)
Posted 23 Jan 2006 by John McLeod VII
Post:
If LTD is the same as <debt>, then at least 4.19 does not increase it while no work is available. I'm attached to Orbit for quite some time now and never had anything but 0.000 in the <debt> entry. Same for SZTAKI while it didn't give me work, it went to 0.000 and stayed there.

4.19 was before the split between Short Term Debt and Long Term Debt. What was used in 4.19 is Short Term Debt and was called Debt.
27) Message boards : Number crunching : LHC still requires Boinc 4.45? (Message 12331)
Posted 23 Jan 2006 by John McLeod VII
Post:
I am still using 4.68 and have no problems with it. I know that I should upgrade sometime, but when I tried before it lost over 1000 hours on a Climate Sulphur model. Had to restore a backup of 4.68 to get all those hours back.

first computer

second computer

Ray

You really ought to consider upgrading to something a bit newer.

The sequence was 4.0 -> 4.19, 4.50 -> 4.72, 4.20 -> 4.45, 5.1.x.
28) Message boards : Number crunching : EDF oddity (Message 12239)
Posted 21 Jan 2006 by John McLeod VII
Post:
There are a couple of possibilities:

1) You have a large cache (> 3.5 days in this case). This will keep the system in EDF as long as there is LHC work on the system.

2) The time stats indicate that you do not have as much CPU time as you think that you do.
29) Message boards : Number crunching : Work almost done? (Message 12238)
Posted 21 Jan 2006 by John McLeod VII
Post:
Then there is a downtime of a few days / weeks, and more is issued. I suspect that this pattern will continue until the construction is complete. At which point, I hope that we will be given some of the particle tracks to follow.
30) Message boards : Number crunching : Future completion of the LHC (Message 12237)
Posted 21 Jan 2006 by John McLeod VII
Post:
Actually, it would be really nice if LHC could find some way of sending the particle tracks for crunching to the BOINC clients. Some of us are on always on connections of reasonable bandwidth, and would not mind large downloads.
31) Message boards : Number crunching : Loadsa Work (Message 12099)
Posted 17 Jan 2006 by John McLeod VII
Post:
It looks like they are testing an edge again so we will likely see a lot of early endings. But, I also have a few that are running for hours, the longest I have in sight right now is 5 hours ... so, not to despair ... :)

I believe that I had one machine that got about 20 in a row with crunch times < 1 min.
32) Message boards : Number crunching : can't download (Message 12001)
Posted 15 Jan 2006 by John McLeod VII
Post:

Currently those are RR and EDF. The method for determining which to use is to simulate RR and see if it fails.


Agreed. This is a sensible start to the problem.

Agreed that we cannot reasonably expect to assess every ordering - but to sample a few likely methods and choose the best is still well within reach.

No solution will ever be perfect - if it were it would take longer to calculate than the wu themselves! So I agree the process of refinement has to stop at some point. My own feeling though is still that it is sensible to try to refine it a little more than it is at present.

The next steps, as I see them, are firstly to work out some kind of scoring to make a choice between RR and EDF when they both fail. At present the algorithm switches to EDF even in those cases where RR fails less badly than EDF!

Secondly, once a scoring algorithm is adopted, a discount for continuing the status quo will prevent the kind of 'flapping' that we currently see - the essential point Paul has been making for a while. All flexible algorithms come up against this issue; there is 'anti-flapping' code in internet routers, for example, to prevent them changing routes for trivial or transient advantages while allowing them still to re-route for a large or long-lived advantage.

On a more prosaic level, we have springs on car axles to keep the wheels on the ground when the car goes over a bump. However without shock absorbers the springs cause chaos. Both responsivity and inertia are needed to get past the very simplest designs in any technology.

River~~

Well, the anti-thrashing code is just to let the results run until the next potential task switch. Then choose which result(s) to run next.
33) Message boards : Number crunching : can't download (Message 11910)
Posted 13 Jan 2006 by John McLeod VII
Post:
The basic problem is choosing the "correct" thing to do at any time. You or I could see at a glance what needs to be done, but the computer does not have that much smarts. Unfortunately on a single CPU system there are n! possible ways of ordering the results for the processor, and the problem gets worse for larger numbers of CPUs. Even for fairly small numbers of results, searching this set for the best ordering is computationally infeasible (for example, with 35 results there are about 1 * 10^40 different orderings - a 1 with 40 0's behind it for those not familiar with scientific notation). Therefore the programmers have to pick a very small number of well-specified algorithms for the selection of a result to run, and a well-specified algorithm for determining which to use.
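
For anyone who wants to check the size of that number, a couple of lines of Python will do it. This only counts the orderings on a single CPU; the variable names are just for the example.

```python
import math

# Number of ways to order n results on a single CPU is n!
for n in (5, 10, 20, 35):
    print(n, f"{math.factorial(n):.3e}")

orderings = math.factorial(35)
digits = len(str(orderings))  # a number near 1 * 10^40 has 41 digits
```

The count grows so fast that trying every ordering stops being an option well before 35 results.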

Currently those are RR and EDF. The method for determining which to use is to simulate RR and see if it fails.
34) Message boards : Number crunching : can't download (Message 11905)
Posted 13 Jan 2006 by John McLeod VII
Post:
The obvious problem with this is that things change. If every unit was a CPDN unit then an execution plan might make sense. Failures in computation, completion of computations, and variable length results (which project does that, I wonder) all change the circumstances. The execution plan would have to be recalculated at each termination and at each switch. It seems pretty much like we have already.

No, not necessarily...

If the work is laid out ... and things are added/changed then yes, the plan needs to be REVIEWED. But, right now we always start with a clean sheet of paper. I am simply saying that if we stop throwing away the prior plan we might be better off.

When new work is downloaded, try to add it to the end of the plan, if that does not work, then begin to perturb it. But, preference should still remain with work already started, and to minimize the total change to what you started with before ...

Well, not a big issue to me either way. When I get the chance to stop rescheduling all CPUs at the drop of a hat, I will likely go to a 6-24 hour switch time so that work is run to completion to the greatest extent possible and only then will the minimal change be made ...

I hate to have 75 results in various stages of completion at all times ...

Things that would need to change the schedule you are proposing.
The obvious ones first.
1) New work downloaded.
2) Work completed.

Next a few that are not quite as obvious.
3) Work that is taking longer than it was supposed to based on the initial estimates (there was one project that for a while handed out work that took 800 times as long as the initial estimate).
4) Work that is apparently finishing early.
5) Computer off for a while.
6) BOINC off for a while.

Now for a not so obvious one.
7) Changing processor load.

I have probably missed a few.

All in all, we have to re-calculate frequently. The plans that were valid when the work was downloaded quickly become obsolete.
35) Message boards : Number crunching : can't download (Message 11903)
Posted 13 Jan 2006 by John McLeod VII
Post:

Giving an excess share to the project in trouble is one approach, but it still does not handle all cases properly. It fails spectacularly when ...

Have you seen my suggestion about ...

I actually implemented it, and it failed spectacularly all of the time...

doh!

yes of course it would do that - it's effectively looking to run the longest first and that is inherently destabilising... must think some more about this

Maybe I am dense, but, IN THIS CASE, should not the "tie-breaker" be the fact that the one work unit has been started?

==== Edit

The other question is why can't the system derive an "execution plan" and SAVE it, rather than recalculating everything from scratch each switch time. I mean, if we had an execution plan, the work units would be "slotted" and when time came to switch back, the work in progress would be continued.

But it is always looking at the result with the longest run time left. EDF will switch away from a running result if needed.
36) Message boards : Number crunching : can't download (Message 11897)
Posted 12 Jan 2006 by John McLeod VII
Post:

The fix is to have the CPU scheduler keep track of which project is in trouble, and give the extra CPU time to that project. If there are two or more projects with deadline trouble, the one with the earliest deadline gets the CPU first.


Yes, I agree there is a problem with multi cpu handling.

Giving an excess share to the project in trouble is one approach, but it still does not handle all cases properly. It fails spectacularly when there are different length wu from the same project and the longest has the furthest deadline.

Two cpu box, three results all from the same project. Two WU need 24hrs, both due 60 hours away; third WU 50 hours long but due 61 hours away. There is enough time to crunch everything, but only if you let go of the EDF principle. EDF makes you put the two short WU on separate cpus, leaving nowhere to put the biggun.
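
The numbers in this example can be checked with a toy Python model. It assumes each result runs to completion on whichever CPU frees up first, with no hourly switching, so it says nothing about how either ordering behaves in a real client; the names are invented for the sketch.

```python
import heapq


def schedule(results, n_cpus, key):
    """Place results on CPUs in the order given by key, run to completion.

    results is a list of (name, cpu_hours, hours_to_deadline).
    Returns a list of (name, finish_hour, hours_to_deadline).
    """
    free_at = [0.0] * n_cpus  # hour at which each CPU becomes free
    heapq.heapify(free_at)
    plan = []
    for name, hours, deadline in sorted(results, key=key):
        start = heapq.heappop(free_at)  # earliest available CPU
        finish = start + hours
        heapq.heappush(free_at, finish)
        plan.append((name, finish, deadline))
    return plan


def missed(plan):
    """Names of results that finish after their deadline."""
    return [name for name, finish, deadline in plan if finish > deadline]


results = [("short-1", 24, 60), ("short-2", 24, 60), ("long", 50, 61)]

edf_missed = missed(schedule(results, 2, key=lambda r: r[2]))       # earliest deadline first
longest_missed = missed(schedule(results, 2, key=lambda r: -r[1]))  # longest result first
print(edf_missed, longest_missed)
```

In this model, deadline order puts the two short results on separate CPUs and the long one finishes at hour 74, while longest-first finishes everything by hour 50.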

This can easily happen when downloading a bundle of WU from a project that has variable length work units. It also happened to me on CPDN when due to a server glitch I got a third WU on a two cpu box.

Have you seen my suggestion about preparing lists of work and filling the lists starting with the longest wu? It handles all the scenarios I could think of, (which may not be all of them, of course...) and in addition solves the annoying problem of excessive cpu swapping on a multi-cpu box.

River~~

I actually implemented it, and it failed spectacularly all of the time.

The problem is that if all the results from a particular project have the same time to process (as estimated) it picks up the first one, and after an hour puts that one aside, and then picks up another and after an hour, puts that one aside. Does this until all results have had an hour of processing. Then it starts over again. This is NOT the way that it should be done (and yes, the system was in EDF at the time) as it loses either large amounts of CPU time while switching or uses large amounts of swap file.

I went from there to having the project get the extra CPU time instead. This works in almost all cases.
37) Message boards : Number crunching : can't download (Message 11892)
Posted 12 Jan 2006 by John McLeod VII
Post:
...If the project that needs the extra attention is not the project with the earliest deadline, the queue can be otherwise emptied. For example a CPDN result that has 3 months left of crunch time and just over that left of wall time will occasionally allow the download of a result from another project - which will then immediately be crunched to completion.
...


I have seen this and it is not as daft as it looks. On reflection it is exactly right.

The short explanation is that CPDN needs more cpu time than the resource share allocates, but does not need all the cpu time till it completes. The box teeters along the crossover between allowing and not allowing work fetch in the way that allows the best compromise between the two demands.

The long explanation:

Suppose CPDN needs 95% of the crunching to get to the critical point. EDF gives it 100%, which makes the situation less critical because CPDN is getting 5% more CPU than it needs to deliver on time. Eventually it relaxes to the point where work fetch is allowed but the box remains in EDF.

Another project (presumably the one with the highest LTD) gets downloaded. If it has the highest LTD it is the one that has suffered worst from CPDN going into EDF. That project's WU gets crunched because it has a shorter deadline than CPDN, by which time either

- The next most urgent LTD gets downloaded; or

- the situation is critical again and CPDN gets 100% once more: repeat from top.

It is a beautiful example of how the system looks counter intuitive in the short term but is actually close to perfect in its long term effect. Please do *not* let anyone bully you into fixing this - it is not a bug.

The only bug is in participants' understanding of the effect, and the place to fix that is in the wiki, together with copious links to it as necessary.

River~~

This is indeed the way it works. However, there is a problem with multiple CPU systems (HT or Real).

A CPDN result needs 100% of the time on one CPU to complete in three months. It cannot use the second CPU. The second CPU can then only keep one result downloaded at a time or the following occurs:

The system is in EDF because of the CPDN result (obvious, but someone will need to be reminded). The system downloads 2 results for some other app (S@H for example). Since these have shorter deadlines, BOTH will be assigned to CPUs to crunch. If they finish at the same time, the system needs to download more work, gets 2 more results which are then both assigned to CPUs. So far, we have crunched 4 results for other projects and NOT done any work on the problem project (CPDN).

The fix is to have the CPU scheduler keep track of which project is in trouble, and give the extra CPU time to that project. If there are two or more projects with deadline trouble, the one with the earliest deadline gets the CPU first.
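
A minimal Python sketch of that selection rule, compared against plain EDF, might look like this. The "in trouble" test, the 24 hour margin, and all the names are assumptions for the example, not the client's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Result:
    project: str
    cpu_hours_left: float
    hours_to_deadline: float


def in_trouble(r, margin_hours=24.0):
    """Assumed test: the result needs nearly all the time left to its deadline."""
    return r.cpu_hours_left >= r.hours_to_deadline - margin_hours


def plain_edf(results, n_cpus):
    """Pick the n_cpus results with the earliest deadlines."""
    return sorted(results, key=lambda r: r.hours_to_deadline)[:n_cpus]


def trouble_first(results, n_cpus):
    """Results in deadline trouble get CPUs first (earliest deadline first
    among them); any CPUs left over are filled by plain EDF."""
    troubled = sorted((r for r in results if in_trouble(r)),
                      key=lambda r: r.hours_to_deadline)
    others = sorted((r for r in results if not in_trouble(r)),
                    key=lambda r: r.hours_to_deadline)
    return (troubled + others)[:n_cpus]


work = [
    Result("CPDN", cpu_hours_left=2160, hours_to_deadline=2170),
    Result("SETI", cpu_hours_left=3, hours_to_deadline=336),
    Result("SETI", cpu_hours_left=3, hours_to_deadline=336),
]

edf_projects = [r.project for r in plain_edf(work, 2)]
fixed_projects = [r.project for r in trouble_first(work, 2)]
print(edf_projects, fixed_projects)
```

On this dual CPU example plain EDF hands both CPUs to the short-deadline results, while the trouble-first rule gives one CPU to the CPDN result and the other to one of the short results.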
38) Message boards : Number crunching : can't download (Message 11878)
Posted 10 Jan 2006 by John McLeod VII
Post:
OK, the history part first.

The first scheduler used a decaying average to calculate how much work to give to a project. CPDN made this completely hopeless.

The second scheduler used Short Term Debt only to calculate what to crunch. It always tried to get work from all projects and do the load balancing this way. The fifth project broke this. Any setting of the resource shares that was not pretty much equal also broke this.

The third round of the scheduler used the LTD to determine what to download. It did not always attempt to keep work from all projects. It always incremented the LTD for projects that did not have the CPU's attention. A multi month outage at one of the projects broke this rather badly (nobody's queues would fill).

The solution to this was to not increment the LTD of projects that did not have work on the system and also had a communications deferral. This leads to a project that does not give out work on a regular basis losing some time permanently.

The current round of scheduler work fixes some things and I will have to look to see if the LTD can increase again for any project that is not running on a CPU, with the exceptions of (NNW and no work) and suspended projects of course. And, yes, it is still more complex internally, but hopefully it will work better for everyone.

The current rules:

If a project is suspended, the project does not participate in the LTD calculation at the end of a CPU scheduling period (hourly or early terminated by user action or completion of a result...)

If a project is marked NNW and it has no work on the system, it does not participate in the end of cycle LTD calculation.

If a project has communications deferred and it has no work on the system it does not participate in the end of cycle LTD calculation.

Resetting a project temporarily sets the LTD to 0 (it is moved away again when setting the mean LTD to 0).
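
The first three rules boil down to a single yes/no test per project at the end of each scheduling period. Here is a rough Python sketch of that test; the function and flag names are invented for illustration and are not the names used in the client.

```python
def participates_in_ltd(suspended, no_new_work, comm_deferred, has_work):
    """Return True if the project takes part in the end-of-cycle LTD calculation.

    suspended     - project is suspended
    no_new_work   - project is marked NNW
    comm_deferred - project has communications deferred
    has_work      - project has at least one result on the system
    """
    if suspended:
        return False
    if no_new_work and not has_work:
        return False
    if comm_deferred and not has_work:
        return False
    return True


# A deferred project that still has work on the host keeps participating.
print(participates_in_ltd(False, False, True, True))
# A deferred project with nothing on the host sits out.
print(participates_in_ltd(False, False, True, False))
```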

Effects.

If a host enters NWF and stays there for some time, all projects will eventually start to participate in the LTD calculation.

If a project has a very negative LTD, it will probably be participating in the LTD calculation.

If a host is offline, all projects will eventually start to participate in the LTD calculation.

Recent fixed bugs:

In some earlier versions the work in current period numbers were not zeroed after each work cycle and therefore the LTD (and STD) numbers were very far off. Fixed with 5.1.something.

The STD is not correctly setting the mean to 0 if a project suddenly has no work (reset, detached, or last result completed). The fix for this has been submitted.

Known issues:

There are cases where normally disconnected users do not get a full enough queue.

The LTD does not increment for projects that have a communications deferral.

The CPUs always switch at the same time.

EDF can do some odd things. If the project that needs the extra attention is not the project with the earliest deadline, the queue can be otherwise emptied. For example a CPDN result that has 3 months left of crunch time and just over that left of wall time will occasionally allow the download of a result from another project - which will then immediately be crunched to completion.

Calculation of mean LTD and STD:

39) Message boards : Number crunching : project priority ideas. (Message 11444)
Posted 23 Nov 2005 by John McLeod VII
Post:
Don't know what you mean by:

(Don't make me add this as a check for EDF).


But I do know that Boinc is smart enough to switch to another project to make sure it gets done on time.


1) John means that he's the one writing the panic mode code
2) Someone correct me if I'm wrong, but I think a 24-hour value for 'switch every' would help DEFEAT the panic mode code. If what I remember is right, the EDF scheduler only kicks in when results are switched, i.e. at the end of the 'switch every' timer. So by going to a 24-hour value, the panic mode could only possibly kick in A) every 24 hours or B) every time a work unit is done (or C) every time you hit Update manually). If you have a slow system that takes more than 24 hours to crunch a result, then you would very likely miss deadlines.

Granted, this is founded on the ideas that: 1) you have a slower machine or large results, and 2) that I'm right about EDF and how it kicks in.

Can anyone affirm or correct this for me?

(j)
James

Confirmed.
40) Message boards : Number crunching : project priority ideas. (Message 11429)
Posted 21 Nov 2005 by John McLeod VII
Post:
I don't know what you said different, but that's what I'm going to do.

Change every 24hrs.

......Can't get enough of thOSE, InFO threaaaaads........

:)

Change every 24 hours may be a bit extreme. If you have a result that is due 25 hours from now, and it has 2 hours of CPU remaining, and some other project gets picked first, it is going to be late. I would suggest using a number somewhat lower than 24 hours. (Don't make me add this as a check for EDF).
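
The arithmetic behind that example, as a small Python sketch. It assumes the worst case where the result has to wait out one full switch interval before it gets a CPU; the function name is made up.

```python
def finishes_on_time(switch_hours, cpu_hours_left, hours_to_deadline):
    """Worst case: another project is picked first and holds the CPU
    for one full switch interval before this result can start."""
    return switch_hours + cpu_hours_left <= hours_to_deadline


# Due in 25 hours with 2 hours of CPU remaining.
late_with_24 = not finishes_on_time(24, 2, 25)  # 24 + 2 = 26 hours, past the deadline
ok_with_12 = finishes_on_time(12, 2, 25)        # 12 + 2 = 14 hours, in time
print(late_with_24, ok_with_12)
```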

