Message boards : Number crunching : can't download

River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 11814 - Posted: 4 Jan 2006, 13:31:59 UTC - in response to Message 11810.  


It was originally designed to continue increasing debt any time a project did not have work. However many complaints about projects dominating a host after an outage caused this to be changed. I do not think it will be possible politically to get it changed back. Technically it should be fairly easy, just comment out a few blocks of code.


You mean it was working but they deliberately broke it?

Of course a project dominates after an outage - it has to catch up. Hence the term LONG term debt. As in debt that has arisen in the LONG term, duh?

We don't remove LTD because people complain when a project is blocked for going into negative LTD, even though the micromanagers want us to (and our friend thought that was what I was asking for).

After all the comments from the devs about micromanagers breaking things, they let the advocates of micromanagement sneak the micromanagement code into the mainstream - well all I can say is *!) ""££ ^££$ ^"%^ "!!!&!! &*% ^£% ^%£^"%^ !"^!^!! $$"£ $"$!

A tactful translation might be:
The people who don't want to play catch-up can jolly well reset after an outage - there is no easy way to simulate the code that has been deliberately broken for them.

R~~
ID: 11814
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 11816 - Posted: 4 Jan 2006, 14:23:09 UTC - in response to Message 11783.  

...I would much prefer to put much of my time to hard physics (why I am bent I cannot specify AstroPulse only, or at least that is not in the plans) ... and At this time Einstein@Home is the only game in town for that ...


As I see it, LHC *is* hard physics. Experiment is a vital part of physics (or has been since we improved on the Greek model of science). Designing an experiment is as much part of the science as sifting the results.

E@h is not physics in the sense that that project does not crunch massive solutions of Einstein's equations (now that *would* be a shared enterprise that would attract me, and I guess you too). The crunchers in E@h are filtering the experimental results. But it is hard physics in the sense that it is an essential part of the physics.

LHC is not looking at results, but at experimental design - equally ancillary and equally essential as result-sifting. In my opinion, of course - we may well still differ as we often do ;-).

If the LHC finds the Higgs I will feel just as much a minor contributor to that through my LHC cobblestones helping design the infrastructure as I will feel a contributor to the finding of the gravy waves if E@h strikes lucky. And if either experiment fails to find the result, that too is interesting physics and then too I will feel I have contributed.

Both areas interest me personally - having taught General Relativity for years, and long before that having done an MSc in particle physics (including a summer at CERN itself). I feel very lucky that both my favourite areas of physics are represented within one DC platform, ie BOINC.

R~~

ID: 11816
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 11820 - Posted: 4 Jan 2006, 18:22:34 UTC

... finding of the gravy waves if E@h strikes lucky.


Hmmm, and I thought we had left Thanksgiving behind us ... :)

I get your point that the design and testing of the set-up is important. And, my CS numbers should show that I have aimed a lot of attention in that direction. Especially when you consider the amount of time when there has not been any work.
ID: 11820
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 11824 - Posted: 4 Jan 2006, 20:28:42 UTC - in response to Message 11820.  


And, my CS numbers should show that I have aimed a lot of attention in that direction.


absolutely so - when I see your LHC stats I see a great contribution to real physics, whether you felt like that at the time or not, and I am sure the particle physicists will see it that way :-)

R~~
ID: 11824
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 11830 - Posted: 4 Jan 2006, 22:34:21 UTC - in response to Message 11824.  


And, my CS numbers should show that I have aimed a lot of attention in that direction.


absolutely so - when I see your LHC stats I see a great contribution to real physics, whether you felt like that at the time or not, and I am sure the particle physicists will see it that way :-)

Well, I kinda liked the reception we got when we had just been running for a little while. With the investment of just one small PC server they had tapped into computing power equal to one of the machines that they used in the lab. And we were under a "cap" for total participants.

But part of it was always betting on the next stage. Will they come up with a new mission from CERN for us to do? In theory, when the switch is thrown and the machine goes live, what we are doing is no longer needed. Again, this is if I understood the tea leaves.

More important is the fact that if we COULD go to the next step, running the simulations for the colliders themselves, well, now we have opened up work not only from CERN, but from every other collider in the world...

And the good news is that is being looked into. Some difficulties, like large data sets, but, heck, even my small 36 G disk drives will be big enough as those systems rarely have more than 3-4G used. Heck my windows work station has 45 G free of 68 ...
ID: 11830
Keck_Komputers
Joined: 1 Sep 04
Posts: 275
Credit: 2,652,452
RAC: 0
Message 11845 - Posted: 6 Jan 2006, 11:40:32 UTC

Although I personally like my computers changing projects frequently, I often feel that it would be better for the BOINC project to dump the CPU scheduler entirely. It's not that JM7 hasn't done a good job with it, and he is still working on improvements, but the previous system was simple, robust and allowed better control of the queue.

When the previous system needed more work it contacted projects in order of LTD until the queue was full. Work from the queue was then run earliest deadline first until completion. During and after an outage a project would have less traffic, because once a host had tried once it would get work from another project to tide it over and not try again until that work was completed. Setting a 3 to 4 day queue meant you normally had 3.5 days of work on hand no matter how many projects you were attached to. You could attach to as many projects as you wanted to on even the slowest computer (mostly fixed now, but a real problem before 4.36).
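For readers who did not see the old client, here is a rough Python sketch of the scheme described above. It is not BOINC code: the class names, the debt units and the rule that the first project with work fills the whole shortfall are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    ltd: float             # long term debt (arbitrary units, assumed)
    has_work: bool         # False during an outage
    result_hours: float    # estimated run time of one result
    deadline_hours: float  # deadline of a new result, in hours from now

@dataclass
class Result:
    project: str
    hours: float
    deadline: float

def fill_queue(projects, queue_hours):
    """Contact projects in descending LTD order until the queue is full."""
    queue, queued = [], 0.0
    for p in sorted(projects, key=lambda p: p.ltd, reverse=True):
        if queued >= queue_hours:
            break
        if not p.has_work:
            continue  # outage: tried once, fall through to the next project
        while queued < queue_hours:
            queue.append(Result(p.name, p.result_hours, p.deadline_hours))
            queued += p.result_hours
    return queue

def run_order(queue):
    """Run queued results earliest deadline first, each to completion."""
    return sorted(queue, key=lambda r: r.deadline)

# Hypothetical host: LHC is owed the most time but has no work right now.
projects = [
    Project("LHC", ltd=500.0, has_work=False, result_hours=2.0, deadline_hours=120.0),
    Project("SETI", ltd=100.0, has_work=True, result_hours=4.0, deadline_hours=336.0),
    Project("CPDN", ltd=-600.0, has_work=True, result_hours=2000.0, deadline_hours=8760.0),
]
queue = fill_queue(projects, queue_hours=84.0)  # a 3.5 day queue
```

In this made-up case the host skips LHC, fills its 3.5 days from SETI, and never contacts CPDN in this round.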

What killed this system and will most likely keep it from returning was the impatience of some of the participants. They were not willing to run CPDN only for months at a time. It did have some other problems, both real and perceived, but these could have been fixed.
BOINC WIKI

BOINCing since 2002/12/8
ID: 11845
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 11846 - Posted: 6 Jan 2006, 14:28:14 UTC - in response to Message 11845.  

What killed this system and will most likely keep it from returning was the impatience of some of the participants. They were not willing to run CPDN only for months at a time. It did have some other problems, both real and perceived, but these could have been fixed.


I can see both sides of this. EDF mode as you say would mean running a CPDN to completion once it started. Now, having a wu grab the machine for a day or two is one thing, but there are real disadvantages if it goes on for months.

Using CPDN as a backup project "so I always have work" is ruled out - every time there was an outage (frequent on SETI, expected on LHC) your client would lock on to CPDN for months. (A slab took about 3 months on an 800MHz box, a sulphur probably about 9 months). This is not what most people want from a backup project.

Secondly it makes it impossible to run CPDN as a background project while waiting for work from LHC, or any other deliberately intermittent project. I can imagine many uses of DC that will develop as it becomes possible to have a pool of donors who wait for you to have work: NASA calculating flight windows, epidemic control tracking flu outbreaks, all sorts of things that nobody is willing to try yet because of the feeling that DC is a long term commitment.

Yet one of the opportunities BOINC opens up is to have an intermittent project as a first choice, but have the box get on with other stuff in the gaps. Basically I'd like to see this sort of option encouraged, not discouraged.

River~~
ID: 11846
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 11847 - Posted: 6 Jan 2006, 14:40:31 UTC - in response to Message 11830.  


But part of it was always betting on the next stage. Will they come up with a new mission from CERN for us to do?


Yes - the beam physicists having shown that DC can be a reliable source of computing power, the particle physicists are more likely to give it a try too.


In theory, when the switch is thrown and the machine goes live, what we are doing is no longer needed. Again, this is if I understood the tea leaves.


No. The particle physicists from time to time will ask the beam physicists "can you do X" and the beam physicists will come back to the crunching.

The most extreme example of this was in the 70s when the J/psi was discovered, weighing in at an annoying 3.1 GeV. Why so annoying? Because at the time almost everyone who was building bigger machines was aiming at that nice round number 3.00 GeV. "Can we just stretch another 3 percent or so" was heard by many a beam physicist around that time...

My reading of the tea leaves is that there will be long pauses between work once the machine is working, but that they will want to keep at least a few thousand CPUs on tap for those odd changes of plan by the particle physicists.

River~~
ID: 11847
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 11851 - Posted: 6 Jan 2006, 19:03:58 UTC

Well, we can hope that they do multi-application. The current one, as you say River, for testing, and another for the other side of the house. Maybe we would have a more consistent supply of work.
ID: 11851
Lee Carre
Joined: 13 Jul 05
Posts: 23
Credit: 22,567
RAC: 0
Message 11853 - Posted: 7 Jan 2006, 1:12:25 UTC

I personally agree with River's comments about which situations should increase LTD and about the CPU scheduler. I don't micromanage, but I like the idea of switching between projects and the general way in which it does that (with a few annoyances, as stated by Paul in a SETI thread).
ID: 11853
John McLeod VII
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 11878 - Posted: 10 Jan 2006, 2:30:26 UTC

OK, the history part first.

The first scheduler used a decaying average to calculate how much work to give to a project. CPDN made this completely hopeless.

The second scheduler used Short Term Debt only to calculate what to crunch. It always tried to get work from all projects and do the load balancing this way. The fifth project broke this. Any setting of the resource shares that was not pretty much equal also broke this.

The third round of the scheduler used the LTD to determine what to download. It did not always attempt to keep work from all projects. It always incremented the LTD for projects that did not have the CPUs attention. A multi month outage at one of the projects broke this rather badly (nobody's queues would fill).

The solution to this was to not increment the LTD of projects that did not have work on the system and also had a communications deferral. This leads to a project that does not give out work on a regular basis losing some time permanently.

The current round of scheduler work fixes some things and I will have to look to see if the LTD can increase again for any project that is not running on a CPU, with the exceptions of (NNW and no work) and suspended projects of course. And, yes, it is still more complex internally, but hopefully it will work better for everyone.

The current rules:

If a project is suspended, the project does not participate in the LTD calculation at the end of a CPU scheduling period (hourly or early terminated by user action or completion of a result...)

If a project is marked NNW and it has no work on the system, it does not participate in the end of cycle LTD calculation.

If a project has communications deferred and it has no work on the system it does not participate in the end of cycle LTD calculation.

Resetting a project temporarily sets the LTD to 0 (it is moved away again when setting the mean LTD to 0).
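The participation rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names are invented, the debt formula (entitled share of the period's CPU time minus time actually received) is a guess at the general shape rather than the client's real arithmetic, and the mean is taken over participating projects only.

```python
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    share: float                   # resource share
    ltd: float = 0.0               # long term debt, in CPU seconds (assumed unit)
    suspended: bool = False
    no_new_work: bool = False      # the NNW flag
    comm_deferred: bool = False
    has_work_on_host: bool = True

def participates(p):
    """Apply the three exclusion rules listed in the post."""
    if p.suspended:
        return False
    if p.no_new_work and not p.has_work_on_host:
        return False
    if p.comm_deferred and not p.has_work_on_host:
        return False
    return True

def update_ltd(projects, cpu_seconds_done):
    """End-of-period update. cpu_seconds_done maps project name to CPU seconds."""
    active = [p for p in projects if participates(p)]
    if not active:
        return
    total_share = sum(p.share for p in active)
    total_work = sum(cpu_seconds_done.get(p.name, 0.0) for p in active)
    for p in active:
        entitled = total_work * p.share / total_share
        p.ltd += entitled - cpu_seconds_done.get(p.name, 0.0)
    # Shift the debts so that their mean is zero (assumed: active projects only).
    mean = sum(p.ltd for p in active) / len(active)
    for p in active:
        p.ltd -= mean

a = Project("A", 1.0)
b = Project("B", 1.0)
c = Project("C", 1.0, comm_deferred=True, has_work_on_host=False)
update_ltd([a, b, c], {"A": 3600.0})  # A ran for the whole hour
```

Here project C, deferred and with no work on the host, keeps its debt unchanged, which is exactly the behaviour complained about earlier in the thread.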

Effects.

If a host enters NWF and stays there for some time, all projects will eventually start to participate in the LTD calculation.

If a project has a very negative LTD, it will probably be participating in the LTD calculation.

If a host is offline, all projects will eventually start to participate in the LTD calculation.

Recent fixed bugs:

In some earlier versions the work-in-current-period numbers were not zeroed after each work cycle, and therefore the LTD (and STD) numbers were very far off. Fixed with 5.1.something.

The STD is not correctly setting the mean to 0 if a project suddenly has no work (reset, detached, or last result completed). The fix for this has been submitted.

Known issues:

There are cases where normally disconnected users do not get a full enough queue.

The LTD does not increment for projects that have a communications deferral.

The CPUs always switch at the same time.

EDF can do some odd things. If the project that needs the extra attention is not the project with the earliest deadline, the queue can be otherwise emptied. For example a CPDN result that has 3 months left of crunch time and just over that left of wall time will occasionally allow the download of a result from another project - which will then immediately be crunched to completion.

Calculation of mean LTD and STD:




BOINC WIKI
ID: 11878
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 11887 - Posted: 11 Jan 2006, 15:38:20 UTC - in response to Message 11878.  

...If the project that needs the extra attention is not the project with the earliest deadline, the queue can be otherwise emptied. For example a CPDN result that has 3 months left of crunch time and just over that left of wall time will occasionally allow the download of a result from another project - which will then immediately be crunched to completion.
...


I have seen this and it is not as daft as it looks. On reflection it is exactly right.

The short explanation is that CPDN needs more cpu time than the resource share allocates, but does not need all the cpu time till it completes. The box teeters along the crossover between allowing and not allowing work fetch in the way that allows the best compromise between the two demands.

The long explanation:

Suppose CPDN needs 95% of the crunching to get to the critical point. EDF gives it 100%, which makes the situation less critical because CPDN is getting 5% more CPU than it needs to deliver on time. Eventually it relaxes to the point where work fetch is allowed but the box remains in EDF.

Another project (presumably the one with the highest LTD) gets downloaded. If it has the highest LTD it is the one that has suffered worst from CPDN going into EDF. That project's WU gets crunched because it has a shorter deadline than CPDN, by which time either

- The next most urgent LTD gets downloaded; or

- the situation is critical again and CPDN gets 100% once more: repeat from top.
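The cycle described above can be tried out with a toy hour-by-hour simulation in Python. Everything numeric here is an assumption: the 0.95 threshold stands in for whatever test the client really uses to re-allow work fetch, the other project's results are taken to be 2 hours long, and the extra "fits" check is added only so the toy never downloads work it has no slack for.

```python
def simulate(cpu_left, wall_left, fetch_below=0.95, wu_hours=2):
    """Return (cpdn_hours, other_hours, wall_left) once the CPDN result finishes.

    cpu_left:  hours of crunching the CPDN result still needs
    wall_left: hours until its deadline
    """
    cpdn_hours = other_hours = queued_other = 0
    while cpu_left > 0:
        if queued_other > 0:
            # The downloaded result has the earlier deadline, so EDF runs it.
            queued_other -= 1
            other_hours += 1
        else:
            cpu_left -= 1
            cpdn_hours += 1
        wall_left -= 1
        relaxed = cpu_left < fetch_below * wall_left  # CPDN is ahead of need
        fits = cpu_left + wu_hours <= wall_left       # slack covers one more WU
        if queued_other == 0 and cpu_left > 0 and relaxed and fits:
            queued_other = wu_hours                   # work fetch allowed again
    return cpdn_hours, other_hours, wall_left

# CPDN needs 950 of the next 1000 hours, i.e. 95% of the box.
cpdn, other, spare = simulate(cpu_left=950, wall_left=1000)
```

Under these assumptions the other projects only ever receive hours that CPDN can spare, and CPDN still finishes by its deadline, which is the long-term effect argued for above.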

It is a beautiful example of how the system looks counterintuitive in the short term but is actually close to perfect in its long term effect. Please do *not* let anyone bully you into fixing this - it is not a bug.

The only bug is in participants' understanding of the effect, and the place to fix that is in the wiki, together with copious links to it as necessary.

River~~
ID: 11887
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 11888 - Posted: 11 Jan 2006, 16:10:22 UTC

Maybe we just need a better class of participant?

Just a thought ...
ID: 11888
koldphuzhun
Joined: 14 Dec 05
Posts: 4
Credit: 782,827
RAC: 3,242
Message 11890 - Posted: 11 Jan 2006, 22:04:01 UTC

Better class of participation? How about some WU's so that a class can think about participation. As it is, I feel like we're going in a circle waiting for new work to arrive. Oh wait, we are. At least we're keeping the database busy with message posts.
ID: 11890
John McLeod VII
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 11892 - Posted: 12 Jan 2006, 0:26:53 UTC - in response to Message 11887.  

...If the project that needs the extra attention is not the project with the earliest deadline, the queue can be otherwise emptied. For example a CPDN result that has 3 months left of crunch time and just over that left of wall time will occasionally allow the download of a result from another project - which will then immediately be crunched to completion.
...


I have seen this and it is not as daft as it looks. On reflection it is exactly right.

The short explanation is that CPDN needs more cpu time than the resource share allocates, but does not need all the cpu time till it completes. The box teeters along the crossover between allowing and not allowing work fetch in the way that allows the best compromise between the two demands.

The long explanation:

Suppose CPDN needs 95% of the crunching to get to the critical point. EDF gives it 100%, which makes the situation less critical because CPDN is getting 5% more CPU than it needs to deliver on time. Eventually it relaxes to the point where work fetch is allowed but the box remains in EDF.

Another project (presumably the one with the highest LTD) gets downloaded. If it has the highest LTD it is the one that has suffered worst from CPDN going into EDF. That project's WU gets crunched because it has a shorter deadline than CPDN, by which time either

- The next most urgent LTD gets downloaded; or

- the situation is critical again and CPDN gets 100% once more: repeat from top.

It is a beautiful example of how the system looks counterintuitive in the short term but is actually close to perfect in its long term effect. Please do *not* let anyone bully you into fixing this - it is not a bug.

The only bug is in participants' understanding of the effect, and the place to fix that is in the wiki, together with copious links to it as necessary.

River~~

This is indeed the way it works. However, there is a problem with multiple CPU systems (HT or Real).

A CPDN result needs 100% of the time on one CPU to complete in three months. It cannot use the second CPU. The second CPU can then only keep one result downloaded at a time or the following occurs:

The system is in EDF because of the CPDN result (obvious, but someone will need to be reminded). The system downloads 2 results for some other app (S@H for example). Since these have shorter deadlines, BOTH will be assigned to CPUs to crunch. If they finish at the same time, the system needs to download more work, gets 2 more results which are then both assigned to CPUs. So far, we have crunched 4 results for other projects and NOT done any work on the problem project (CPDN).

The fix is to have the CPU scheduler keep track of which project is in trouble, and give the extra CPU time to that project. If there are two or more projects with deadline trouble, the one with the earliest deadline gets the CPU first.
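A small Python sketch may make the difference concrete. It is an illustration only: the function names are invented, and how the client actually decides that a project is "in trouble" is not shown in the post, so the set of troubled projects is simply passed in.

```python
from dataclasses import dataclass

@dataclass
class Result:
    project: str
    deadline: float  # hours from now
    cpu_left: float  # hours of crunching still needed

def plain_edf(results, ncpus):
    """Give the CPUs to the results with the earliest deadlines."""
    return sorted(results, key=lambda r: r.deadline)[:ncpus]

def trouble_first(results, ncpus, in_trouble):
    """Serve projects with deadline trouble first, earliest deadline first,
    then fill any remaining CPUs by plain EDF."""
    by_deadline = sorted(results, key=lambda r: r.deadline)
    urgent = [r for r in by_deadline if r.project in in_trouble]
    others = [r for r in by_deadline if r.project not in in_trouble]
    return (urgent + others)[:ncpus]

# Two-CPU host: one CPDN result in deadline trouble, two freshly fetched S@H results.
results = [
    Result("CPDN", deadline=2200.0, cpu_left=2160.0),
    Result("S@H", deadline=300.0, cpu_left=4.0),
    Result("S@H", deadline=310.0, cpu_left=4.0),
]
edf_pick = [r.project for r in plain_edf(results, 2)]
fixed_pick = [r.project for r in trouble_first(results, 2, in_trouble={"CPDN"})]
```

With plain EDF both CPUs go to S@H and CPDN gets nothing; with the troubled project served first, CPDN takes one CPU and S@H the other.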


BOINC WIKI
ID: 11892
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 11896 - Posted: 12 Jan 2006, 19:08:16 UTC - in response to Message 11892.  


The fix is to have the CPU scheduler keep track of which project is in trouble, and give the extra CPU time to that project. If there are two or more projects with deadline trouble, the one with the earliest deadline gets the CPU first.


Yes, I agree there is a problem with multi cpu handling.

Giving an excess share to the project in trouble is one approach, but it still does not handle all cases properly. It fails spectacularly when there are different length wu from the same project and the longest has the furthest deadline.

Two-cpu box, three results all from the same project. Two WU need 24 hours each, both due 60 hours away; the third WU is 50 hours long but due 61 hours away. There is enough time to crunch everything, but only if you let go of the EDF principle. EDF makes you put the two short WU on separate cpus, leaving nowhere to put the biggun.

This can easily happen when downloading a bundle of WU from a project that has variable length work units. It also happened to me on CPDN when due to a server glitch I got a third WU on a two cpu box.

Have you seen my suggestion about preparing lists of work and filling the lists starting with the longest wu? It handles all the scenarios I could think of, (which may not be all of them, of course...) and in addition solves the annoying problem of excessive cpu swapping on a multi-cpu box.
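The two-cpu example above is easy to check with a few lines of Python. This is a simplified model, not the client's scheduler: results are placed one at a time on whichever CPU frees up first and run to completion without switching, and the "longest first" ordering is only one reading of the list-filling suggestion.

```python
import heapq

def list_schedule(jobs, ncpus):
    """Place jobs in the given order on the next free CPU; return finish times.

    jobs: list of (name, hours, deadline) tuples, already in priority order.
    """
    free_at = [0.0] * ncpus
    heapq.heapify(free_at)
    finish = {}
    for name, hours, _deadline in jobs:
        start = heapq.heappop(free_at)  # earliest available CPU
        finish[name] = start + hours
        heapq.heappush(free_at, start + hours)
    return finish

def missed(jobs, finish):
    """Names of jobs that finish after their deadline."""
    return [name for name, _hours, deadline in jobs if finish[name] > deadline]

jobs = [("short1", 24.0, 60.0), ("short2", 24.0, 60.0), ("big", 50.0, 61.0)]

edf_order = sorted(jobs, key=lambda j: j[2])       # earliest deadline first
longest_order = sorted(jobs, key=lambda j: -j[1])  # longest result first

edf_missed = missed(jobs, list_schedule(edf_order, 2))
longest_missed = missed(jobs, list_schedule(longest_order, 2))
```

In this model EDF starts the 50 hour result at hour 24, so it finishes at hour 74 and misses its 61 hour deadline. Longest first runs it from hour 0 on one CPU and the two short results back to back on the other, finishing everything by hour 50.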

River~~
ID: 11896
John McLeod VII
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 11897 - Posted: 12 Jan 2006, 19:32:03 UTC - in response to Message 11896.  


The fix is to have the CPU scheduler keep track of which project is in trouble, and give the extra CPU time to that project. If there are two or more projects with deadline trouble, the one with the earliest deadline gets the CPU first.


Yes, I agree there is a problem with multi cpu handling.

Giving an excess share to the project in trouble is one approach, but it still does not handle all cases properly. It fails spectacularly when there are different length wu from the same project and the longest has the furthest deadline.

Two-cpu box, three results all from the same project. Two WU need 24 hours each, both due 60 hours away; the third WU is 50 hours long but due 61 hours away. There is enough time to crunch everything, but only if you let go of the EDF principle. EDF makes you put the two short WU on separate cpus, leaving nowhere to put the biggun.

This can easily happen when downloading a bundle of WU from a project that has variable length work units. It also happened to me on CPDN when due to a server glitch I got a third WU on a two cpu box.

Have you seen my suggestion about preparing lists of work and filling the lists starting with the longest wu? It handles all the scenarios I could think of, (which may not be all of them, of course...) and in addition solves the annoying problem of excessive cpu swapping on a multi-cpu box.

River~~

I actually implemented it, and it failed spectacularly all of the time.

The problem is that if all the results from a particular project have the same time to process (as estimated) it picks up the first one, and after an hour puts that one aside, and then picks up another and after an hour puts that one aside. It does this until all results have had an hour of processing. Then it starts over again. This is NOT the way that it should be done (and yes, the system was in EDF at the time) as it either loses large amounts of CPU time while switching or uses large amounts of swap file.
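The rotation described above falls straight out of a "longest remaining first" rule, as this toy Python model shows. It is not the code that was actually implemented; it assumes a single CPU, a one hour scheduling period, and that the pick is simply the result with the most estimated time left.

```python
def longest_first(remaining, candidates):
    """Pick the unfinished result with the most estimated time left."""
    return max(candidates, key=lambda i: remaining[i])

def simulate(remaining, pick, hours):
    """Run one CPU for a number of one hour periods; return the index run each hour."""
    remaining = list(remaining)
    history = []
    for _ in range(hours):
        candidates = [i for i, r in enumerate(remaining) if r > 0]
        if not candidates:
            break
        i = pick(remaining, candidates)
        remaining[i] -= 1  # one hour of crunching on the chosen result
        history.append(i)
    return history

def switches(history):
    """Count how many times the running result changed."""
    return sum(1 for a, b in zip(history, history[1:]) if a != b)

# Three results from one project, each estimated at 10 hours.
history = simulate([10, 10, 10], longest_first, hours=6)
```

After each hour the result that just ran is no longer the longest, so the rule moves on to another one: every period ends in a switch, and all three results sit half finished in memory or swap.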

I went from there to having the project get the extra CPU time instead. This works in almost all cases.


BOINC WIKI
ID: 11897
River~~
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 11899 - Posted: 13 Jan 2006, 7:49:13 UTC - in response to Message 11897.  


Giving an excess share to the project in trouble is one approach, but it still does not handle all cases properly. It fails spectacularly when ...

Have you seen my suggestion about ...

I actually implemented it, and it failed spectacularly all of the time...

doh!

yes of course it would do that - it's effectively looking to run the longest first and that is inherently destabilising... must think some more about this

Thanks for trying.

River~~
ID: 11899
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 11901 - Posted: 13 Jan 2006, 11:11:54 UTC - in response to Message 11899.  
Last modified: 13 Jan 2006, 11:15:16 UTC


Giving an excess share to the project in trouble is one approach, but it still does not handle all cases properly. It fails spectacularly when ...

Have you seen my suggestion about ...

I actually implemented it, and it failed spectacularly all of the time...

doh!

yes of course it would do that - it's effectively looking to run the longest first and that is inherently destabilising... must think some more about this

Maybe I am dense, but, IN THIS CASE, should not the "tie-breaker" be the fact that the one work unit has been started?

==== Edit

The other question is why can't the system derive an "execution plan" and SAVE it, rather than recalculating everything from scratch each switch time. I mean, if we had an execution plan, the work units would be "slotted" and when time came to switch back, the work in progress would be continued.
ID: 11901
Gaspode the UnDressed
Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 11902 - Posted: 13 Jan 2006, 13:21:39 UTC - in response to Message 11901.  



The other question is why can't the system derive an "execution plan" and SAVE it, rather than recalculating everything from scratch each switch time. I mean, if we had an execution plan, the work units would be "slotted" and when time came to switch back, the work in progress would be continued.


The obvious problem with this is that things change. If every unit was a CPDN unit then an execution plan might make sense. Failures in computation, completion of computations, and variable length results (which project does that, I wonder) all change the circumstances. The execution plan would have to be recalculated at each termination and at each switch. It seems pretty much like what we have already.


Gaspode the UnDressed
http://www.littlevale.co.uk
ID: 11902


©2024 CERN