Message boards : Number crunching : When you see work is around ...
Joined: 13 Jul 05 | Posts: 456 | Credit: 75,142 | RAC: 0
When you see work around, please look before rushing in.

Desti recently posted this in another thread: "The new server has some serious database problems, it's always overloaded."

In fact this is not a server-side problem; it is caused by user behaviour. When work comes onto the server, people rush for it as soon as they notice, which is good for those at the front of the queue but then overloads the server.

You may notice that in times when there is no work, your client drops back into a mode where it checks with LHC for work every three to five hours. This is to protect the database. Sadly, people know that with the current small releases of work it will all disappear within an hour, so if they see work is there they make their clients ask for it immediately. As one minute is 1/240th of four hours, the database would have to be designed to take a peak load of about 240 times its current peak capacity to survive that. That would not be a good use of the project's resources.

Yesterday there was work on the server around 1500 UTC. It had all gone by 1600, before the feeding frenzy killed the database, as you can see if you look at Scarecrow's Graphs - see how the line starts to drop before the gap. Even more daft, there was another feeding frenzy around midnight UTC, probably people in the Americas checking their clients on getting home from work (and btw, I am not blaming you guys any more than us, cos we Europeans were undoubtedly responsible for the earlier frenzy). But just 37 tasks were issued in the hour after the db came back, and after those had already gone, people hoping for more killed the db again. A rush while there is something to get is unhelpful but understandable. A rush after it has all gone is not even sensible in selfish terms.

In fact, these two database outages made no difference to the project - all the work was out the door by then. Of course it is annoying that the same db runs these boards, so we could not talk about the outage while it lasted.

So, rule 1: ideally, don't be a shark.

Rule 2, for non-idealists breaking rule 1: please look at Scarecrow's Graphs to see if there is any chance of getting something. (Scarecrow: let us know if the traffic gives you / bluenorthern / your ISP any problems!)

Rule 3, when breaking rules 1 & 2: at least have the plain common sense to stop asking for work when the database is already overloaded. You cannot possibly get any work, and you are only prolonging the problem. Leave your client alone to go back into every-four-hour mode and the db will come right.

River~~
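To put rough numbers on that, here is a minimal back-of-the-envelope sketch in Python. The host count is an arbitrary assumption; only the ratio matters.

    # Sketch of the load argument above (assumed numbers, not measured LHC@home figures):
    # compare scheduler requests per minute when hosts poll on the normal ~4-hour
    # backoff versus hammering the server every minute.
    hosts = 10_000                 # hypothetical number of attached hosts
    backoff_minutes = 4 * 60       # normal "no work" backoff of roughly four hours
    frenzy_minutes = 1             # a host whose user keeps clicking Update retries every minute

    backoff_rate = hosts / backoff_minutes   # requests per minute, well-behaved clients
    frenzy_rate = hosts / frenzy_minutes     # requests per minute, everyone hammering

    print(f"backoff mode: {backoff_rate:.1f} requests/min")
    print(f"frenzy mode : {frenzy_rate:.1f} requests/min")
    print(f"ratio       : {frenzy_rate / backoff_rate:.0f}x")   # 240x, the factor described above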
Joined: 27 Sep 04 | Posts: 282 | Credit: 1,415,417 | RAC: 0
Agreed. I have my CPUs attached to LHC, but I don't check the server every 10 minutes for new work. So I haven't gotten any lately - but aren't there more important things to do in life than checking for LHC work? Cheers, Thorsten
Joined: 16 Jul 05 | Posts: 84 | Credit: 1,875,851 | RAC: 0
What servers does LHC@home use? Other projects have MANY more users and they have no problems with their database connections, even when they are out of work (like SIMAP).
Joined: 13 Jul 05 | Posts: 456 | Credit: 75,142 | RAC: 0
What servers does LHC@home use?

In my opinion the difference is the way some LHC users (and probably only a small minority) have become fanatical about getting work when there is some. There is no problem on LHC while there is no work; the problem always arises just after work is issued, as people adopt their own strategies to try to get some ahead of everyone else.

And from the project's viewpoint there is no need for better servers when (as this time) the breaking point only arrives after they have got all the work out the door.

R~~
Joined: 13 Jul 05 | Posts: 456 | Credit: 75,142 | RAC: 0
Agreed. Like checking for work on Orbit@home ;-) |
Joined: 21 May 06 | Posts: 73 | Credit: 8,710 | RAC: 0
....

Do you have any data to support this conjecture? I know you say it is only an opinion - but perhaps it is an informed opinion? Or perhaps it is not?

This issue has come up before. Some have suggested that the admins limit the number of concurrent server connections: if x connections cause the server to crash, set the limit n to something comfortably below x. If it still fails, re-evaluate x and revise n downward... Most servers have such control features, and many admins use them to prevent outages.

Aren't the admins at a new location? Perhaps they are new to administering a BOINC server. Perhaps they will work it out soon.

Given that there is a limited amount of work to be done, throttling the server should limit effective crunching only slightly. And throttle limits would be much more benign than "crashes"...
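To illustrate the kind of throttle being suggested - a generic sketch only, not how the LHC@home server or its database actually enforces a limit - the idea is to cap concurrent database connections at some n chosen below the crash point x, and turn excess requests away cleanly instead of letting them pile up:

    import threading

    MAX_DB_CONNECTIONS = 50          # hypothetical n, chosen safely below the crash point x

    db_slots = threading.BoundedSemaphore(MAX_DB_CONNECTIONS)

    def query_database(request):
        # stand-in for the real scheduler / database work
        return f"handled {request}"

    def handle_scheduler_request(request):
        # Try to take a connection slot without blocking; if none are free,
        # refuse politely rather than adding more load to an overloaded db.
        if not db_slots.acquire(blocking=False):
            return "server busy - please retry later"
        try:
            return query_database(request)
        finally:
            db_slots.release()

    print(handle_scheduler_request("host-42"))   # example call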
Joined: 13 Jul 05 | Posts: 456 | Credit: 75,142 | RAC: 0
....

1 (fact): what some users have said in the past on these boards about their own behaviour, or advising others to behave like this -- often that advice is given in good faith, not realising the knock-on effect it can have.

2 (assumption): that for each person posting such advice there are many who do the same thing (either by taking that advice or by working it out for themselves).

3 (informal observation): the gaps in db service seem to cluster around two times of day, ~1600 and ~0000 UTC, rather than being spread randomly around the clock.

4 (unsupported assumption): these correlate with people in Europe and the Americas respectively coming in from work/study, checking their boxes and intervening if they feel they know better than the BOINC client.

5 (fact): the backoff algorithm contains a random element specifically to ensure that if all clients are left running for several days without manual intervention, their db accesses are spread evenly around the clock (see the sketch after this post).

6 (countervailing argument): the same db is used for forum messages, so it could be argued that the problem is the combined even load of client access plus the uneven load of forum access.

7 (observation / rebuttal): these boards show a count of the number of views of each thread. Those counts do not jump enough over a 24-hour period for it to be forum access killing the db.

8 (informal observation): the gaps in service do not occur at the time of peak work issue.

9 (assumption): if it were an issue of db loading, I would expect the db queue to be longest when work is being issued, as that is when each individual request takes longest to answer. If the post office clerk runs slower, the queue gets longer fairly soon, not later in the day when the fast clerk is back on duty.

I hope the above is well argued. Each reader will need to decide if it is well informed.

R~~
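A minimal sketch of the kind of randomised backoff point 5 describes; the constants here are illustrative assumptions, not the actual BOINC client values:

    import random

    def next_backoff(previous_s, min_s=60, max_s=4 * 3600):
        # Roughly double the wait after each failed work request, capped at a few
        # hours, then apply random jitter so hosts that failed at the same moment
        # do not all come back at the same moment.
        base = min(max(previous_s * 2, min_s), max_s)
        return base * random.uniform(0.5, 1.0)

    # Example: successive waits for a host that keeps finding no work.
    wait = 60.0
    for _ in range(8):
        wait = next_backoff(wait)
        print(f"retry in ~{wait / 60:.0f} minutes")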
Joined: 2 Sep 04 | Posts: 121 | Credit: 592,214 | RAC: 0
AFAIK, the whole server always begins to buckle at around 70-80 concurrent connections (I never saw it go higher). IMHO, that is a huge server problem indeed, as good (busy) BOINC servers/pipes must handle 100 times that number of connections at peak times. Just my 2 cents from 2 years of observing various BOINC project servers.

It is not unusual for small projects to suffer when attempting to feed a work-depleted community as a worst-case scenario (which by design happens all the time at LHC, ever since its main Project was completed). Larger Projects easily hit their natural network limitations that way, but that's due to sheer mass (SETI comes to mind), and no Project can financially afford a 24/7 gigabit pipe so far (AFAIK).

BOINC by itself will prevent Users from requesting more than 1 update per Host per minute, but that's as good as it gets client-side (assuming everyone involved jumped on a fresh batch of work at the very same time). That means even I could request no more than 24 updates per minute (if I noticed at the right time and was crazy enough about it - which never happens and requires a lot of luck as well). Given the extremely tight connection limit though, that would already be 1/3 of the scheduler/connection capacity for the duration of the request - which indicates how little bandwidth/capacity their hardware/pipe apparently has to offer (my thinking might be wrong and miss some detail though).

Scientific Network : 45000 MHz - 77824 MB - 1970 GB
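A quick check of that capacity fraction, using the poster's own figures (24 hosts and a 70-80 connection ceiling; both are estimates from the post above):

    hosts = 24                     # the poster's host count, as stated above
    updates_per_host_per_min = 1   # BOINC's client-side minimum spacing between requests
    connection_ceiling = 75        # midpoint of the observed 70-80 concurrent connections

    peak_requests = hosts * updates_per_host_per_min
    print(f"{peak_requests} requests/min is about "
          f"{peak_requests / connection_ceiling:.0%} of the observed ceiling")
    # roughly a third - the "1/3 of scheduler capacity" figure quoted above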
Joined: 2 Sep 04 | Posts: 378 | Credit: 10,765 | RAC: 0
It is more likely BOINC client behaviour than user behaviour. When a project has no work, the client holds off on connecting for longer periods of time. When a project has work, the client connects more frequently automatically. When a thousand machines get credit, the clients on those machines will try to connect more often. I'm not the LHC Alex. Just a number cruncher like everyone else here.
Joined: 21 May 06 | Posts: 73 | Credit: 8,710 | RAC: 0
.... How does one go about "...intervening..." to cause the BOINC client to get more work? |
Joined: 21 May 06 | Posts: 73 | Credit: 8,710 | RAC: 0
.... How did you learn that the same DB is used for the forum messages and the workunits? I've been looking for a server status page, but haven't found one...
Joined: 21 May 06 | Posts: 73 | Credit: 8,710 | RAC: 0
.... Might it be the case that, if server capacity is the limiting factor, it automatically limits the total "count of the number of views"? If any attempt is made to have more views than the server has capacity for, that in itself would prevent the counts from jumping. Are we not masking the "peak" concurrent loads when we look at a 24-hour accumulation? Without continuous graphs of the number of concurrent connections, we can't know whether the counts are limited by capacity or by a lack of demand for connections. Or can we? If so, how?
Joined: 21 May 06 | Posts: 73 | Credit: 8,710 | RAC: 0
.... Is it possible that many machines are "at work" machines and that they are NOT actively seeking work during the day (because they are set not to seek work while they are actively being used for their business function)? Then, when their business user goes home, they begin actively seeking BOINC work?
Joined: 13 Jul 05 | Posts: 456 | Credit: 75,142 | RAC: 0
.... Deduced from the fact that the main page shows the error 1040 message relating to db connections on the same occasions as the forums fail with the same 1040 message relating to db connections. It is also plausible because every posting shows the user's credit & RAC, so if there were two DBs they would need to talk to each other an awful lot. R~~
Joined: 13 Jul 05 | Posts: 456 | Credit: 75,142 | RAC: 0
.... They should still be on a 4hr backoff. You make a good point: if it is more than 4hrs since their last attempt, the code may well allow them to try immediately - from memory I think it does. Your point also explains, better than mine does, why we got the time-of-day problem the day after we got new work. R~~
Joined: 13 Jul 05 | Posts: 456 | Credit: 75,142 | RAC: 0
How does one go about "...intervening..." to cause the BOINC client to get more work?

How many other people want to know this? How many of those who want to know are just curious, and how many will use it and (if I am right about what is happening) thereby add to the effect? And should I answer at all, given that some people will (according to my argument) misuse the info? I had to think about that.

Well, I will answer, as it is not hard to work out from the wiki or from other documentation, and it is unfair for some to know and others not.

When you click on Update you reset the backoff algorithm. That project goes from checking every four hours or so to checking every minute. After about 5 tries it starts to back off again, but it takes days before it falls back to the 4-hr backoff. If only 1% of hosts do this, the number of scheduler requests per minute will more than treble.

But if you -- any reader -- are tempted to do this, please at least keep rules 2 and 3 (see my original post). Please don't do it after the Results in Progress curve turns down again on Scarecrow's graphs -- if the graph is sloping down you have already missed out and clicking Update will gain you nothing. If I am right, clicking Update after the graph turns down only helps everyone else kill the db connection.

River~~
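A quick check of the "more than treble" figure, under the same assumptions as before: hosts normally on a ~4-hour backoff, impatient hosts retrying every minute after clicking Update. The fleet size is arbitrary; only the ratio matters.

    hosts = 10_000             # arbitrary fleet size
    backoff_min = 240          # ~4-hour backoff, in minutes
    impatient = 0.01           # 1% of hosts clicking Update and then retrying every minute

    normal_rate = hosts / backoff_min
    frenzy_rate = hosts * impatient / 1 + hosts * (1 - impatient) / backoff_min

    print(f"normal: {normal_rate:.1f} requests/min")
    print(f"frenzy: {frenzy_rate:.1f} requests/min")
    print(f"ratio : {frenzy_rate / normal_rate:.1f}x")   # about 3.4x - "more than treble"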
Joined: 13 Jul 05 | Posts: 456 | Credit: 75,142 | RAC: 0
It is more likely BOINC client behaviour than user behaviour.

In August (the last month for which BoincStats had data) fewer than 300 clients got credit, even taken over several small work releases. And many of those boxes would not have come back for more work in the timescale we are seeing, as the number of WUs taken per box suggests longer caches. Machines with multi-day caches make fewer requests when they have work than when they don't, in contrast to those with small caches.

We did not see this happen when there were larger work releases and >6000 boxes getting credit; we do see it happen when there are small work releases and <300 boxes getting credit. This suggests to me it is not a client-with-work issue.

R~~
Joined: 2 Sep 04 | Posts: 121 | Credit: 592,214 | RAC: 0
It is more likely BOINC client behaviour than user behaviour.

No, since the clients auto-pause, using ever-longer time periods. Given the speed at which the current small batches are sent out, the majority of clients won't even know there was work available (since they connect a day, or eventually even a whole week, too late in many cases). When there is work, the clients lucky enough to send their - by now weekly - request at the right moment connect only once and retrieve the work; that's it. When machines get credit, the clients won't even notice until they connect again some time later, so it does not affect client behaviour in any way. It's entirely a server issue, nothing more...

Scientific Network : 45000 MHz - 77824 MB - 1970 GB
Joined: 13 Jul 05 | Posts: 456 | Credit: 75,142 | RAC: 0
No, since the clients auto-pause, using ever-longer time periods. Given the speed at which the current small batches are sent out, the majority of clients won't even know there was work available.

The unlucky clients don't decrease their connect interval - that has got to be right. But the lucky ones clearly do. If they don't get a full cache they are back to top up soon afterwards, with the delay set by a config setting (as John Keck mentioned in another thread). If they do get a full cache they are back after they have crunched a fair amount of what they got, which may well be sooner than they would have come back on their previous backoff pattern. Either way, the lucky box comes back sooner than it would have done if it had missed out.

Even if it is *partly* a server issue, there has got to be some human or client-side action that causes this to happen when work comes back but not in the gaps between work. My money is still on the human factor, but I admit I am not as sure as I was.

R~~
Joined: 21 May 06 | Posts: 73 | Credit: 8,710 | RAC: 0
.... Is English your first language? The use of the word "rule" in this context of volunteer computing is often offensive to some. Perhaps you mean "suggestion."