Message boards : Number crunching : When you see work is around ...
Send message Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0 |
... Some of us here in the Americas work as much as - well 8 hours in a day - some even more;-) so even with a 4 hour back off, they might all be ready to ask for work at the same time... It looks like the admins are limiting concurrent connections. Isn't that what the 1040 error implies? |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
.... Perhaps you can guess my native language if you look at my a/c details (suggestion: click my name on the left of the posting).

I meant rule, especially for rule 3. Of course I do not have authority to make or enforce rules here - so at most they could only ever be suggested rules. But I most certainly meant something stronger than a mere suggestion by the time we get to rule 3. Making a client hammer a server that is already complaining of too many connections is either an act of ignorance (which the earlier content of my first post was intended to dispel) or of sheer selfishness (putting the imagined personal advantage to my client ahead of the damage to everyone else).

Rules are about preventing harm. Suggestions are about how to do things better when harm is not an issue. Example: it is reasonable to suggest that we all be polite to someone who stops us in the street to ask directions; not mugging them requires a rule.

Kicking the server once it is already down (rule 3) spoils things for all the other volunteers. It cannot possibly help anything and may well delay the recovery process. It deserves to have a rule. Or so I am suggesting ;-)

R~~ |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
It looks like the admins are limiting concurrent connections.

Absolutely so. Error 1040 comes from MySQL, and if I remember rightly the number of connections is set in the MySQL config file. But my point is that even if they doubled the limit, or multiplied it by 10 or by 100, the rush for work might well still overwhelm it. And at some point there are technical limits to how high the admins can set that number. When the circuit breakers trip, a bigger circuit breaker may be on the list of things to think about, but it is not the first thing you try.

FalconFly's experience of other projects may be misleading: handling ~6,000 active hosts *should* take a less powerful box and more restrictive settings than handling well over 10,000 active hosts, as several other projects do (e.g. SETI 327,000 active hosts, Einstein 87,000, Rosetta 71,000; from BoincStats, 28 Oct).

Having said that, it may be that a higher limit than the present one is more suitable. I agree it is one thing to be looked at (when we have some admins, of course). It is even possible that the limit was never adjusted from the MySQL install default when they built the new server.

Is anyone clear whether the server breaks started in August (when we had small work releases on the old server) or not until September (after the new servers were installed)? If the symptoms only appeared after 5th September then I accept it does look like a server config issue; otherwise I still say there is more to it than *only* admin settings.

River~~ |
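For anyone wondering where that limit lives: this is only a rough sketch of how an admin would inspect it on a MySQL server of that era, not LHC@home's actual settings. The persistent value is normally set as max_connections under the [mysqld] section of my.cnf, and the install default on old MySQL versions was around 100.

```sql
-- Check the configured connection limit (error 1040 fires when it is hit)
SHOW VARIABLES LIKE 'max_connections';

-- High-water mark of connections actually used since the server started
SHOW STATUS LIKE 'Max_used_connections';

-- Raise it on the running server (illustrative value only; this is lost on
-- restart, so the change would also need to go into my.cnf to persist)
SET GLOBAL max_connections = 300;
```

As the posts above argue, though, simply raising the ceiling treats the circuit breaker rather than the overload.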
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
... It seems that when the 'too many connections' issue is present, the server status page is also unavailable, so the table shows "n/a" during those periods. ... Closer following of the status shows it lags a little behind the actual failure - so the table (and the gaps in the graphs) give an indication of fail/resume times, but the actual fail/resume times could be earlier by up to (I guess) 15 min. That is good enough for most needs. R~~ |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
Today's events (Sunday 29th Oct) rule out one explanation: that the trigger of the error 1040 "too many connections" is the automated "Work only between" settings. Today the server went into "too many connections" at around the middle of the day UTC, which on a Sunday is unlikely to be a time when automated settings are relevant.

The down time followed the introduction of more work, meaning that either of the following two explanations is still plausible:
- people reacting to seeing work on the server
- clients automatically reacting to getting less than a full cache

In fairness to those suggesting the latter, 8 of my clients did get work (one task per CPU) and all of them went back inside the same minute to ask for more, one of them actually getting a third task for its 2 CPUs. I had not realised that BOINC clients could ever go back in within the same minute - and having seen that happen, I am coming round to this as a possible cause.

River~~ |
Send message Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0 |
...Absolutely so. Error 1040 comes from mySQL and if I remember rightly the number of connections is set in the MySQL config file. Why should they (the admins) INCREASE the number of connections? If, as you assert, the server is failing due to overload, then reducing load by throttling the number of concurrent connections is "the right thing to do." The clients will pick up the work as they are able to connect. |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
...Absolutely so. Error 1040 comes from MySQL and if I remember rightly the number of connections is set in the MySQL config file.

That was exactly what I was trying to say. The problem is not with the circuit breaker; it is elsewhere. Even if the issue is a server one, the solution needs to be one that reduces the connections, not one that lets more in.

R~~ |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
Yesterday's events offer some comfort to both sides of the debate, in my opinion. One observation strengthens my idea that user behaviour is a factor in the connections problem; another makes the idea that it is client-led more plausible.

The fact that the connection problem lasted all afternoon and evening UTC (which we have not seen on work days) suggests to me that it is user related. Although it is true that clients set to avoid work hours would have been running, there would not have been the bunching effect claimed for the evenings when many clients come back on at around the same time. "Avoid work hours" clients would presumably have been running since late Friday, and the random backoffs should have evened out this load.

Earlier that day, from around 0900 UTC onwards, there were occasional "connections exceeded" messages from the website, so my impression is that the system was in fact running close to the edge all day. This is supported by the fact that even when numbers are shown in Scarecrow's graphs/table they do not change all morning. So the pattern of outages supports my view that the problem is triggered by users.

On the other hand: 8 of my clients did get work that morning (one task per CPU) and each one went back for more work inside the same minute as it had got the first WU. I had not seen a client go back in under 1 minute before and had assumed the client would prevent this. It seems that when work is issued this server lets the client come back as soon as the file downloads are complete - these took around 40 sec on each of my clients. (btw - we are talking about a *different* minute for each client!)

The fact that this happens certainly offers a technical explanation that supports the idea that the issue is server-led, and supports a server config change to fix it. The config tweaks suggested by John Keck in the "Fairer Distribution of Work" thread look all the more useful, as these would prevent clients coming back within 10 min or so.

So mixed evidence, at least in my analysis. Overall I am now more open to the idea that it is server-led, but still not convinced.

River~~ |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
Sorry for the multiple posting, everyone. I have just discovered that I was posting onto an offline copy of the forum, under the impression that my first three attempts had not posted successfully.

Anyway, Monday's long outage has finally convinced me that this is not a human issue, nor an issue of clients set not to crunch during working hours. So that leaves some kind of server problem, or a more general client problem (perhaps related to the way clients can get back in for a second helping inside a minute).

River~~ |
Send message Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0 |
Sorry for the multiple posting everyone.

Or, it could be that the admins are simply limiting connections to match capacity... |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
or, it could be that the admins are simply limiting connections to match capacity...

Set it and leave it is the usual BOINC way - especially when there are no dedicated admins. I'd be surprised if the settings have been changed in the last fortnight.

If you mean that the settings are already as they are to match capacity, then it is still odd that the problem seems worse now than it has been in the past. Last Easter, for example, we did not have a nearly 2-day outage when work came onto the server. With shiny new servers I'd expect capacity to have gone up, if anything, not down.

R~~ |
Send message Joined: 1 Sep 04 Posts: 36 Credit: 78,199 RAC: 0 |
8 of my clients did get work that morning (one task / CPU) and each one went back for more work inside the same minute as it had got the first WU. I had not seen a client go back in under 1 minute before and had assumed the client would prevent this. It seems that when work is issued this server lets the client come back as soon as the file downloads are complete - these took around 40 sec on each of my clients. (btw - we are talking about a *different* minute for each client!)

For any failed connection - uploads, downloads or scheduler requests - the BOINC client uses a random backoff of between 1 minute and 4 hours. Also, if there have been 10 failed scheduler requests in succession, the client tries to re-download the project web page (master URL), and if this fails you get a 7-day deferral in old clients; at least in v5.6.xx and later the deferral is only 24 hours.

For successful scheduler replies, on the other hand, there are some additional rules:
1. If the client gets work, or didn't ask for work: <min_sendwork_interval> * 1.01 (the 1.01 is in case the user's clock runs fractionally faster than the server's).
2. If it asked for but failed to get work: whichever is higher of <min_sendwork_interval> * 1.01 and the client's random backoff.
3. Various server-side error conditions: 1 hour.
4. Reached the daily quota: midnight server time + up to 1 hour, randomly.
5. Doesn't meet system requirements, like OS, min_boinc_client, usable disk space or memory: 24 hours.
6. If the expected run time is so long (for example due to a low resource share, or a cache size larger than the deadline) that it can't finish one more result: up to a 48-hour deferral, the size depending on cache size and so on.

For LHC@home, #2 is randomly between 1 minute and 4 hours, and is the common behaviour when no work is available. #1, on the other hand, is only 7 seconds, meaning that if a client gets at least 1 task, but not enough to fill its cache, the same client can ask for more work 7 seconds later...

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
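Purely as a reader's aid, here is a small Python sketch of the deferral rules described above. It is not the actual BOINC client or scheduler code - the outcome names are made up for illustration, and the only project-specific number plugged in is the 7-second <min_sendwork_interval> reported for LHC@home in the post.

```python
import random

MIN_SENDWORK_INTERVAL = 7        # seconds; value reported for LHC@home above
MAX_RANDOM_BACKOFF = 4 * 3600    # random backoff upper bound: 4 hours


def next_request_delay(outcome):
    """Roughly how long (in seconds) a client defers its next scheduler request,
    per the rules summarised above; rules 4 and 6 are omitted because they
    depend on server midnight and on cache/deadline sizes."""
    base = MIN_SENDWORK_INTERVAL * 1.01        # 1% margin for clock drift

    if outcome == "connection_failed":         # e.g. error 1040, server down
        return random.uniform(60, MAX_RANDOM_BACKOFF)
    if outcome in ("got_work", "did_not_ask"):                  # rule 1
        return base
    if outcome == "asked_but_got_no_work":                      # rule 2
        return max(base, random.uniform(60, MAX_RANDOM_BACKOFF))
    if outcome == "server_error":                               # rule 3
        return 3600
    if outcome == "requirements_not_met":                       # rule 5
        return 24 * 3600
    raise ValueError(f"unknown outcome: {outcome}")


# A client that received one task but less than a full cache may be back ~7 s later:
print(round(next_request_delay("got_work"), 2))   # 7.07
```

The sketch just makes the asymmetry visible: the no-work path backs off for up to 4 hours, while the got-some-work path returns almost immediately.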
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
Hi Ingleside, thanks for the useful injection of fact.

#1, the interval between a successful issue of work and the next request, is clearly too low. At the smallest it should be about the time it takes to crunch a short task on a fast box, and on a project with sparse work it could afford to be longer (as all clients, fast and slow, should have other projects' work anyway if they are on LHC).

R~~ |
Send message Joined: 1 Sep 04 Posts: 36 Credit: 78,199 RAC: 0 |
Hi Ingleside,

Well, with some work taking less than a minute, I'm not sure how long the delay should be... Looking at other projects, the two longest intervals between RPCs are WCG at 5 minutes and SETI at 10 minutes.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
Send message Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0 |
Hi Ingleside, Is this based on your goal of distributing work as widely as possible among as many different machines as possible? |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
Hi Ingleside,

That is another reason for extending the interval, but not the reason here. The reason here is that a server can get more work done if it is not driven to the cut-off point. The process of cutting off loses more service than would be lost by a controlled slow-down. That is the design purpose of the setting Ingleside pointed us to. The cure for the connection-limit problem is to extend the interval between client connections, starting with the most rapid connections, which happen in exactly the circumstances where we see the limit being exceeded.

May I give you an analogy? There is a motorway, the M25, that runs all the way round London, England. The Department of Transport has doubled its capacity at peak times by slowing the traffic down. Because the new speed limits keep cars moving, more cars per hour get through in the rush hour than used to when everyone was legally allowed to go faster but was actually obstructed by tail backs.

To come back to this issue, it might look as if being asked to wait 300 sec instead of 7 will slow things down for those clients that got work; but if it avoids the half-day outages then things will actually speed up, for those clients and for everyone else. There is no point going back 7 sec later if the db is dead by then. Keeping the db alive 24/7 would also be less annoying for everyone.

R~~ |
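For reference, the project-side knob Ingleside named is <min_sendwork_interval>, which lives in the BOINC server's config.xml. A minimal sketch of the change River is arguing for - the 300 seconds is just his illustrative figure from the post above, not a project recommendation, and the surrounding elements are abbreviated:

```xml
<!-- BOINC project config.xml (server side) - illustrative excerpt only -->
<boinc>
  <config>
    <!-- minimum seconds between scheduler requests from the same host -->
    <min_sendwork_interval>300</min_sendwork_interval>
  </config>
</boinc>
```

Whether 300 seconds is the right figure is exactly the trade-off raised earlier: with some tasks finishing in under a minute, too long an interval starts to leave fast hosts idle.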
Send message Joined: 2 Sep 04 Posts: 33 Credit: 2,057,517 RAC: 0 |
U were doing good till u got to "tail backs". The colonials will not understand "tail backs". Heck bubba, Us'n down here in the good ol' Republic of Texas can nary figure it. ;-)

- da shu @ HeliOS, "Free software" is a matter of liberty, not price. To understand the concept, you should think of "free" as in "free speech", not as in "free beer". |
Send message Joined: 24 Nov 05 Posts: 12 Credit: 8,333,730 RAC: 0 |
Great, just great! I grew up in Texas and don't understand what nary means. :( I am a Geek. We surf in the dark web pages no others will enter. We stand on the router and no one may pass. We live for the Web. We die for the Web. |