Did everyone get work 02 Nov UTC?
Author | Message |
---|---|
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
We had work on the server for 9 or 10 hours, from just before midnight UTC on 1st Nov till around 0930 UTC on the 2nd, and no sign of connection issues. Looking at Scarecrow's graphs, it is obvious that work was being added to the server at around 8,000 tasks/hour from 2300 to 0100, and at over double that rate from 0100 to 0200. During those three hours the server was filling faster than we could take the work off it, and there was enough work for it to keep handing it out for another 7 hours.

The other interesting feature is that at around 0300 the rate of take-up of work drops. Funnily enough, that is around four hours after the work started to be issued, i.e. when every client on a 4-hour backoff has had one bite at the cherry. My interpretation of the slower rise after that is a combination of clients coming back after reporting completed work, and other boxes being added by users who don't like leaving their boxes asking for work from an empty project.

All clients that have been left long term on the 4-hour backoff cycle should therefore have had the chance to get work. Two reasons you might have missed out: if you had got into a 1-day backoff, or a 7-day backoff, you might have missed the work -- these only happen (I think) after network problems. Also, if you had disabled LHC (suspended, detached, or set "no more work") you would have missed out if you sleep normal EU hours -- the lesson here is to leave the client asking for work; it costs almost nothing.

So, apart from those two reasons, did anyone else miss out this time? River~~ |
Send message Joined: 2 Aug 05 Posts: 33 Credit: 2,329,729 RAC: 0 |
|
Send message Joined: 29 Nov 05 Posts: 8 Credit: 105,015 RAC: 0 |
Yeah, 'cos I need to sleep and work and I don't have my computers on 24/7 ... I'm such a lightweight! |
Send message Joined: 19 May 06 Posts: 20 Credit: 297,111 RAC: 0 |
I didn't completely miss out, no. But I only got 2 work units. With over 30000 still in progress right now, I would have hoped for more, but I suppose people with huge caches are hoarding the work (again, as usual). |
Send message Joined: 14 Jul 05 Posts: 41 Credit: 1,788,341 RAC: 0 |
Not so true ... I just take the good advice of leaving computers connected to the project. I got lucky and was suspending other projects for a different one and got a good bunch of LHC wu's. Now if we could just see that reflected in the stats ... that would be awesome! |
Send message Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0 |
.... Yup - my cache was full with other projects.... no work for me this time. |
Send message Joined: 2 Sep 04 Posts: 4 Credit: 388,462 RAC: 0 |
I was able to snag a few wu's... still crunching! Yeah me!!! Thanks, Boogyman Munster |
Send message Joined: 29 Sep 04 Posts: 25 Credit: 77,910 RAC: 0 |
I got from 1 to 3 WU's on every machine! All have 0.1 day cache, and all work on a bunch of projects. Happy, happy, joy, joy!!! |
Send message Joined: 13 Jul 05 Posts: 12 Credit: 8,463 RAC: 0 |
I got a few of them, but until they sort out XML stats, I'm not going to do much work here, even if there's unlimited work available. |
Send message Joined: 30 Sep 06 Posts: 9 Credit: 5,298 RAC: 0 |
I got a few of them, but until they sort out XML stats, I'm not going to do much work here, even if there's unlimited work available. In the same boat here. Got about 12 WUs, finished & returned them, and set lhcathome to No New Tasks, until stats is sorted. |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
I got a few of them, but until they sort out XML stats, I'm not going to do much work here, even if there's unlimited work available. I look at this differently - the stats will be there eventually and the work I do now will show whenever they are finally exported - and, of course, if some people continue to do work and others don't then when the stats come back they will be to my advantage. R~~ |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
....Yup - my cache was full with other projects.... John Keck advocates that with N projects the max cache to keep all the projects hungry at all times is 0.4 * Deadline / N; I suggest a somewhat larger setting of Deadline / (N+2). Of course this is no help if you need a large cache for some other reason, nor for people on dial-ups that only connect a limited number of times in the day/week.

So the two reasons have grown to six:

1. Client in 1-day or 7-day standoff (probably due to net problems, which may have been local to the box)
2. Project suspended etc.
3. Machine powered down
4. Prefs set to prevent network access during the entire period that work was available (nobody has said this yet, but the work release must have covered roughly a working day for *some* timezone, even tho this one missed both the US and EU working days)
5. Machine had a fault at just the time when the work came available
6. Machine in "No Work Fetch" mode due to work held for other projects

My commiserations to anyone who did not get work - and did anyone else miss out for a reason not listed in those six, please? R~~ |
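The two cache rules of thumb quoted in this post can be sketched in a few lines. This is only an illustration of the arithmetic; the 6.5-day deadline and 4-project count below are example values, not settings prescribed by the thread.

```python
# Sketch of the two cache-size rules of thumb quoted above.
# Values for deadline and project count are illustrative only.

def keck_max_cache(deadline_days, n_projects):
    """John Keck's rule: 0.4 * Deadline / N."""
    return 0.4 * deadline_days / n_projects

def river_max_cache(deadline_days, n_projects):
    """River~~'s somewhat larger rule: Deadline / (N + 2)."""
    return deadline_days / (n_projects + 2)

deadline = 6.5   # shortest deadline among attached projects, in days
projects = 4     # number of attached projects

print(f"Keck:  {keck_max_cache(deadline, projects):.2f} days")   # 0.65
print(f"River: {river_max_cache(deadline, projects):.2f} days")  # 1.08
```

As the post says, the second rule does come out somewhat larger for the same inputs.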
Send message Joined: 27 Sep 04 Posts: 282 Credit: 1,415,417 RAC: 0 |
I got enough work .... :-) I won't mind more. My clients were always set to receive work from LHC, and I manually unlocked other projects to get a few WU's elsewhere.... <-- me = happy :-) |
Send message Joined: 2 Sep 04 Posts: 121 Credit: 592,214 RAC: 0 |
I missed it because I have a 2-3 day backoff myself and all machines have LHC on "Suspended" for the obvious problems with LTD. Unless I manually get to know or change the status of LHC, I won't even notice there is work until it really flows normally again (so far I haven't seen anything from the Project that would earn it being un-suspended again). IMHO there should be an additional BOINC error code ;) - error 9220 : Staff disconnected from Project - come back later. Scientific Network : 45000 MHz - 77824 MB - 1970 GB |
Send message Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0 |
[quote].... John Keck advocates that with N projects the max cache to keep all the projects hungry at all times is 0.4 * Deadline / N; I suggest a somewhat larger setting of Deadline / (N+2). ....[/quote] Can you explain why these magic numbers were chosen? And what is the difference between your suggestion and his designed to optimize? Also, where does one find the value of Deadline? And finally, is N the number of currently ACTIVE (not suspended) projects or total projects? |
Send message Joined: 13 Jul 05 Posts: 456 Credit: 75,142 RAC: 0 |
.... I can explain mine. I've never understood his - but I have tested his in practice, with the results as described. It may be that his summarises experience rather than comes from theory.

The theory behind my rule of thumb: assume (contrary to fact) that we can vary D to keep out of NWF; we will drop this assumption later. The No Work Fetch algorithm is designed to make sure a deadline is not exceeded in the worst case, where a task completes a zillisecond after a connect and the next connect is the full interval away. So there needs to be a gap of C to allow for that. There needs to be another gap of C to fit in a cachefull of new work we are about to download, as otherwise the client will refuse to ask for work. Then I allow a full cache for every project. So we would expect to be in NWF if D < (N+2) * C; to stay out of NWF we want D > (N+2) * C.

Now it is time to drop the false assumption that we can vary D. In fact D and N are givens, and C is the value we can control. Rearranging as an algebra exercise gives C < D / (N+2).

Why then does my formula sometimes go into NWF when it is designed not to? Well, the expected run times don't exactly fit the cache size, and the client will always go for the extra task that straddles the cache limit rather than stop short. Also, a task may actually run for longer than it claimed. In either event the client goes into NWF, but only for a few hours, as my formula puts it right on the margin.

And what is the difference between your suggestion and his designed to optimize? Can't comment on the design - the practical outcome is that my formula dips briefly into NWF for a few hours quite often, and his doesn't. Whether John found his formula by experiment or theory I will leave him to say. John's formula allows more slack than mine, and this is clearly related to the fact that his is better than mine at keeping out of NWF entirely.

Also, where does one find the value of Deadline? Look at a task that has just been downloaded from a project, and subtract the current date from the deadline. Or look at a task that has not yet reported on the project website, and work out (deadline date) - (date sent). On LHC it is currently just over 6.5 days, Rosetta currently 10 days (it has been 7 and 14 in the last few months), Leiden 6 days -- but LHC and Rosetta do change from time to time, so it is worth re-checking. As projects vary, D in the formula is the shortest of these.

And finally, is N the number of currently ACTIVE (not suspended) projects or total projects? The number is the same, active = total. This is because the formulae apply only when you leave the suspend button alone. Every time you suspend or unsuspend, you break the long term debt / short term debt balance, and the client will behave oddly for a few days. You can drive BOINC hands on, or hands off, but you need to stick to one or the other for at least 2*D at a time. In particular, if you use the suspend button to cure a NWF situation, you almost guarantee another in about C days' time. Using either formula, expect to see NWF a few times for the first 2*D, till the client settles down to work the way it is meant to.

Hope that helps. I appreciate the level of detail in your questions, here and in other threads, which indicates that you are really engaging with the points made. R~~ |
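The derivation in this post condenses to a single inequality, which can be sketched as a small check. This is only an illustration of the stated rule of thumb, not the actual BOINC client's work-fetch logic, and the sample numbers are illustrative.

```python
# Minimal sketch of the No Work Fetch (NWF) reasoning described above,
# under the post's simplifying assumptions (one cachefull C per project,
# plus two extra C for worst-case connect timing and the pending download).
# Not the real BOINC scheduler.

def expect_nwf(cache_days, deadline_days, n_projects):
    """Expect NWF when the worst-case commitment of (N+2) cachefulls
    would overrun the shortest deadline: D < (N+2) * C."""
    return deadline_days < (n_projects + 2) * cache_days

# With a ~6.5-day deadline and 4 projects, the margin sits at
# C = 6.5 / 6, roughly 1.08 days:
print(expect_nwf(1.0, 6.5, 4))   # False - below the margin
print(expect_nwf(1.5, 6.5, 4))   # True - above the margin
```

Setting the cache just under D / (N+2) keeps the check false, which is exactly the rearrangement given in the post.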
Send message Joined: 14 Jul 05 Posts: 35 Credit: 71,636 RAC: 0 |
Got NO LHC work at all. Why? Well, I think it was because, after months of being on 24/7, I decided to switch off while I was away on holiday this last week. Bloody typical - all that work and I missed it. |
Send message Joined: 4 Sep 05 Posts: 112 Credit: 2,112,822 RAC: 2,167 |
Got NO LHC work at all. Colin, here are my two rules to give anyone the best chance at getting work units from any project: (1) set the cache to 2 hours. (2 hours = 1/12 day = 0.0833333333333333333333333333333333333...) It fits. (2) leave the PCs attached to the project. This way, you're banging on the project door every 2 hours and saying, "Are we there yet?" ;-) The response is binary! Click here to join the #1 Aussie Alliance on LHC. |
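The long decimal in this post is just 2 hours expressed as a fraction of a day, the unit BOINC's connect-interval preference is entered in. A quick sketch of the conversion (the 70-minute figure matches the question asked later in the thread):

```python
# Convert a cache size in hours to the day-fraction used by BOINC's
# "connect about every X days" preference.

def cache_fraction(hours):
    return hours / 24.0

print(round(cache_fraction(2), 4))       # 2 hours   -> 0.0833
print(round(cache_fraction(70 / 60), 4)) # 70 minutes -> 0.0486
```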
Send message Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0 |
.... Well thank you for the kind words and the effort to explain. A lot to contemplate. Is it correct that Resource share is not considered at all in your formulas? I have several projects. One of them, SZTAKI seems not well behaved. Sometimes the WUs don't complete in anywhere near the estimated time. Sometimes they seem to not complete at all. So I suspend it "a lot" and run it every few weeks to see if it is doing any better. |
Send message Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0 |
.... Why 2 hours instead of say 70 minutes? (.05) |