Message boards : Number crunching : Did everyone get work 02 Nov UTC?

River~~

Send message
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15314 - Posted: 2 Nov 2006, 18:54:49 UTC

We had work on the server for 9 or 10 hours, from just before midnight UTC on 1 Nov until around 0930 UTC on 2 Nov, with no sign of connection issues. Looking at Scarecrow's Graphs it is obvious that work was being added to the server at around 8000 tasks/hour from 2300 to 0100 and at over double that rate from 0100 to 0200. During those three hours the server was filling faster than we could take the work off it, and there was enough work for it to keep handing it out for another 7 hours.
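As a rough check on those figures, here is a small Python sketch that totals the work added during the three hours. It takes the rates as approximate readings from the graphs (about 8000 tasks/hour for two hours, then at least double that for one hour), so the result is only a lower bound. The variable names are illustrative.

```python
# Rough lower bound on tasks added between 2300 and 0200 UTC,
# using the approximate rates quoted from the graphs (assumed values).
phases = [
    (2, 8000),   # 2300-0100: about 8000 tasks/hour for 2 hours
    (1, 16000),  # 0100-0200: "over double" that rate, so at least 16000/hour
]

# Sum hours * rate over both phases.
total_added = sum(hours * rate for hours, rate in phases)
print(total_added)  # 32000
```

So something over 32000 tasks went onto the server in that period, on these assumed rates.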

The other interesting feature is that at around 0300 the rate of take-up of work drops. Funnily enough, that is around four hours after the work started to be issued, i.e. when every client on a 4 hour backoff has had one bite at the cherry.

My interpretation of the slower rise after that is a combination of clients coming back after reporting completed work, and other boxes being added by users who don't like leaving their boxes asking for work from an empty project.

All clients that have been left long term on the 4hr standoff cycle should therefore have had the chance to get work.

Two reasons you might have missed out

If you had got into a 1 day backoff, or a 7 day backoff, you might have missed the work - these only happen (I think) after network problems.

Also if you had disabled LHC (suspended, detached, or set "no more work") you would have missed out if you sleep normal EU hours -- the lesson here is to leave the client asking for work, it costs almost nothing.
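A back-of-envelope sketch of the first reason, in Python. It assumes, as a simplification, that a client retries at a fixed interval equal to its backoff, and asks how many contacts are guaranteed to fall inside the window when work was available. The function name and the 9 hour window (the lower end of "9 or 10 hours") are illustrative choices.

```python
def guaranteed_attempts(window_hours: float, backoff_hours: float) -> int:
    """Worst-case number of contacts inside the window for a client that
    retries every backoff_hours (assumed fixed interval, unknown phase)."""
    return int(window_hours // backoff_hours)


WINDOW_HOURS = 9  # assumed length of the period with work on the server

for label, backoff in [("4 hour", 4), ("1 day", 24), ("7 day", 168)]:
    print(label, "backoff:", guaranteed_attempts(WINDOW_HOURS, backoff))
```

On these assumptions a client on the 4 hour cycle is guaranteed at least two contacts, while a client on a 1 day or 7 day backoff is guaranteed none and gets work only if its retry happens to land inside the window.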

So, apart from those two reasons, did anyone else miss out this time?

River~~
ID: 15314 · Report as offensive     Reply Quote
Andreas

Send message
Joined: 2 Aug 05
Posts: 33
Credit: 2,328,412
RAC: 3
Message 15316 - Posted: 2 Nov 2006, 20:07:14 UTC - in response to Message 15314.  
Last modified: 2 Nov 2006, 20:07:26 UTC


So, apart from those two reasons, did anyone else miss out this time?

River~~


I missed out, mostly because my processor died :-/
ID: 15316 · Report as offensive     Reply Quote
Rob Lilley

Send message
Joined: 29 Nov 05
Posts: 8
Credit: 105,015
RAC: 0
Message 15317 - Posted: 2 Nov 2006, 22:02:42 UTC - in response to Message 15316.  


So, apart from those two reasons, did anyone else miss out this time?

River~~


Yeah, 'cos I need to sleep and work and I don't have my computers on 24/7 ... I'm such a lightweight!
ID: 15317 · Report as offensive     Reply Quote
Dronak
Avatar

Send message
Joined: 19 May 06
Posts: 20
Credit: 297,111
RAC: 0
Message 15318 - Posted: 2 Nov 2006, 23:42:01 UTC

I didn't completely miss out, no. But I only got 2 work units. With over 30000 still in progress right now, I would have hoped for more, but I suppose people with huge caches are hoarding the work (again, as usual).
ID: 15318 · Report as offensive     Reply Quote
Profile [B^S] ShanerX

Send message
Joined: 14 Jul 05
Posts: 41
Credit: 1,788,341
RAC: 0
Message 15322 - Posted: 3 Nov 2006, 0:50:03 UTC

Not so true ... I just take the good advice of leaving computers connected to the project. I got lucky and was suspending other projects for a different one and got a good bunch of LHC wu's.

Now if we could just see that reflected in the stats ... that would be awesome!

ID: 15322 · Report as offensive     Reply Quote
Philip Martin Kryder

Send message
Joined: 21 May 06
Posts: 73
Credit: 8,710
RAC: 0
Message 15327 - Posted: 3 Nov 2006, 2:19:53 UTC - in response to Message 15314.  

....
So, apart from those two reasons, did anyone else miss out this time?

River~~


Yup - my cache was full with other projects....
no work for me this time.




ID: 15327 · Report as offensive     Reply Quote
Profile Boogyman Munster
Avatar

Send message
Joined: 2 Sep 04
Posts: 4
Credit: 387,934
RAC: 81
Message 15328 - Posted: 3 Nov 2006, 2:30:43 UTC

I was able to snag a few wu's... still crunching! Yeah me!!!
Thanks,
Boogyman Munster



ID: 15328 · Report as offensive     Reply Quote
genes
Avatar

Send message
Joined: 29 Sep 04
Posts: 25
Credit: 77,863
RAC: 4
Message 15329 - Posted: 3 Nov 2006, 4:03:54 UTC

I got from 1 to 3 WU's on every machine! All have 0.1 day cache, and all work on a bunch of projects. Happy, happy, joy, joy!!!
ID: 15329 · Report as offensive     Reply Quote
Profile Webmaster Yoda

Send message
Joined: 13 Jul 05
Posts: 12
Credit: 8,463
RAC: 0
Message 15330 - Posted: 3 Nov 2006, 5:37:07 UTC

I got a few of them, but until they sort out XML stats, I'm not going to do much work here, even if there's unlimited work available.



ID: 15330 · Report as offensive     Reply Quote
darkclown

Send message
Joined: 30 Sep 06
Posts: 9
Credit: 5,298
RAC: 0
Message 15331 - Posted: 3 Nov 2006, 6:21:02 UTC - in response to Message 15330.  

I got a few of them, but until they sort out XML stats, I'm not going to do much work here, even if there's unlimited work available.



In the same boat here. Got about 12 WUs, finished & returned them, and set lhcathome to No New Tasks, until stats is sorted.
ID: 15331 · Report as offensive     Reply Quote
River~~

Send message
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15332 - Posted: 3 Nov 2006, 6:29:08 UTC - in response to Message 15330.  

I got a few of them, but until they sort out XML stats, I'm not going to do much work here, even if there's unlimited work available.


I look at this differently - the stats will be there eventually and the work I do now will show whenever they are finally exported - and, of course, if some people continue to do work and others don't then when the stats come back they will be to my advantage.

R~~
ID: 15332 · Report as offensive     Reply Quote
River~~

Send message
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15333 - Posted: 3 Nov 2006, 6:51:08 UTC - in response to Message 15327.  

....
So, apart from those two reasons, did anyone else miss out this time?
...
Yup - my cache was full with other projects....
no work for me this time.


John Keck advocates that with N projects the max cache to keep all the projects hungry at all times is 0.4 * Deadline / N; I suggest a somewhat larger setting of Deadline / (N+2).
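For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the two rules of thumb. The function names are made up for illustration; the deadline is in days and the result is a cache size in days.

```python
def keck_cache_days(deadline_days: float, n_projects: int) -> float:
    """Rule of thumb attributed to John Keck: 0.4 * Deadline / N."""
    return 0.4 * deadline_days / n_projects


def river_cache_days(deadline_days: float, n_projects: int) -> float:
    """The alternative suggested in this post: Deadline / (N + 2)."""
    return deadline_days / (n_projects + 2)


# Example values only: a 6 day deadline shared between 3 projects.
print(round(keck_cache_days(6, 3), 2))   # 0.8
print(round(river_cache_days(6, 3), 2))  # 1.2
```

Note that Deadline / (N+2) is the larger of the two only when there are two or more projects; with a single project, 0.4 * Deadline is larger than Deadline / 3.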

Of course this is no help if you need a large cache for some other reason, nor for people on dial-ups that only connect a limited number of times in the day/week.

So the two reasons have grown to six:

1. Client in 1-day or 7-day standoff (probably due to net problems which may have been local to the box)

2. Project suspended etc

3. Machine powered down

4. Prefs set to prevent network access during the entire period that work was available (nobody has said this yet, but the work release must have covered roughly a working day for *some* timezone, even tho this one missed both the US and EU working days)

5. Machine had a fault at just the time when the work came available

6. Machine in "No Work Fetch" mode due to work held for other projects.

My commiserations to anyone who did not get work. Did anyone else not get work for a reason not listed in those six, please?

R~~
ID: 15333 · Report as offensive     Reply Quote
Profile sysfried

Send message
Joined: 27 Sep 04
Posts: 282
Credit: 1,415,417
RAC: 0
Message 15337 - Posted: 3 Nov 2006, 10:26:27 UTC - in response to Message 15333.  

I got enough work .... :-) I wouldn't mind more; my clients were always set to receive work from LHC and I manually unlocked other projects to get a few WU's elsewhere....

<-- me = happy :-)
ID: 15337 · Report as offensive     Reply Quote
Profile FalconFly
Avatar

Send message
Joined: 2 Sep 04
Posts: 121
Credit: 592,214
RAC: 0
Message 15339 - Posted: 3 Nov 2006, 14:18:56 UTC - in response to Message 15337.  
Last modified: 3 Nov 2006, 14:22:20 UTC

I missed it because I have a 2-3 day backoff myself and all machines have LHC on "Suspended" for the obvious problems with LTD.

Unless I find out manually or change the status of LHC, I won't even notice there is work until it really flows normally again (so far I haven't seen anything from the Project that would earn it being un-suspended again).

IMHO there should be an additional BOINC error code ;)
-error 9220 : Staff disconnected from Project - come back later
Scientific Network : 45000 MHz - 77824 MB - 1970 GB
ID: 15339 · Report as offensive     Reply Quote
Philip Martin Kryder

Send message
Joined: 21 May 06
Posts: 73
Credit: 8,710
RAC: 0
Message 15340 - Posted: 3 Nov 2006, 15:33:13 UTC - in response to Message 15333.  

....
John Keck advocates that with N projects the max cache to keep all the projects hungry at all times is 0.4 * Deadline / N; I suggest a somewhat larger setting of Deadline / (N+2).
....

Can you explain why these magic numbers were chosen?

And what the difference between your suggestion and his is designed to optimize?

Also, where does one find the value of Deadline?

And finally, is N the number of currently ACTIVE (not suspended) projects or total projects?

ID: 15340 · Report as offensive     Reply Quote
River~~

Send message
Joined: 13 Jul 05
Posts: 456
Credit: 75,142
RAC: 0
Message 15342 - Posted: 3 Nov 2006, 19:13:48 UTC - in response to Message 15340.  
Last modified: 3 Nov 2006, 19:15:12 UTC

....
John Keck advocates that with N projects the max cache to keep all the projects hungry at all times is 0.4 * Deadline / N; I suggest a somewhat larger setting of Deadline / (N+2).
....


Can you explain why these magic numbers were chosen?


I can explain mine. I've never understood his, but I have tested his in practice with the results as described. It may be that his summarises experience rather than coming from theory.

The theory behind my rule of thumb, writing D for the deadline, N for the number of projects and C for the cache size (the connect interval): assume (contrary to fact) that we can vary D to keep out of No Work Fetch (NWF). We will drop this assumption later.

The No Work Fetch algorithm is designed to make sure a deadline is not exceeded in the worst case where a task completes a zillisecond after a connect, and when the next connect is the full interval away. So there needs to be a gap of C to allow for that.

There needs to be a gap of C to fit in a cachefull of new work we are about to download, as otherwise the client will refuse to ask for work, so we allow another C for that.

Then I allow a full cache for every project. So we would expect to be in NWF if D < (N+2) * C. To stay out of NWF we want D > (N+2) * C.

Now it is time to drop the false assumption that we can vary D. In fact D and N are givens, and C is the value we can control. Rearranging, as an algebra exercise, gives C < D / (N+2).
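Written out step by step, with C the cache size in days, D the deadline and N the number of projects, the argument above amounts to the following (the labels under the braces are just shorthand for the three allowances described):

```latex
\begin{align*}
\underbrace{C}_{\text{worst-case gap to next connect}}
  + \underbrace{C}_{\text{cacheful about to be downloaded}}
  + \underbrace{N\,C}_{\text{one full cache per project}} &< D \\
(N+2)\,C &< D \\
C &< \frac{D}{N+2}
\end{align*}
```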

Why then does my formula sometimes go into NWF when it is designed not to?

Well, the expected run times don't exactly fit the cache size, and the client will always go for the extra task that straddles the cache limit rather than stop short. Also a task may actually run for longer than it claimed. In either event the client goes into NWF, but only for a few hours as my formula puts us right on the margin.
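A toy illustration of the straddling effect, in Python. This is not the client's real scheduler, just a sketch of the behaviour described: keep adding whole tasks until the queued estimate reaches the cache size. The function name and the numbers are made up.

```python
def queued_hours(cache_hours: float, task_hours: float) -> float:
    """Queue whole tasks until the estimated work reaches the cache size.
    The last task is allowed to straddle the limit (assumed behaviour)."""
    queued = 0.0
    while queued < cache_hours:
        queued += task_hours
    return queued


# A 12 hour cache filled with tasks estimated at 5 hours each.
print(queued_hours(12, 5))  # 15.0
```

With these numbers the queue ends up three hours over the nominal cache, which is the sort of overshoot that can tip a setting sitting right on the margin into NWF for a while.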

And what the difference between your suggestion and his is designed to optimize?


Can't comment on the design - the practical outcome is that my formula dips briefly into NWF for a few hours quite often, and his doesn't. Whether John found his formula by experiment or theory I will leave him to say.

John's formula allows more slack than mine, and this is clearly related to the fact that his is better than mine at keeping out of NWF entirely.

Also, where does one find the value of Deadline?


Look at a task that has just been downloaded from a project, and subtract the current date from the deadline. Or look at a task that has not yet reported on the project website, and work out (deadline date) - (date sent).

On LHC it is currently just over 6.5 days, Rosetta currently 10 days (has been 7 and 14 in the last few months), Leiden 6 days -- but LHC and Rosetta do change from time to time so it is worth re-checking. As projects vary, D in the formula is the shortest of these.
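The same calculation as a short Python sketch. The dates are invented for illustration, and the per-project figures are the approximate ones quoted above (LHC rounded to 6.5 days), so re-check them before relying on the result.

```python
from datetime import datetime


def deadline_days(sent: datetime, deadline: datetime) -> float:
    """Deadline length in days: (deadline date) - (date sent)."""
    return (deadline - sent).total_seconds() / 86400


# Hypothetical task: sent at midnight on 2 Nov, due at noon on 8 Nov.
sent = datetime(2006, 11, 2, 0, 0)
due = datetime(2006, 11, 8, 12, 0)
print(deadline_days(sent, due))  # 6.5

# Approximate deadlines quoted in this post, in days.
project_deadlines = {"LHC": 6.5, "Rosetta": 10, "Leiden": 6}

# D in the formula is the shortest deadline across the attached projects.
shortest = min(project_deadlines.values())
print(shortest)  # 6
```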

And finally, is N the number of currently ACTIVE (not suspended) projects or total projects?


The number is the same, active = total. This is because the formulae apply only when you leave the suspend button alone.

Every time you suspend or unsuspend you break the long term debt / short term debt balance, and the client will behave oddly for a few days. You can drive BOINC hands on, or hands off, but need to decide to stick to one or the other for at least 2*D at a time. In particular if you use the suspend button to cure a NWF situation, you almost guarantee another in about C days' time.

Using either formula expect to see NWF a few times for the first 2*D till the client settles down to work the way it is meant to.

Hope that helps. I appreciate the level of detail in your questions, here and in other threads, which indicate that you are really engaging with the points made. R~~
ID: 15342 · Report as offensive     Reply Quote
Colin Porter

Send message
Joined: 14 Jul 05
Posts: 35
Credit: 71,636
RAC: 0
Message 15343 - Posted: 3 Nov 2006, 20:32:10 UTC

Got NO LHC work at all.

Why - Well I think it was because - After months of being on 24/7, I decided to switch off while I was away on holiday this last week.

Bloody typical - All that work and I missed it.
ID: 15343 · Report as offensive     Reply Quote
m.mitch

Send message
Joined: 4 Sep 05
Posts: 112
Credit: 1,864,470
RAC: 0
Message 15346 - Posted: 4 Nov 2006, 1:17:25 UTC - in response to Message 15343.  
Last modified: 4 Nov 2006, 1:19:15 UTC

Got NO LHC work at all.

Why - Well I think it was because - After months of being on 24/7, I decided to switch off while I was away on holiday this last week.

Bloody typical - All that work and I missed it.


Colin, here are my two rules to give anyone the best chance at getting work units from any project:

(1) set the cache to 2 hours. (1 day / 12 = 0.0833333333333333333333333333333333334 days) It fits.
(2) leave the PC's attached to the project.

This way, you're banging on the project door every 2 hours and saying, "Are we there yet?" ;-)
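If you want a different interval, the conversion to days is simple arithmetic. A tiny Python sketch; the helper name is made up, and it assumes the cache setting is entered in days, as the figure above implies:

```python
def hours_to_days(hours: float) -> float:
    """Convert a cache interval in hours to a value in days."""
    return hours / 24


# 2 hours expressed in days, rounded to 4 decimal places.
print(round(hours_to_days(2), 4))  # 0.0833
```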

The response is binary!




Click here to join the #1 Aussie Alliance on LHC.
ID: 15346 · Report as offensive     Reply Quote
Philip Martin Kryder

Send message
Joined: 21 May 06
Posts: 73
Credit: 8,710
RAC: 0
Message 15348 - Posted: 4 Nov 2006, 3:34:31 UTC - in response to Message 15342.  
Last modified: 4 Nov 2006, 3:37:59 UTC

....
Hope that helps. I appreciate the level of detail in your questions, here and in other threads, which indicate that you are really engaging with the points made. R~~


Well thank you for the kind words and the effort to explain.
A lot to contemplate.

Is it correct that Resource share is not considered at all in your formulas?


I have several projects.
One of them, SZTAKI seems not well behaved.
Sometimes the WUs don't complete in anywhere near the estimated time.
Sometimes they seem to not complete at all.
So I suspend it "a lot" and run it every few weeks to see if it is doing any better.




ID: 15348 · Report as offensive     Reply Quote
Philip Martin Kryder

Send message
Joined: 21 May 06
Posts: 73
Credit: 8,710
RAC: 0
Message 15349 - Posted: 4 Nov 2006, 3:38:59 UTC - in response to Message 15346.  
Last modified: 4 Nov 2006, 3:41:01 UTC

....
(1) set the cache to 2 hours. (1 day / 12 = 0.0833333333333333333333333333333333334 days) It fits.
....


Why 2 hours instead of, say, about 70 minutes (0.05 days)?

ID: 15349 · Report as offensive     Reply Quote