Thread 'I think we should restrict work units'

Author	Message
Gaspode the UnDressed Joined: 1 Sep 04 Posts: 506 Credit: 118,619 RAC: 0	Message 13785 - Posted: 29 May 2006, 18:05:10 UTC My word - we are all sensitive today. No offence intended here, and none taken. However, I will say this: If the cap fits, wear it! Happy crunching! Gaspode the UnDressed http://www.littlevale.co.uk

The Gas Giant Joined: 2 Sep 04 Posts: 309 Credit: 715,258 RAC: 0	Message 13786 - Posted: 29 May 2006, 23:28:56 UTC Last modified: 29 May 2006, 23:29:55 UTC If it was such a big problem for the project, then surely they would reduce the deadline. Reduce it down to 3, 4 or 5 days and caches will be smaller by default. The great wu snap up might then only occur on the 4th or 5th day, resulting in wu's not coming back until 10 to 11 days after the first ones were issued. So this would be no benefit for the project. Releasing all the wu's on day 1 with the same 7 day deadline means that they should be received within 8 to 9 days of them first being released - noting that the project cache takes 48hrs to run down. If it is such a big problem for the project (and it really doesn't appear to be) then the best way to release the wu's is in smaller lots of say 20,000 results (that's 4,000 wu's) per day. Oh hang on...the project would be worse off. It would take 6 days to release all the wu's and if they had a 5 day deadline that might mean the last ones aren't returned until 11 days after the first ones are released. Also remember that the project does not "care" about us volunteers; we are purely a resource. If the wu's are getting completed to the requirements of the project then why should they care if some people do not get the number of wu's they want? This is only a problem for the volunteers because people's imaginations have been piqued by participating in a big engineering/physics project, so they want to crunch this project as much as possible, and wu's are limited. Also don't forget that Chrulle worked on the "best" way to ensure the wu's are returned the quickest overall. He appears to have hit on it. Live long and crunch (if you've got 'em). Paul (S@H1 8888) BOINC/SAH BETA
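
The release arithmetic in the post above can be sketched in a few lines. This is only an illustration under a simplifying assumption: the last result comes back no later than the day the last work unit is handed out plus the deadline. The function name is hypothetical, and the day counts come from the figures in the post (48 hours for the project cache to run down, 6 days to release the work in daily lots), not from the project's actual scheduler.

```python
def worst_case_return_day(days_to_hand_out, deadline_days):
    """Latest day (counted from first release) a result can come back.

    Assumes the last work unit is handed out on day `days_to_hand_out`
    and is returned right at its deadline.
    """
    return days_to_hand_out + deadline_days


# Everything released at once: the cache drains in about 2 days, 7 day deadline.
all_at_once = worst_case_return_day(days_to_hand_out=2, deadline_days=7)

# Released in daily lots over 6 days, with a shorter 5 day deadline.
staggered = worst_case_return_day(days_to_hand_out=6, deadline_days=5)

print("all at once:", all_at_once, "days")
print("staggered:  ", staggered, "days")
```

Under these assumptions the staggered release finishes later even with the shorter deadline, which is the point the post makes.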

m.mitch Joined: 4 Sep 05 Posts: 112 Credit: 2,371,477 RAC: 3,414	Message 13790 - Posted: 30 May 2006, 16:06:12 UTC - in response to Message 13785. My word - we are all sensitive today. No offence intended here, and none taken. However, I will say this: If the cap fits, wear it! Happy crunching! I don't mind wearing it ;-) But I don't see the problem with people who crunch projects for the credits or any other reason, so long as we have fun. I enjoy the whole thing. The team, the credits, the science, making new friends and sharing ideas across so many boundaries, and it helps someone. Perhaps many someones. And it keeps my computers off the street ;-) Click here to join the #1 Aussie Alliance on LHC.

m.mitch Joined: 4 Sep 05 Posts: 112 Credit: 2,371,477 RAC: 3,414	Message 13792 - Posted: 30 May 2006, 16:28:05 UTC I think you've got the right idea there, Gas Giant. If there was a problem, the project would have changed something by now. I don't know that the staff don't care about us; I think Chrulle said they are between grad students at the moment. It may take a while for things to show up, but one of the main reasons I liked LHC so much was that project staff were involved with the message boards. So that's how you spell "piqued", I'll have to keep a copy of that ;-). Click here to join the #1 Aussie Alliance on LHC.

Philip Martin Kryder Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0	Message 13804 - Posted: 1 Jun 2006, 3:41:51 UTC Does anyone think that the reason the initial replication is 5 while the quorum is only 3 is to generate extra work for all the work hungry volunteers?

The Gas Giant Joined: 2 Sep 04 Posts: 309 Credit: 715,258 RAC: 0	Message 13805 - Posted: 1 Jun 2006, 4:21:07 UTC - in response to Message 13804. Does anyone think that the reason the initial replication is 5 while the quorum is only 3 is to generate extra work for all the work hungry volunteers? I would have thought a replication of 4 would have been sufficient.

Gaspode the UnDressed Joined: 1 Sep 04 Posts: 506 Credit: 118,619 RAC: 0	Message 13806 - Posted: 1 Jun 2006, 5:21:43 UTC - in response to Message 13804. Does anyone think that the reason the initial replication is 5 while the quorum is only 3 is to generate extra work for all the work hungry volunteers? The five/three ratio is to improve the chances of getting a quorum at the first attempt. It's down to SixTrack's extreme sensitivity to numerical accuracy. In even the most solid computer there can be the occasional single bit error that will throw the result off. Sending five results should improve the chance of reaching a quorum, and so reduce the completion time for the study. From what I see on the results pages, most results reach quorum at three, so a replication of five is redundant. I'd like to know if the fourth and fifth results are still issued if a quorum has already been reached. Gaspode the UnDressed http://www.littlevale.co.uk
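
To put rough numbers on the five/three argument, here is a minimal sketch. It assumes each returned result is independently valid with probability `p_valid`, that all valid results agree with each other, and that erroneous results never match one another. The value 0.9 is purely hypothetical and is not a measured LHC@home figure; the function name is also made up for illustration.

```python
from math import comb


def prob_quorum_first_round(n_sent, quorum, p_valid):
    """Chance that at least `quorum` of `n_sent` results are valid.

    Binomial tail: sum over k = quorum..n_sent of
    C(n_sent, k) * p_valid^k * (1 - p_valid)^(n_sent - k).
    """
    return sum(
        comb(n_sent, k) * p_valid**k * (1 - p_valid) ** (n_sent - k)
        for k in range(quorum, n_sent + 1)
    )


P_VALID = 0.9  # hypothetical per-result success rate

for n_sent in (3, 4, 5):
    chance = prob_quorum_first_round(n_sent, quorum=3, p_valid=P_VALID)
    print(f"replication {n_sent}: quorum of 3 on first round with p = {chance:.4f}")
```

Under these assumptions, each extra copy sent raises the chance of reaching a quorum of three without a resend, at the cost of redundant crunching whenever the first three results already agree.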

Alex Joined: 2 Sep 04 Posts: 378 Credit: 10,765 RAC: 0	Message 13807 - Posted: 1 Jun 2006, 5:41:47 UTC - in response to Message 13805. I would have thought a replication of 4 would have been sufficient. It cuts down significantly on the number of times they have to send the work unit back out to be crunched. Unlike Seti or Climate prediction, this project has a higher number of results that don't verify against other results for various reasons. I'm not the LHC Alex. Just a number cruncher like everyone else here.

John Hunt Joined: 13 Jul 05 Posts: 133 Credit: 162,641 RAC: 0	Message 13808 - Posted: 1 Jun 2006, 6:29:01 UTC - in response to Message 13790. .............. so long as we have fun. I enjoy the whole thing. The team, the credits, the science, making new friends and sharing ideas across so many boundaries, and it helps someone. Perhaps many someones. And it keeps my computers off the street ;-) Sums it up in a nutshell for me........ Well said, Mike!

Philip Martin Kryder Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0	Message 13809 - Posted: 1 Jun 2006, 7:22:02 UTC - in response to Message 13806. Does anyone think that the reason the initial replication is 5 while the quorum is only 3 is to generate extra work for all the work hungry volunteers? The five/three ratio is to improve the chances of getting a quorum at the first attempt. It's down to SixTrack's extreme sensitivity to numerical accuracy. In even the most solid computer there can be the occasional single bit error that will throw the result off. Sending five results should improve the chance of reaching a quorum, and so reduce the completion time for the study. From what I see on the results pages, most results reach quorum at three, so a replication of five is redundant. I'd like to know if the fourth and fifth results are still issued if a quorum has already been reached. What do you think the probability is of a single bit (or any other) error causing the same incorrect answer in even TWO of the three members of the quorum?

Gaspode the UnDressed Joined: 1 Sep 04 Posts: 506 Credit: 118,619 RAC: 0	Message 13810 - Posted: 1 Jun 2006, 8:17:37 UTC - in response to Message 13809. What do you think the probability is of a single bit (or any other) error causing the same incorrect answer in even TWO of the three members of the quorum? Extremely small, I'd guess. SixTrack suffers from the single-bit sensitivity because of the way it handles its numbers, and the fact that it does the operations repeatedly. A single bit error in the first iteration of an algorithm will generate a different erroneous result than the same error occurring at, say, iteration 500,000. Given that a single bit problem can creep in potentially anywhere (and anywhen), the chances of two different computers generating the same incorrect result are vanishingly small. The same can't be said of the same computer running the same unit twice, however. It is possible that some sort of systematic failure could generate consistent errors at consistent points in the algorithm. Such a computer would probably never generate a valid LHC result, although it might work perfectly well in every other regard. Gaspode the UnDressed http://www.littlevale.co.uk
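
The point about the same bit error giving different wrong answers depending on when it strikes can be illustrated with a toy example. This is not SixTrack: it is a sketch that uses the logistic map as a stand-in for any sensitive iterated calculation, and the helper names, the starting value, the step count, and the choice of bit 20 are all hypothetical.

```python
import struct


def flip_bit(x, bit):
    """Return the double `x` with one bit of its 64-bit pattern flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def iterate(x0, steps, flip_at=None, bit=20):
    """Run the logistic map x -> 3.9 * x * (1 - x) for `steps` iterations.

    If `flip_at` is given, a single bit of x is flipped just before that
    iteration, simulating a one-off hardware error.
    """
    x = x0
    for i in range(steps):
        if i == flip_at:
            x = flip_bit(x, bit)
        x = 3.9 * x * (1.0 - x)
    return x


clean = iterate(0.3, 200)
early_error = iterate(0.3, 200, flip_at=0)
late_error = iterate(0.3, 200, flip_at=150)

print("no error:         ", clean)
print("error at step 0:  ", early_error)
print("error at step 150:", late_error)
```

The three runs end on three different values, so two machines hit by an error at different moments would be very unlikely to agree on the same wrong result, while a rerun with no error reproduces the clean value exactly.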

m.mitch Joined: 4 Sep 05 Posts: 112 Credit: 2,371,477 RAC: 3,414	Message 13813 - Posted: 1 Jun 2006, 15:08:03 UTC - in response to Message 13808. .............. so long as we have fun. I enjoy the whole thing. The team, the credits, the science, making new friends and sharing ideas across so many boundaries, and it helps someone. Perhaps many someones. And it keeps my computers off the street ;-) Sums it up in a nutshell for me........ Well said, Mike! I thought so too. Thank you 8-) Click here to join the #1 Aussie Alliance on LHC.

Philip Martin Kryder Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0	Message 13821 - Posted: 2 Jun 2006, 6:59:29 UTC - in response to Message 13810. What do you think the probability is of a single bit (or any other) error causing the same incorrect answer in even TWO of the three members of the quorum? Extremely small, I'd guess. SixTrack suffers from the single-bit sensitivity because of the way it handles its numbers, and the fact that it does the operations repeatedly. A single bit error in the first iteration of an algorithm will generate a different erroneous result than the same error occurring at, say, iteration 500,000. Given that a single bit problem can creep in potentially anywhere (and anywhen), the chances of two different computers generating the same incorrect result are vanishingly small. The same can't be said of the same computer running the same unit twice, however. It is possible that some sort of systematic failure could generate consistent errors at consistent points in the algorithm. Such a computer would probably never generate a valid LHC result, although it might work perfectly well in every other regard. For what it is worth, I have error detecting and correcting memory on my machine. I wonder how typical that is anymore... One of the LHC discussions mentioned the development of libraries that were able to return consistent results on different machines. If those libraries are used, then it seems a quorum of 2 with replication of 3 would suffice. But, since the computer resource is "free" and folk often clamor for "more work," it probably leads to higher quorums and higher initial replications. Has there been any discussion of giving "bonus points" for work units that are finished "quickly"? It would seem this would be useful when errors from the initial replication group necessitated the resending of workunits closer to the deadline...

m.mitch Joined: 4 Sep 05 Posts: 112 Credit: 2,371,477 RAC: 3,414	Message 13828 - Posted: 2 Jun 2006, 10:34:58 UTC - in response to Message 13821. .... [snip] ..... For what it is worth, I have error detecting and correcting memory on my machine. I wonder how typical that is anymore... Common on all servers of all sizes. Click here to join the #1 Aussie Alliance on LHC.

Philip Martin Kryder Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0	Message 13832 - Posted: 2 Jun 2006, 13:26:27 UTC - in response to Message 13828. .... [snip] ..... For what it is worth, I have error detecting and correcting memory on my machine. I wonder how typical that is anymore... Common on all servers of all sizes. Sure, but I meant: how common is it among the BOINC or LHC crunchers?

Philip Martin Kryder Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0	Message 13862 - Posted: 3 Jun 2006, 18:28:48 UTC - in response to Message 13376. I love LHC, and I realize it's different from the other BOINC projects in that it doesn't have continuous work to send out. It sends out work, and analyzes those results before sending out the next batch. I notice that this is slowed down by a minority of users who set their caches to maximum. When the number of work units available hits zero, we still have to wait a week or more while the people who grab a maximum number of units empty their cache before the scientists can even begin the analyzing process. That doesn't help the project - that's greed by people who want the most LHC units. When the number of available units hits zero, the scientists shouldn't have to wait more than a day or two. I suggest that the project limit the number of work units per computer to 2-3 at any given time. That way, as soon as all the work is sent out LHC will get them all back very soon after. Once a work unit is sent back, that computer can have another. This will speed up work-unit generation for all of us (my cache is set very low and every work unit I get is sent back within 12 hours, since I have other projects running too) since LHC scientists will get their work back faster and thus be able to create the next batch sooner. Matt - I want to thank you for taking the time to post this and start this thread. Prior to your having done so, I was having difficulty getting work units to run for LHC. Thanks to your clear explanation, I raised my cache from .01 to 10 days. And yup, as soon as there was work to do, I was able to get a bunch of it to work on. Again, thanks for your help in showing us how to get the maximum number of work units to process. Phil

John Hunt Joined: 13 Jul 05 Posts: 133 Credit: 162,641 RAC: 0	Message 13864 - Posted: 3 Jun 2006, 18:55:42 UTC - in response to Message 13862. I notice that this is slowed down by a minority of users who set their caches to maximum........ Thanks to your clear explanation, I raised my cache from .01 to 10 days. And yup, as soon as there was work to do, I was able to get a bunch of it to work on. Again, thanks for your help in showing us how to get the maximum number of work units to process. Phil I set my cache to 1 day right back from when I started BOINCing....... and I received half-a-dozen WUs on the last distribution of work....

Bob Guy Joined: 28 Sep 05 Posts: 21 Credit: 11,715 RAC: 0	Message 13865 - Posted: 3 Jun 2006, 21:07:38 UTC - in response to Message 13821. For what it is worth, I have error detecting and correcting memory on my machine. I think the one-bit errors do not originate in the memory, the errors originate in the FPU/SSE. It is a known fault of the AMD CPUs that the AMD FPU processes numbers differently (possibly less accurately) than the Intel FPU (this is usually overcome by proper program code). It is also a fact that overclocking can cause the FPU to be less accurate (the one-bit errors) for both AMD and Intel. One interesting and not well known feature of the FPU is that inside the FPU numbers are not represented as decimals as you might think. So, of course you say: they're binary! This is not true - the numbers are IEEE format for hardware design reasons. There are decimal numbers that cannot be represented exactly in IEEE format. The numbers are 'close enough' for most purposes and special code is usually implemented to minimize error - the usual process is by extending precision and using careful rounding. At any rate, the errors introduced by IEEE format can be exaggerated by one-bit errors at or near the limits of precision.
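
The remark about decimal numbers that cannot be represented exactly is easy to check. The sketch below uses Python floats, which are IEEE 754 double precision values on common platforms; note that IEEE 754 double precision is itself a binary floating-point format, so values such as 0.1 are stored as the nearest binary fraction. `math.fsum` stands in here for the "extended precision and careful rounding" approach mentioned in the post; it is an illustration, not what SixTrack or any FPU does internally.

```python
import math
from decimal import Decimal

# Decimal(float) shows the exact value the float actually holds.
print("0.1 is stored as:", Decimal(0.1))
print("0.5 is stored as:", Decimal(0.5))

# 0.5 is a power of two, so it is exact; 0.1 is not.
half_is_exact = Decimal(0.5) == Decimal("0.5")
tenth_is_exact = Decimal(0.1) == Decimal("0.1")
print("0.5 exact?", half_is_exact)
print("0.1 exact?", tenth_is_exact)

# Small representation errors accumulate under repeated operations.
naive_total = sum([0.1] * 10)

# fsum tracks the lost low-order bits and rounds only once at the end.
careful_total = math.fsum([0.1] * 10)

print("naive sum of ten 0.1s:  ", naive_total)
print("careful sum of ten 0.1s:", careful_total)
```

The naive sum drifts away from 1.0 while the carefully rounded sum does not, which is the kind of last-bit difference that a sensitive iterated calculation can then amplify.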

Dronak Joined: 19 May 06 Posts: 20 Credit: 297,111 RAC: 0	Message 13873 - Posted: 4 Jun 2006, 6:02:36 UTC - in response to Message 13862. I notice that this is slowed down by a minority of users who set their caches to maximum. When the number of work units available hits zero, we still have to wait a week or more while the people who grab a maximum number of units empty their cache before the scientists can even begin the analyzing process. That doesn't help the project - that's greed by people who want the most LHC units. Thanks to your clear explanation, I raised my cache from .01 to 10 days. And yup, as soon as there was work to do, I was able to get a bunch of it to work on. I'm sure MattDavis or someone will correct me if I'm wrong, but I thought the original post, quoted in part here, was saying that you shouldn't max out your cache. Doing that means you get a lot of work, true. But it also means that the work gets done slower because you're sitting on work that other people with a lower cache (getting work as they complete it) could be doing. Leaving some computers dry is not the best way to get work done promptly. It slows down the process and makes everyone wait longer to get more work. Wasn't that the whole point behind the original post and subject of limiting work units? To make sure that everyone gets a fair share, not to have some people hogging work for themselves while others' computers get left dry?

Philip Martin Kryder Joined: 21 May 06 Posts: 73 Credit: 8,710 RAC: 0	Message 13876 - Posted: 4 Jun 2006, 9:10:04 UTC - in response to Message 13873. I notice that this is slowed down by a minority of users who set their caches to maximum. When the number of work units available hits zero, we still have to wait a week or more while the people who grab a maximum number of units empty their cache before the scientists can even begin the analyzing process. That doesn't help the project - that's greed by people who want the most LHC units. Thanks to your clear explanation, I raised my cache from .01 to 10 days. And yup, as soon as there was work to do, I was able to get a bunch of it to work on. I'm sure MattDavis or someone will correct me if I'm wrong, but I thought the original post, quoted in part here, was saying that you shouldn't max out your cache. Doing that means you get a lot of work, true. But it also means that the work gets done slower because you're sitting on work that other people with a lower cache (getting work as they complete it) could be doing. Leaving some computers dry is not the best way to get work done promptly. It slows down the process and makes everyone wait longer to get more work. Wasn't that the whole point behind the original post and subject of limiting work units? To make sure that everyone gets a fair share, not to have some people hogging work for themselves while others' computers get left dry? Hmm - you mean that there may have been unintended consequences from starting this thread? Even so, I'm thankful for the idea.