Message boards : Number crunching : Initial Replication
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · 4 · 5 . . . 7 · Next

Author · Message
Profile [AF>Futura Sciences>Linux] Thr...

Send message
Joined: 6 Mar 07
Posts: 8
Credit: 31,454
RAC: 0
Message 17645 - Posted: 31 Jul 2007, 14:49:04 UTC - in response to Message 17639.  
Last modified: 31 Jul 2007, 14:50:25 UTC

As an example I give you a wu from Einstein

Einstein unit

basically the WU was initially issued on 8th June but did not reach quorum until 27th July!! Enough said?

The project admins/scientists are the only people who have the whole picture; we who contribute our resources can only trust that they are choosing the correct
parameters. That's not to say we cannot suggest what we think is better. Einstein can afford this kind of thing occasionally, LHC can't.

That's wrong. The quorum was reached on the 16th, when the third host sent its result. So it took 9 days to complete the quorum.

For LHC, IR=5 doesn't only create wasted calculations.

If there is a 20-25% compute error rate, there is one question to ask: are those errors spread across all the WUs, or localised to some "hard to compute" WUs?

If they are only on some hard-to-compute WUs, then Alex's calculation is the right point of view, because you can quickly accumulate a lot of compute errors on one WU. If IR were 3 and you got 2 errors, you'd need to resend 2; you end up with the 5 replications anyway, plus a lot of extra delay. Can LHC afford that?

If not, then it's a question of turnaround speed. As the project administrator said, we get work when the scientists need it. Implicitly, he's saying that the WUs need to be sent back within a short delay. That's also why they can't / don't want to send work 24/7.

Sure, there is wasted computing time. But it seems that is the price to pay for this project.
------
Thrr-Gilag Kee'rr

L'Alliance Francophone
ID: 17645 · Report as offensive     Reply Quote
John McLeod VII
Avatar

Send message
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 17646 - Posted: 31 Jul 2007, 14:58:19 UTC - in response to Message 17640.  

Einstein can afford this kind of thing occasionally, LHC can't.


Thanks for the opinion but I'll wait for LHC to confirm that statement before I believe it.

LHC has people waiting on the results to get physical work done (aligning the magnets); Einstein does not. Since LHC has real-world deadlines, LHC has a problem with the maximum turnaround time of the WUs, and therefore a higher initial replication is required. Einstein and S@H, on the other hand, are trying to get through a huge pile of work, and that leads to an initial replication matching the minimum quorum.

This is documented in earlier threads if you care to go looking (try a couple of years ago).


BOINC WIKI
ID: 17646 · Report as offensive     Reply Quote
Profile DarkWaterSong

Send message
Joined: 5 Aug 05
Posts: 9
Credit: 3,991,070
RAC: 369
Message 17649 - Posted: 31 Jul 2007, 17:07:55 UTC

If you feel this project is such a waste of your time, there are other projects that you apparently don't have issues with. Why not just stick with those? That is normally what a person does when they don't like something: they move on to something else.

I'm looking at the tasks this way: There have been multiple migration issues and the configs need to be tested. So if I am just running results to help with testing the hardware, I am fine with it. The high number of replicated tasks could easily be to check for patterns of errors.
ID: 17649 · Report as offensive     Reply Quote
larry1186

Send message
Joined: 4 Oct 06
Posts: 38
Credit: 24,908
RAC: 0
Message 17652 - Posted: 31 Jul 2007, 20:41:37 UTC - in response to Message 17647.  

...why make other work wait while you waste time crunching results that don't need to be crunched?


Even with the initial replication of 5, I don't see ANY work waiting on the servers here, since it all gets distributed so fast. Or are you referring to the work on your host? Sure, somebody's host will be taking time away from some other project. At what level are you trying to optimize the ratio of useful results to work done? Considering only LHC, with all of its odd work availability, it's OK to have a higher replication to reach quorum faster, since it is unknown how fast certain hosts will be, and there are FAR more hosts available than there is work. Taking a step back and looking at all projects, LHC could be seen as a "resource hog" (not my own view), but that is what the LHC admins/scientists have decided they need in order to complete their task, which is to build a damn sweet machine. Taking a step in and looking at a single host, a result that is not needed for quorum would be wasted, redundant effort. Which is exactly why the new feature that Scarecrow mentioned came to fruition. Draw your line and take your stance.
ID: 17652 · Report as offensive     Reply Quote
Bob Guy

Send message
Joined: 28 Sep 05
Posts: 21
Credit: 11,715
RAC: 0
Message 17655 - Posted: 1 Aug 2007, 1:25:33 UTC
Last modified: 1 Aug 2007, 1:32:08 UTC

No one yet has mentioned an interesting coincidental phenomenon (or I've missed it).

The WUs will meet quorum (of 3) faster on average if the initial replication is 5. There is a much higher probability that 3 fast machines will get a particular WU (and complete it) when 5 rIDs are sent out.

Had the initial replication been 3, the probability is greater that one or two or all three of the computers would be slow ones thereby delaying the completion of quorum.

My conclusion is that the quorum is completed faster with 5 WUs than with 3 WUs.
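
The order-statistics argument above can be checked with a quick Monte Carlo sketch. Everything here is an illustrative assumption (exponentially distributed host turnaround times, unit mean), not measured LHC@home behaviour: with n initial replicas, a quorum of 3 is complete when the 3rd-fastest host reports back.

```python
import random

# Illustrative sketch only: host turnaround times are ASSUMED to be
# exponentially distributed. With n initial replicas, a quorum of 3 is
# complete when the 3rd-fastest host returns (the 3rd order statistic).
def mean_quorum_time(n_replicas, quorum=3, trials=100_000):
    rng = random.Random(42)
    total = 0.0
    for _ in range(trials):
        times = sorted(rng.expovariate(1.0) for _ in range(n_replicas))
        total += times[quorum - 1]  # arrival time of the quorum-th result
    return total / trials

print(f"IR=3: {mean_quorum_time(3):.2f}  IR=5: {mean_quorum_time(5):.2f}")
```

Under these assumptions the IR=5 average is well under half the IR=3 average (the exact expectations are 1/3 + 1/2 + 1 ≈ 1.83 versus 1/5 + 1/4 + 1/3 ≈ 0.78), which matches the conclusion above.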

It would be an improvement (in efficiency) if the server software could abort WUs whose results became redundant that hadn't already started on some of those slow computers. I'm also sure the owners of those computers might be distressed that some of their WUs were aborted and they lost the credit - this only being a problem with so little work available. The 5.10.x Boinc client can do this (aborting WUs) but it also requires a server upgrade, and I'm not sure all the bugs are out of the 5.10.xx client.
ID: 17655 · Report as offensive     Reply Quote
Profile Alex

Send message
Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 17656 - Posted: 1 Aug 2007, 1:57:32 UTC - in response to Message 17655.  

No one yet has mentioned an interesting coincidental phenomenon (or I've missed it).

The WUs will meet quorum (of 3) faster on average if the initial replication is 5. There is a much higher probability that 3 fast machines will get a particular WU (and complete it) when 5 rIDs are sent out.

Had the initial replication been 3, the probability is greater that one or two or all three of the computers would be slow ones thereby delaying the completion of quorum.

My conclusion is that the quorum is completed faster with 5 WUs than with 3 WUs.

It would be an improvement (in efficiency) if the server software could abort WUs whose results became redundant that hadn't already started on some of those slow computers. I'm also sure the owners of those computers might be distressed that some of their WUs were aborted and they lost the credit - this only being a problem with so little work available. The 5.10.x Boinc client can do this (aborting WUs) but it also requires a server upgrade, and I'm not sure all the bugs are out of the 5.10.xx client.



Another feature of an IR of 5 is that, in cases where the Intel results don't match the AMD results, an initial group of 5 drawn from those two PC types must contain at least 3 of one type and at most 2 of the other.
If you had an IR of 4, and two Intel PCs gave results different from two AMD PCs, you'd get a condition where you cannot validate the WU until you send out more results.
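
This is a pigeonhole argument, and it can be sketched as follows. The validation rule ("3 matching results", Intel never matching AMD) and the platform labels are assumptions for illustration only:

```python
from itertools import product

# Assumed validation rule for this sketch: a WU validates once 3 results
# agree, and Intel results never match AMD results.
def can_validate(platforms, quorum=3):
    return any(platforms.count(p) >= quorum for p in set(platforms))

# IR=4 admits a 2-vs-2 deadlock: no quorum until more results are issued.
assert not can_validate(["intel", "intel", "amd", "amd"])

# IR=5 cannot deadlock this way: any mix of 5 results from two platforms
# necessarily contains at least 3 of one platform (pigeonhole principle).
assert all(can_validate(mix) for mix in product(["intel", "amd"], repeat=5))
print("every 5-way Intel/AMD mix contains 3 of one platform")
```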


I'm not the LHC Alex. Just a number cruncher like everyone else here.
ID: 17656 · Report as offensive     Reply Quote
zombie67 [MM]
Avatar

Send message
Joined: 24 Nov 06
Posts: 76
Credit: 7,914,481
RAC: 27,114
Message 17659 - Posted: 1 Aug 2007, 3:08:05 UTC - in response to Message 17593.  
Last modified: 1 Aug 2007, 3:11:20 UTC

There's a mathematical reason that there's 5 sent out instead of 4.
20-25% of the crunching instances have errors. (I looked at the most recent work units that MY pc crunched, and there were 5 out of 27 client errors reported by all pc's crunching)


I'm going to need to challenge those numbers.

I looked at the 225 results I returned in July. There are still a handful pending, but all the rest validated. 0 errors.

Also, I looked at my last 20 WUs, and there were only 2 errors out of the 100 results initially issued.

Now, I do not doubt that the numbers you quoted are 100% accurate. But I do not think that is a normal error rate.

If the error rate was that bad, then there is a problem with the application, the WUs, or both.
Dublin, California
Team: SETI.USA

ID: 17659 · Report as offensive     Reply Quote
Bob Guy

Send message
Joined: 28 Sep 05
Posts: 21
Credit: 11,715
RAC: 0
Message 17661 - Posted: 1 Aug 2007, 3:35:09 UTC

I think there is some concern that the AMD results don't match or might not match the Intel results. That's still a valid reason to get as many WUs run that have both AMD and Intel results represented for a single WU.
ID: 17661 · Report as offensive     Reply Quote
Profile Alex

Send message
Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 17662 - Posted: 1 Aug 2007, 4:05:43 UTC - in response to Message 17659.  



I'm going to need to challenge those numbers.

I looked at the 225 results I returned in july. There are still a handful still pending, but all the rest validated. 0 Errors.

Also, I looked at my last 20 WUs, and there were only 2 Errors out of the 100 results initially issued.



Looking at a single CPU isn't representative of the whole population.
My last 20 WUs got credit as well.


Now, I do not doubt that the numbers you quoted are 100% accurate. But I do not think that is a normal error rate.

If the error rate was that bad, then there is a problem with the application, the WUs, or both.


When I looked at the most recent WU's my pc crunched, I saw errors.
http://lhcathome.cern.ch/lhcathome/workunit.php?wuid=1692507 - no errors. 5 crunchers.
http://lhcathome.cern.ch/lhcathome/workunit.php?wuid=1584631 - 3 no replies. 8 crunchers.
http://lhcathome.cern.ch/lhcathome/workunit.php?wuid=1601064 - 1 no reply. 5 crunchers.
http://lhcathome.cern.ch/lhcathome/workunit.php?wuid=1594030 - 1 client error. 6 crunchers.

http://lhcathome.cern.ch/lhcathome/workunit.php?wuid=1592979 - 1 client error/compute error. 6 crunchers.

6 errors, 30 crunching instances. So, 20% with that sample.
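
Taking that 20% figure at face value, a back-of-envelope binomial calculation (assuming independent failures, which is an idealisation) shows what such a rate would imply for reaching a quorum of 3 within the initial batch:

```python
from math import comb

# Probability that at least `quorum` of n results succeed, assuming each
# result fails independently with probability p_err (an assumption).
def p_quorum(n, p_err=0.20, quorum=3):
    p_ok = 1.0 - p_err
    return sum(comb(n, k) * p_ok**k * p_err**(n - k)
               for k in range(quorum, n + 1))

print(f"IR=3: {p_quorum(3):.3f}")  # all 3 must succeed: 0.8**3 = 0.512
print(f"IR=5: {p_quorum(5):.3f}")  # roughly 0.942
```

So at a 20% error rate, an IR of 3 would leave nearly half of all WUs needing resends, while an IR of 5 would close the quorum in one pass about 94% of the time.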






I'm not the LHC Alex. Just a number cruncher like everyone else here.
ID: 17662 · Report as offensive     Reply Quote
zombie67 [MM]
Avatar

Send message
Joined: 24 Nov 06
Posts: 76
Credit: 7,914,481
RAC: 27,114
Message 17665 - Posted: 1 Aug 2007, 5:59:07 UTC - in response to Message 17662.  
Last modified: 1 Aug 2007, 6:06:05 UTC


I'm going to need to challenge those numbers.

I looked at the 225 results I returned in July. There are still a handful pending, but all the rest validated. 0 errors.

Also, I looked at my last 20 WUs, and there were only 2 errors out of the 100 results initially issued.



Looking at a single CPU isn't representative of the whole population.
My last 20 WUs got credit as well.


I don't get your point. If I wasn't clear: those 225 results were spread across 12 different machines, two of which are AMD and the rest various flavors of Pentium D, C2D, and Xeon (both Netburst- and C2D-based).

When I looked at the most recent WU's my pc crunched, I saw errors.
http://lhcathome.cern.ch/lhcathome/workunit.php?wuid=1692507 - no errors. 5 crunchers.
http://lhcathome.cern.ch/lhcathome/workunit.php?wuid=1584631 - 3 no replies. 8 crunchers.
http://lhcathome.cern.ch/lhcathome/workunit.php?wuid=1601064 - 1 no reply. 5 crunchers.
http://lhcathome.cern.ch/lhcathome/workunit.php?wuid=1594030 - 1 client error. 6 crunchers.

http://lhcathome.cern.ch/lhcathome/workunit.php?wuid=1592979 - 1 client error/compute error. 6 crunchers.

6 errors, 30 crunching instances. So, 20% with that sample.


Like I said, I don't doubt your numbers. I am saying that I doubt that is normal.
Dublin, California
Team: SETI.USA

ID: 17665 · Report as offensive     Reply Quote
Bob Guy

Send message
Joined: 28 Sep 05
Posts: 21
Credit: 11,715
RAC: 0
Message 17667 - Posted: 1 Aug 2007, 7:19:03 UTC - in response to Message 17666.  
Last modified: 1 Aug 2007, 7:23:17 UTC

Your observation is absolutely true.

Not efficient? Probably true.

Why do we do it the way we're doing it? Probably because it's easier to implement.

Are you saying that it's normal for you to doubt? (A deliberate misconstruction by me. I'm just teasing you!)
ID: 17667 · Report as offensive     Reply Quote
Profile [AF>Futura Sciences>Linux] Thr...

Send message
Joined: 6 Mar 07
Posts: 8
Credit: 31,454
RAC: 0
Message 17668 - Posted: 1 Aug 2007, 7:59:05 UTC

LHC has real work to do, and deadlines to respect. Sure, you can find more efficient ways to do this work. But how much time would they lose developing and testing such a new application?

Right now there is wasted CPU time. But I think it's cheaper to lose CPU time than to spend the thinking time of scientists who have to meet strict deadlines.

Don't you think so?
------
Thrr-Gilag Kee'rr

L'Alliance Francophone
ID: 17668 · Report as offensive     Reply Quote
Mike Dunn

Send message
Joined: 15 Jul 05
Posts: 9
Credit: 770,253
RAC: 0
Message 17672 - Posted: 1 Aug 2007, 12:41:20 UTC - in response to Message 17657.  
Last modified: 1 Aug 2007, 12:51:24 UTC

I am assuming that one day in the not too distant future there will be a steady supply of LHC work or that the runs will become so big (or hosts so few) they won't be gobbled up in a few hours. I assume the short runs we see now are just test runs.

Really?

You mean, we're NOT helping them work out alignments for all the hardware they are building? That this is just practice for something else? What is this 'something else', oh Great One, that you imply?

Get real. Read about the aims of this project - or don't you consider looking up historical questions & answers important?

They need the data NOW because they are modifying the collider systems NOW. Not in a few months or years - NOW. They make available bunches of WUs to deal with questions/scenarios that they need data on to make adjustments BEFORE they fire the collider up - I'd much rather they did this than fire it up & it either fails totally, or blows a massive hole in the ground! If this means they send out 5 WUs instead of 3 or 4 - fine by me. The scientists get their data back quicker, which is what they need. Remember - this project is NOT for us; we just help out here.

Will we get WUs once the project is finished? I doubt it - why would they need it, once the system is up & running? Will we get WUs for a different project? The hints are that we may. Will it be a lot of WUs? Again, I doubt it.

So - if you don't like the way things are run, why not create your own project? No-one is forcing you to stay here, after all.

{Edit}

To save you the hassle of actually looking for information, have a read of this :

In March 2005, huge superconducting dipole magnets began to be installed in the LHC tunnel. Every time a new magnet is installed, measurements are made of its properties. If it deviates significantly from the specified values, SixTrack will be required to study what impact, if any, this difference might have on the operations of the machine. Getting the results as soon as possible makes a big difference for the engineers installing the thousands of magnets (1232 dipole magnets alone). A lot of number-crunching will be needed over the coming years to do this critical analysis. So your participation in LHC@home really does help to build the LHC!

Source is within this website, in the About section


ID: 17672 · Report as offensive     Reply Quote
Betting Slip

Send message
Joined: 17 Sep 04
Posts: 41
Credit: 27,497
RAC: 0
Message 17673 - Posted: 1 Aug 2007, 13:23:44 UTC - in response to Message 17672.  
Last modified: 1 Aug 2007, 13:32:12 UTC

I am assuming that one day in the not too distant future there will be a steady supply of LHC work or that the runs will become so big (or hosts so few) they won't be gobbled up in a few hours. I assume the short runs we see now are just test runs.

Really?

You mean, we're NOT helping them work out alignments for all the hardware they are building? That this is just practice for something else? What is this 'something else', oh Great One, that you imply?

Get real. Read about the aims of this project - or don't you consider looking up historical questions & answers important?

They need the data NOW because they are modifying the collider systems NOW. Not in a few months or years - NOW. They make available bunches of WUs to deal with questions/scenarios that they need data on to make adjustments BEFORE they fire the collider up - I'd much rather they did this than fire it up & it either fails totally, or blows a massive hole in the ground! If this means they send out 5 WUs instead of 3 or 4 - fine by me. The scientists get their data back quicker, which is what they need. Remember - this project is NOT for us; we just help out here.

Will we get WUs once the project is finished? I doubt it - why would they need it, once the system is up & running? Will we get WUs for a different project? The hints are that we may. Will it be a lot of WUs? Again, I doubt it.

So - if you don't like the way things are run, why not create your own project? No-one is forcing you to stay here, after all.

{Edit}

To save you the hassle of actually looking for information, have a read of this :

In March 2005, huge superconducting dipole magnets began to be installed in the LHC tunnel. Every time a new magnet is installed, measurements are made of its properties. If it deviates significantly from the specified values, SixTrack will be required to study what impact, if any, this difference might have on the operations of the machine. Getting the results as soon as possible makes a big difference for the engineers installing the thousands of magnets (1232 dipole magnets alone). A lot of number-crunching will be needed over the coming years to do this critical analysis. So your participation in LHC@home really does help to build the LHC!

Source is within this website, in the About section



I can't understand why you bothered to put fingers to keyboard to post that completely unhelpful diatribe.

This thread is not about whether this work needs doing or not, it's about the waste of FINITE donated resources!

Dagorath has posted some very compelling arguments in this thread, and you respond with this.

ID: 17673 · Report as offensive     Reply Quote
Profile AstralWalker

Send message
Joined: 30 Nov 05
Posts: 14
Credit: 1,746,819
RAC: 0
Message 17676 - Posted: 1 Aug 2007, 18:40:07 UTC - in response to Message 17657.  

I am assuming that one day in the not too distant future there will be a steady supply of LHC work or that the runs will become so big (or hosts so few) they won't be gobbled up in a few hours.

While I haven't been with the project that long I don't remember this being true in the past and I wouldn't hold my breath that it will ever be true.
ID: 17676 · Report as offensive     Reply Quote


©2024 CERN