Message boards : Number crunching : @Markku - homogeneous redundancy

Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4481 - Posted: 27 Oct 2004, 13:47:32 UTC

Markku

I just checked the scheduler source code. The homogeneous redundancy feature checks the processor type for the strings 'Intel', 'AMD' and 'Macintosh'. OS names checked are 'Windows', 'Linux', 'Darwin' and 'SunOS'. It seems likely that just turning on homogeneous redundancy will eliminate the sort of cross-platform inconsistencies you're getting at a stroke, and without setting up any new platforms.
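In outline the check amounts to something like the sketch below. To be clear, this is only my illustration of the idea; the type and function names are invented, not the actual scheduler source.

#include <string>

// Illustrative sketch of homogeneous redundancy classification --
// invented names, not the real BOINC scheduler code. Hosts are
// bucketed by CPU vendor string and OS name, and all results for a
// workunit go to hosts in a single bucket.
enum CpuClass { CPU_INTEL, CPU_AMD, CPU_MAC, CPU_UNKNOWN };
enum OsClass  { OS_WINDOWS, OS_LINUX, OS_DARWIN, OS_SUNOS, OS_UNKNOWN };

CpuClass classify_cpu(const std::string& vendor) {
    if (vendor.find("Intel") != std::string::npos)     return CPU_INTEL;
    if (vendor.find("AMD") != std::string::npos)       return CPU_AMD;
    if (vendor.find("Macintosh") != std::string::npos) return CPU_MAC;
    return CPU_UNKNOWN;
}

OsClass classify_os(const std::string& os_name) {
    if (os_name.find("Windows") != std::string::npos) return OS_WINDOWS;
    if (os_name.find("Linux") != std::string::npos)   return OS_LINUX;
    if (os_name.find("Darwin") != std::string::npos)  return OS_DARWIN;
    if (os_name.find("SunOS") != std::string::npos)   return OS_SUNOS;
    return OS_UNKNOWN;
}

// Two hosts may share a workunit only if both classifications match.
bool same_hr_class(const std::string& v1, const std::string& o1,
                   const std::string& v2, const std::string& o2) {
    return classify_cpu(v1) == classify_cpu(v2)
        && classify_os(o1) == classify_os(o2);
}

So a workunit first issued to an Intel/Windows host would only ever be re-issued to other Intel/Windows hosts.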

Rom Walton at Berkeley should be able to confirm this.

Sorry if you've been here already. I just hope this is useful.


Giskard - the first telepathic robot.


ID: 4481
Toby

Joined: 1 Sep 04
Posts: 137
Credit: 1,691,526
RAC: 26
Message 4496 - Posted: 27 Oct 2004, 17:52:04 UTC

Predictor was using this feature and it ended up biting them in the rear. For some reason their clients started erroring out on Gentoo Linux (and a couple of other distros, I think). So all connected hosts running Gentoo suddenly started downloading the maximum 50 work units per day and returning them all as errors within seconds. But by then those work units had been flagged as 'Linux' and could not be given out to computers with another OS. Eventually all the work units became 'gentooized' and nobody else was getting any work, just the message 'there was work but not for your platform'.

Just something to consider. The benefits may outweigh the risks in this case. Not sure if the BOINC admins have a mailing list where they discuss such matters...


--------------------------------------
A member of The Knights Who Say Ni!
My BOINC stats site
ID: 4496
tiker

Joined: 1 Sep 04
Posts: 34
Credit: 19,096
RAC: 0
Message 4499 - Posted: 27 Oct 2004, 19:53:58 UTC - in response to Message 4496.  
Last modified: 27 Oct 2004, 19:55:35 UTC

> Predictor was using this feature and it ended up biting them in the rear. For
> some reason their clients started erroring out on Gentoo Linux (and a couple
> of other distros, I think). So all connected hosts running Gentoo suddenly
> started downloading the maximum 50 work units per day and returning them all
> as errors within seconds. But by then those work units had been flagged as
> 'Linux' and could not be given out to computers with another OS. Eventually
> all the work units became 'gentooized' and nobody else was getting any work,
> just the message 'there was work but not for your platform'.

That was caused by a bug in the Gentoo kernel, if I remember correctly. There are enough Linux boxes to crunch the data; the only problem for Predictor was the timing of the returned results.

I'd like to see the homogeneous redundancy enabled.
---
ID: 4499
Granite T. Rock

Joined: 24 Oct 04
Posts: 6
Credit: 67,366
RAC: 0
Message 4501 - Posted: 27 Oct 2004, 20:13:45 UTC

This may be a silly question:

If there is no cross-platform consistency, wouldn't LHC want to eliminate the inconsistency by finding out its source, rather than by ensuring that the same work unit is always processed by the same platform?

My reasoning is that if two separate platforms produce distinctly different results, how do I know which result is right, or even whether the results are valid at all? I could, however, see homogeneous redundancy being used as an aid in tracking down and understanding how the different platforms behave differently, and that information being used to make the source code stronger in the future.

Thanks!
ID: 4501
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4512 - Posted: 27 Oct 2004, 21:45:34 UTC

The difficulty with eliminating the inconsistency is that the problem largely originates in the underlying hardware. Where hardware floating point maths is used (and it is used extensively, since it is much faster than software FP maths) the FP operations often use lookup tables to improve speed (Chebyshev constants). The lookup tables can vary according to the algorithms used to generate them and the level of accuracy to which they are calculated.

These differences aren't significant unless the accuracy of your calculations is extreme. SixTrack seems to be particularly sensitive, possibly because of the large number of flops carried out (10^14 flops is common).

The BOINC validator needs to assess results within a level of accuracy to ensure that the calculated values are good. Where the variation is fairly wide it is difficult to assess whether a difference is caused by numerical variation (a good result) or by a failure in the computation (a bad result).

The use of homogeneous redundancy should allow computational errors to be eliminated, leaving only the numerical variation to worry about. Once the range of legitimate variation is known, subsequent analysis can cope with it easily.
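For illustration, the comparison the validator has to make is essentially a relative-tolerance test like the sketch below (an invented function, not the actual BOINC validator code; the whole difficulty is choosing tol):

#include <algorithm>
#include <cmath>

// Sketch of a validator's fuzzy comparison -- illustrative only.
// 'tol' must be wide enough to accept legitimate numerical variation
// between platforms, yet narrow enough to reject corrupted results.
bool results_agree(double a, double b, double tol) {
    double scale = std::max(std::fabs(a), std::fabs(b));
    if (scale == 0.0) return true;           // both results exactly zero
    return std::fabs(a - b) <= tol * scale;  // relative difference test
}

With homogeneous redundancy the legitimate variation shrinks to almost nothing, so tol can be made tight and real computational errors stand out.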


Giskard - the first telepathic robot.


ID: 4512
Alex

Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 4524 - Posted: 28 Oct 2004, 5:47:51 UTC

The funny thing is that Gentoo is a 'source' distro, i.e. you compile everything you 'install'.
Perhaps the 'emerge' command compiles binaries with certain compiler switches which affect the BOINC clients.

______________________________________________________________
Did your tech wear a static strap? No? Well, there ya go! :p
ID: 4524
Granite T. Rock

Joined: 24 Oct 04
Posts: 6
Credit: 67,366
RAC: 0
Message 4526 - Posted: 28 Oct 2004, 6:02:53 UTC - in response to Message 4512.  


> The use of homogeneous redundancy should allow computational errors to be
> eliminated, leaving only the numerical variation to worry about. Once the
> range of legitimate variation is known, subsequent analysis can cope with it
> easily.


ahhhhhhhhhhhh... (Lightbulb turning on moment)... Thanks... I like it when that happens
ID: 4526
Toby

Joined: 1 Sep 04
Posts: 137
Credit: 1,691,526
RAC: 26
Message 4529 - Posted: 28 Oct 2004, 6:57:07 UTC - in response to Message 4524.  

> The funny thing is that Gentoo is a 'source' distro, i.e. you compile
> everything you 'install'.
> Perhaps the 'emerge' command compiles binaries with certain compiler switches
> which affect the BOINC clients.

It only uses the flags you specify: "complete customization". It is probably more likely that some new library messed things up. On average I would say Gentoo users have a more up-to-date system than most people who run other distros, simply because it is so easy to update all your packages at once. Something may have gotten in with a bug, or maybe Predictor is using some deprecated functions that finally stopped being supported in a new library. I am pretty sure there were a couple of people who reported the same problem with another distro - probably those who had updated the same libraries.

But this should really be on the Predictor message boards, not the LHC ones :)


--------------------------------------
A member of The Knights Who Say Ni!
My BOINC stats site
ID: 4529
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4532 - Posted: 28 Oct 2004, 7:42:24 UTC

> But this should really be on the Predictor message boards, not the LHC ones :)

The subject of the thread is the possible use of homogeneous redundancy on LHC@Home. If another project has experience that might reflect on that then it seems valid to discuss it here.

The Gentoo problems would occur on that platform regardless of whether homogeneous redundancy was in place or not. I'd guess that the admins at CERN have a good idea of which platforms give the most errors, and can estimate the risk associated with a project change at this stage.


Giskard - the first telepathic robot.


ID: 4532

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4554 - Posted: 28 Oct 2004, 11:53:16 UTC

Correct me if I'm wrong.

As an example:

If one FPU type calculates the ring to be 27,000,001 mm long and another 26,999,999 mm, then the results differ by 2 mm. That would be outside the validated range, as it's probably looking for differences of a couple of microns. (I don't expect the above example to be numerically realistic, but it demonstrates the problem.)

Data can be used from differing FPU types, but the position of the particle at the end of the calculation will be consistent with the arithmetic used within that FPU, so, as in the example above, it will differ between different FPUs.

Collectively, 3 results on one FPU type should match, and 3 correct results for a different FPU type on the same WU would also match, but together you would end up with 2 different results. You know the spot you want the particle to be at (which would show as different for each FPU type), and what matters is where it ends up relative to that spot.

Calibration of the results should be possible, but that is a maths problem and not something the validator would have time for. The compensation 'constant' needed to correct it would differ depending on the work unit's length.

So I suppose it's better that work units are given to machines with the same FPU type, as it saves validation time and processor load.
Dave

ID: 4554
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4557 - Posted: 28 Oct 2004, 13:05:21 UTC - in response to Message 4554.  

> Correct me if I'm wrong.
>

I think I've followed your argument, and you're basically correct. The validator's purpose is to compare results and reject the 'bad' ones, i.e. those that have suffered corruption during the computation.

Certainly, correction of inaccuracies falls outside the validator's scope; that properly belongs to any subsequent statistical analysis of the results. And there it is more important to know the size of the possible error than to correct it before processing.

Inaccuracies are in the nature of binary floating point arithmetic. If it is apparent that the inaccuracies are consistent within a processor family, but inconsistent between families, then my feeling is that the two families of processors should be separated.


Giskard - the first telepathic robot.


ID: 4557

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4559 - Posted: 28 Oct 2004, 13:55:49 UTC

But is there also the case where differences in the way different operating systems interact with the FPU can lead to different results on the same FPU under different operating systems?

An example: suppose the last bit is rounded differently between operating systems. Generally a 32-bit floating point calculation produces extra result bits that are stripped from the final value, but it may be possible for the environment to detect them and carry their state into the next calculation. Differences would arise here if one OS reset this each time while another passed it on to the next calculation. That would mean different operating systems gave differing results as well.

Or the operating systems could be using different floating point number formats (I believe there are a few standards for the binary representation of a floating point number), although I'm not sure which standard is used by each OS.
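That kind of carried precision is real on the x87 FPU, where intermediates can be held to 80 bits until the compiler stores them to memory. A small demonstration (whether the two lines differ depends on the compiler and FPU settings, which is exactly the point):

#include <cstdio>

int main() {
    double a = 1e16, b = 1.0, c = -1e16;

    volatile double rounded = a + b;  // forced out to 64-bit memory
    double kept = a + b;              // may stay in an 80-bit register

    // 1e16 + 1 is not representable as a 64-bit double, so the rounded
    // sum collapses back to 1e16; an 80-bit intermediate keeps the +1.
    printf("rounded between ops: %.1f\n", rounded + c);  // 0.0
    printf("possibly kept wide:  %.1f\n", kept + c);     // 0.0 or 1.0
    return 0;
}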

Dave

ID: 4559
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4569 - Posted: 28 Oct 2004, 16:28:47 UTC

I see what you're getting at, but I can't see why the operating system would be fiddling with application data.

It's more likely that either the compiler implementation differs in its treatment of integer-to-single or double-to-single conversions, or the host machine has a fault in its FPU.


Giskard - the first telepathic robot.


ID: 4569

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4570 - Posted: 28 Oct 2004, 18:16:36 UTC

No, not to do with a hardware error or fault, but I found this from MathWorks:

"The Repeating Sequence block generates different numbers on Windows than on Linux. The differences are slight: on the order of 1e-16. However, the differences can be significant when simulation results depend on comparing the generated sequence, e.g., to zero. A workaround is to quantize the output of the Repeating Sequence block so that values near 0 become zero. For example, you could put a Dead Zone block with zone [0...1e-15] between the Repeating Sequence block and the block performing the comparison."

And that's on the same hardware platform!
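Their workaround is just a quantization step before the comparison; in code it would look something like this sketch (my illustration of the quoted Dead Zone idea, not MathWorks code):

#include <cmath>

// Snap values within 'threshold' of zero to exactly zero, so a
// platform-dependent residue of order 1e-16 compares equal to 0.
double dead_zone(double x, double threshold = 1e-15) {
    return (std::fabs(x) <= threshold) ? 0.0 : x;
}

Comparisons against zero are then made on dead_zone(x) rather than on x itself.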

Dave


ID: 4570
[BOINCstats] Willy

Joined: 2 Sep 04
Posts: 7
Credit: 371,270
RAC: 0
Message 4576 - Posted: 28 Oct 2004, 21:17:35 UTC

I still find it awkward that different CPUs produce different results, especially between Intel and AMD. Weren't these supposed to be compatible? How are we going to find out which one of them calculates correctly?

1+1 should always be 2, not 2.00000000000000000000000000001 on Intel and 1.999999999999999999999999999999999999999999999999999999 on AMD.


ID: 4576
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4577 - Posted: 29 Oct 2004, 5:13:38 UTC - in response to Message 4576.  
Last modified: 29 Oct 2004, 6:28:34 UTC

> I still find it awkward that different CPUs produce different results,
> especially between Intel and AMD. Weren't these supposed to be compatible?
> How are we going to find out which one of them calculates correctly?
>
> 1+1 should always be 2, not 2.00000000000000000000000000001 on Intel and
> 1.999999999999999999999999999999999999999999999999999999 on AMD.

I haven't tried this, but I suspect that 1.0e0 + 1.0e0 (I use the FP notation deliberately) is 2.0e0 on both processors. 1.567345635467435E3 * 3.564344567322e4 isn't likely to be as exact. The problem is that we are representing a base-10 FP number in binary, which inevitably carries some approximation, and hence error. Both processors are correct in their calculation, subject to a limit on accuracy. For a single operation the difference is likely to be less than 1 x 10^-15.
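Both halves of that are easy to see on any IEEE 754 machine (a tiny demonstration, nothing processor-specific about it):

#include <cstdio>

int main() {
    double two = 1.0 + 1.0;  // exact: small integers are representable
    double sum = 0.1 + 0.2;  // inexact: neither 0.1 nor 0.2 is

    printf("1.0 + 1.0 = %.17g\n", two);  // 2
    printf("0.1 + 0.2 = %.17g\n", sum);  // 0.30000000000000004
    printf("0.1 + 0.2 == 0.3? %s\n", sum == 0.3 ? "yes" : "no");  // no
    return 0;
}

The error appears before any arithmetic happens, simply in representing the base-10 constants in binary.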

We are, however, way off topic, and this isn't really my field. If you want to know more then search Google for references to IEEE754 - that should tell you all you need to know!


Giskard - the first telepathic robot.


ID: 4577

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4597 - Posted: 29 Oct 2004, 13:29:57 UTC


If you check your results against the hosts that have also completed the same work units (verified results), I think you will see that platform differences are not proving to be an issue at present.

I should also add that, from how I read it, homogeneous redundancy can be turned on and off for the whole project, or, quoting the documentation: "Alternatively, you can enable it selectively for a single application by setting the homogeneous_redundancy field in its database record."

It is also beneficial to leave it off, as it is useful to know whether the inconsistency between platforms starts to become a bigger problem. And if one particular platform started constantly failing, that would become noticeable quickly, rather than all the work units issued to that platform having to be reissued.

And what's the use of allowing possibly 100%-incorrect but 'valid' results, where the platform they were processed on gave them all the same error? With homogeneous redundancy enabled that would only give an incorrect picture of the simulation, which could lead to bigger problems and cost a lot of time and expense, and possibly a redesign of the LHC to compensate for a non-existent correction.

If there are differences between results from different platforms, then there is a problem in the maths somewhere. If not, then the calculations are fine and good data is being collected. And what better way to ensure the data is valid than to have matching cross-platform results?

Dave

ID: 4597

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4610 - Posted: 29 Oct 2004, 15:20:52 UTC

Off topic, but:

A rough calculation of the error could give a difference of about 0.000027 mm on a 1,000,000-turn calculation between the AMD and Intel FPUs, and an actual error on both of approximately the same size, due to the limits of 32-bit maths.

Dave

ID: 4610
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4618 - Posted: 29 Oct 2004, 17:48:30 UTC - in response to Message 4610.  

> Off topic, but:
>
> A rough calculation of the error could give a difference of about
> 0.000027 mm on a 1,000,000-turn calculation between the AMD and Intel
> FPUs, and an actual error on both of approximately the same size, due
> to the limits of 32-bit maths.
>
> Dave

Pentium 4 FPU registers are 128 bits wide. P3 is 64 bits wide, I think. As I said, this isn't my field.

Giskard - the first telepathic robot.


ID: 4618

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4620 - Posted: 29 Oct 2004, 18:19:45 UTC


> Pentium 4 FPU registers are 128 bits wide. P3 is 64 bits wide

Yes, but a float32 instruction yields a 32-bit result regardless of the capability of the FPU.

Returning a 64-bit one would require a rewrite of the code that was using it, and it would become a float64... the same situation, but losing those processors not able to handle 64-bit floats.
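A quick illustration of that rounding-on-store (the width of the FPU registers doesn't rescue the stored result):

#include <cstdio>

int main() {
    double d = 1.0 / 3.0;           // rounded to a 53-bit mantissa
    float  f = (float)(1.0 / 3.0);  // rounded again to 24 bits on store

    printf("double: %.17g\n", d);          // 0.33333333333333331
    printf("float:  %.17g\n", (double)f);  // 0.3333333432674408
    return 0;
}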


Dave


ID: 4620
