Message boards : Number crunching : @Markku - homogeneous redundancy

Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4481 - Posted: 27 Oct 2004, 13:47:32 UTC

Markku

I just checked the scheduler source code. The homogeneous redundancy feature checks the processor type for the strings 'Intel', 'AMD' and 'Macintosh'. OS names checked are 'Windows', 'Linux', 'Darwin' and 'SunOS'. It seems likely that just turning on homogeneous redundancy will eliminate the sort of cross-platform inconsistencies you're getting at a stroke, and without setting up any new platforms.
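In outline the check amounts to something like the sketch below. To be clear, this is only my illustration of the idea; the type and function names are invented, not the actual scheduler source.

#include <string>

// Illustrative sketch of homogeneous redundancy classification --
// invented names, not the real BOINC scheduler code. Hosts are
// bucketed by CPU vendor string and OS name, and all results for a
// workunit go to hosts in a single bucket.
enum CpuClass { CPU_INTEL, CPU_AMD, CPU_MAC, CPU_UNKNOWN };
enum OsClass  { OS_WINDOWS, OS_LINUX, OS_DARWIN, OS_SUNOS, OS_UNKNOWN };

CpuClass classify_cpu(const std::string& vendor) {
    if (vendor.find("Intel") != std::string::npos)     return CPU_INTEL;
    if (vendor.find("AMD") != std::string::npos)       return CPU_AMD;
    if (vendor.find("Macintosh") != std::string::npos) return CPU_MAC;
    return CPU_UNKNOWN;
}

OsClass classify_os(const std::string& os_name) {
    if (os_name.find("Windows") != std::string::npos) return OS_WINDOWS;
    if (os_name.find("Linux") != std::string::npos)   return OS_LINUX;
    if (os_name.find("Darwin") != std::string::npos)  return OS_DARWIN;
    if (os_name.find("SunOS") != std::string::npos)   return OS_SUNOS;
    return OS_UNKNOWN;
}

// Two hosts may share a workunit only if both classifications match.
bool same_hr_class(const std::string& v1, const std::string& o1,
                   const std::string& v2, const std::string& o2) {
    return classify_cpu(v1) == classify_cpu(v2)
        && classify_os(o1) == classify_os(o2);
}

So a workunit first issued to an Intel/Windows host would only ever be re-issued to other Intel/Windows hosts.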

Rom Walton at Berkeley should be able to confirm this.

Sorry if you've been here already. I just hope this is useful.


Giskard - the first telepathic robot.


ID: 4481
Toby

Joined: 1 Sep 04
Posts: 137
Credit: 1,691,526
RAC: 26
Message 4496 - Posted: 27 Oct 2004, 17:52:04 UTC

Predictor was using this feature and it ended up biting them in the rear. For some reason their clients started erroring out on Gentoo Linux (and a couple of other distros, I think). So all connected hosts running Gentoo suddenly started downloading the maximum 50 work units per day and returning them all as errors within seconds. But by then those work units had been flagged as 'Linux' and could not be given out to computers with another OS. Eventually all the work units became 'gentooized' and nobody else was getting any work, just the message 'there was work but not for your platform'.

Just something to consider. The benefits may outweigh the risks in this case. Not sure if the BOINC admins have a mailing list where they discuss such matters...


--------------------------------------
A member of The Knights Who Say Ni!
My BOINC stats site
ID: 4496
tiker

Joined: 1 Sep 04
Posts: 34
Credit: 19,096
RAC: 0
Message 4499 - Posted: 27 Oct 2004, 19:53:58 UTC - in response to Message 4496.  
Last modified: 27 Oct 2004, 19:55:35 UTC

> Predictor was using this feature and it ended up biting them in the rear. For
> some reason their clients started erroring out on Gentoo Linux (and a couple
> of other distros, I think). So all connected hosts running Gentoo suddenly
> started downloading the maximum 50 work units per day and returning them all
> as errors within seconds. But by then those work units had been flagged as
> 'Linux' and could not be given out to computers with another OS. Eventually
> all the work units became 'gentooized' and nobody else was getting any work,
> just the message 'there was work but not for your platform'.

That was caused by a bug in the Gentoo kernel, if I remember correctly. There are enough Linux boxes to crunch the data; the only problem for Predictor was the timing of the returned results.

I'd like to see the homogeneous redundancy enabled.
---
ID: 4499
Granite T. Rock

Joined: 24 Oct 04
Posts: 6
Credit: 67,366
RAC: 0
Message 4501 - Posted: 27 Oct 2004, 20:13:45 UTC

This may be a silly question:

If there is no cross-platform consistency, wouldn't LHC want to eliminate the inconsistency by finding out its source, rather than by ensuring that the same work unit is always processed by the same platform?

My reasoning is that if two separate platforms produce distinctly different results, how do I know which result is right, or even whether the results are valid at all? I could, however, see homogeneous redundancy being used as an aid in tracking down and understanding how the different platforms behave differently, and that information being used to make the source code stronger in the future.

Thanks!
ID: 4501
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4512 - Posted: 27 Oct 2004, 21:45:34 UTC

The difficulty with eliminating the inconsistency is that the problem largely originates in the underlying hardware. Where hardware floating point maths is used (and it is used extensively, since it is much faster than software FP maths) the FP operations often use lookup tables to improve speed (Chebyshev constants). The lookup tables can vary according to the algorithms used to generate them and the level of accuracy to which they are calculated.

These differences aren't significant unless the accuracy of your calculations is extreme. SixTrack seems to be particularly sensitive, possibly because of the large number of flops carried out (10^14 flops is common).

The BOINC validator needs to assess results within a level of accuracy to ensure that the calculated values are good. Where the variation is fairly wide it is difficult to assess whether a difference is caused by numerical variation (a good result) or by a failure in the computation (a bad result).

The use of homogeneous redundancy should allow computational errors to be eliminated, leaving only the numerical variation to worry about. Once the range of legitimate variation is known, subsequent analysis can cope with it easily.
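For illustration, the comparison the validator has to make is essentially a relative-tolerance test like the sketch below (an invented function, not the actual BOINC validator code; the whole difficulty is choosing tol):

#include <algorithm>
#include <cmath>

// Sketch of a validator's fuzzy comparison -- illustrative only.
// 'tol' must be wide enough to accept legitimate numerical variation
// between platforms, yet narrow enough to reject corrupted results.
bool results_agree(double a, double b, double tol) {
    double scale = std::max(std::fabs(a), std::fabs(b));
    if (scale == 0.0) return true;           // both results exactly zero
    return std::fabs(a - b) <= tol * scale;  // relative difference test
}

With homogeneous redundancy the legitimate variation shrinks to almost nothing, so tol can be made tight and real computational errors stand out.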


Giskard - the first telepathic robot.


ID: 4512
Alex

Joined: 2 Sep 04
Posts: 378
Credit: 10,765
RAC: 0
Message 4524 - Posted: 28 Oct 2004, 5:47:51 UTC

The funny thing is that Gentoo is a 'source' distro, i.e. you compile everything you 'install'.
Perhaps the 'emerge' command compiles binaries with certain compiler switches which affect the BOINC clients.

______________________________________________________________
Did your tech wear a static strap? No? Well, there ya go! :p
ID: 4524
Granite T. Rock

Joined: 24 Oct 04
Posts: 6
Credit: 67,366
RAC: 0
Message 4526 - Posted: 28 Oct 2004, 6:02:53 UTC - in response to Message 4512.  


> The use of homogeneous redundancy should allow computational errors to be
> eliminated, leaving only the numerical variation to worry about. Once the
> range of legitimate variation is known, subsequent analysis can cope with it
> easily.


ahhhhhhhhhhhh... (Lightbulb turning on moment)... Thanks... I like it when that happens
ID: 4526
Toby

Joined: 1 Sep 04
Posts: 137
Credit: 1,691,526
RAC: 26
Message 4529 - Posted: 28 Oct 2004, 6:57:07 UTC - in response to Message 4524.  

> The funny thing is that Gentoo is a 'source' distro, i.e. you compile
> everything you 'install'.
> Perhaps the 'emerge' command compiles binaries with certain compiler switches
> which affect the BOINC clients.

It only uses the flags you specify: "complete customization". It is probably more likely that some new library messed things up. On average I would say Gentoo users have a more up-to-date system than most people who run other distros, simply because it is so easy to update all your packages at once. Something may have gotten in with a bug, or maybe Predictor is using some deprecated functions that finally stopped being supported in a new library. I am pretty sure there were a couple of people who reported the same problem with another distro - probably those who had updated the same libraries.

But this should really be on the Predictor message boards, not the LHC ones :)


--------------------------------------
A member of The Knights Who Say Ni!
My BOINC stats site
ID: 4529
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4532 - Posted: 28 Oct 2004, 7:42:24 UTC

> But this should really be on the Predictor message boards, not the LHC ones :)

The subject of the thread is the possible use of homogeneous redundancy on LHC@Home. If another project has experience that might reflect on that then it seems valid to discuss it here.

The Gentoo problems would occur on that platform regardless of whether homogeneous redundancy was in place or not. I'd guess that the admins at CERN have a good idea of which platforms give the most errors, and can estimate the risk associated with a project change at this stage.


Giskard - the first telepathic robot.


ID: 4532

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4554 - Posted: 28 Oct 2004, 11:53:16 UTC

Correct me if I'm wrong.

As an example:

If one FPU type calculates the ring to be 27,000,001 mm long and another 26,999,999 mm, then the results differ by 2 mm. That would be outside the validated range, as it's probably looking for differences of a couple of microns. (I don't expect the above example to be numerically realistic, but it demonstrates the problem.)

Data can be used from differing FPU types, but the position of the particle at the end of the calculation will be consistent with the arithmetic used within that FPU, so, as in the example above, it will differ between different FPUs.

Collectively, 3 results on one FPU type should match, and 3 correct results for a different FPU type on the same WU would also match, but together you would end up with 2 different results. You know the spot you want the particle to be at (which would show as different for each FPU type), and what matters is where it ends up relative to that spot.

Calibration of the results should be possible, but that is a maths problem and not something the validator would have time for. The compensation 'constant' needed to correct it would differ depending on the work unit's length.

So I suppose it's better that work units are given to machines with the same FPU type, as it saves validation time and processor load.
Dave

ID: 4554
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4557 - Posted: 28 Oct 2004, 13:05:21 UTC - in response to Message 4554.  

> Correct me if I'm wrong.
>

I think I've followed your argument, and you're basically correct. The validator's purpose is to compare results and reject the 'bad' ones, i.e. those that have suffered corruption during the computation.

Certainly, correction of inaccuracies falls outside the validator's scope; that properly belongs to any subsequent statistical analysis of the results. And there it is more important to know the size of the possible error than to correct it before processing.

Inaccuracies are in the nature of binary floating point arithmetic. If it is apparent that the inaccuracies are consistent within a processor family, but inconsistent between families, then my feeling is that the two families of processors should be separated.


Giskard - the first telepathic robot.


ID: 4557

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4559 - Posted: 28 Oct 2004, 13:55:49 UTC

But is there also the case where differences in the way different operating systems interact with the FPU can lead to different results on the same FPU under different operating systems?

An example: suppose the last bit is rounded differently between operating systems. Generally a 32-bit floating point calculation produces extra result bits that are stripped from the final value, but it may be possible for the environment to detect them and carry their state into the next calculation. Differences would arise here if one OS reset this each time while another passed it on to the next calculation. That would mean different operating systems gave differing results as well.

Or the operating systems could be using different floating point number formats (I believe there are a few standards for the binary representation of a floating point number), although I'm not sure which standard is used by each OS.
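That kind of carried precision is real on the x87 FPU, where intermediates can be held to 80 bits until the compiler stores them to memory. A small demonstration (whether the two lines differ depends on the compiler and FPU settings, which is exactly the point):

#include <cstdio>

int main() {
    double a = 1e16, b = 1.0, c = -1e16;

    volatile double rounded = a + b;  // forced out to 64-bit memory
    double kept = a + b;              // may stay in an 80-bit register

    // 1e16 + 1 is not representable as a 64-bit double, so the rounded
    // sum collapses back to 1e16; an 80-bit intermediate keeps the +1.
    printf("rounded between ops: %.1f\n", rounded + c);  // 0.0
    printf("possibly kept wide:  %.1f\n", kept + c);     // 0.0 or 1.0
    return 0;
}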

Dave

ID: 4559
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4569 - Posted: 28 Oct 2004, 16:28:47 UTC

I see what you're getting at, but I can't see why the operating system would be fiddling with application data.

It's more likely that either the compiler implementation differs in its treatment of integer-to-single or double-to-single conversions, or the host machine has a fault in its FPU.


Giskard - the first telepathic robot.


ID: 4569

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4570 - Posted: 28 Oct 2004, 18:16:36 UTC

No, not to do with a hardware error or fault, but I found this from MathWorks:

"The Repeating Sequence block generates different numbers on Windows than on Linux. The differences are slight: on the order of 1e-16. However, the differences can be significant when simulation results depend on comparing the generated sequence, e.g., to zero. A workaround is to quantize the output of the Repeating Sequence block so that values near 0 become zero. For example, you could put a Dead Zone block with zone [0...1e-15] between the Repeating Sequence block and the block performing the comparison."

And that's on the same hardware platform!
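Their workaround is just a quantization step before the comparison; in code it would look something like this sketch (my illustration of the quoted Dead Zone idea, not MathWorks code):

#include <cmath>

// Snap values within 'threshold' of zero to exactly zero, so a
// platform-dependent residue of order 1e-16 compares equal to 0.
double dead_zone(double x, double threshold = 1e-15) {
    return (std::fabs(x) <= threshold) ? 0.0 : x;
}

Comparisons against zero are then made on dead_zone(x) rather than on x itself.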

Dave


ID: 4570
[BOINCstats] Willy

Joined: 2 Sep 04
Posts: 7
Credit: 371,270
RAC: 0
Message 4576 - Posted: 28 Oct 2004, 21:17:35 UTC

I still find it awkward that different CPUs produce different results, especially between Intel and AMD. Weren't these supposed to be compatible? How are we going to find out which one of them calculates correctly?

1+1 should always be 2, not 2.00000000000000000000000000001 on Intel and 1.999999999999999999999999999999999999999999999999999999 on AMD.


ID: 4576
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4577 - Posted: 29 Oct 2004, 5:13:38 UTC - in response to Message 4576.  
Last modified: 29 Oct 2004, 6:28:34 UTC

> I still find it awkward that different CPUs produce different results,
> especially between Intel and AMD. Weren't these supposed to be compatible?
> How are we going to find out which one of them calculates correctly?
>
> 1+1 should always be 2, not 2.00000000000000000000000000001 on Intel and
> 1.999999999999999999999999999999999999999999999999999999 on AMD.

I haven't tried this, but I suspect that 1.0e0 + 1.0e0 (I use the FP notation deliberately) is 2.0e0 on both processors. 1.567345635467435E3 * 3.564344567322e4 isn't likely to be as exact. The problem is that we are representing a base-10 FP number in binary, which inevitably carries some approximation, and hence error. Both processors are correct in their calculation, subject to a limit on accuracy. For a single operation the difference is likely to be less than 1 x 10^-15.
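Both halves of that are easy to see on any IEEE 754 machine (a tiny demonstration, nothing processor-specific about it):

#include <cstdio>

int main() {
    double two = 1.0 + 1.0;  // exact: small integers are representable
    double sum = 0.1 + 0.2;  // inexact: neither 0.1 nor 0.2 is

    printf("1.0 + 1.0 = %.17g\n", two);  // 2
    printf("0.1 + 0.2 = %.17g\n", sum);  // 0.30000000000000004
    printf("0.1 + 0.2 == 0.3? %s\n", sum == 0.3 ? "yes" : "no");  // no
    return 0;
}

The error appears before any arithmetic happens, simply in representing the base-10 constants in binary.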

We are, however, way off topic, and this isn't really my field. If you want to know more then search Google for references to IEEE754 - that should tell you all you need to know!


Giskard - the first telepathic robot.


ID: 4577

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4597 - Posted: 29 Oct 2004, 13:29:57 UTC


If you check your results against the hosts that have also completed the same work units (verified results), I think you will see that platform differences are not proving to be an issue at present.

I should also add that, from how I read it, homogeneous redundancy can be turned on and off for the whole project, or, quoting the documentation: "Alternatively, you can enable it selectively for a single application by setting the homogeneous_redundancy field in its database record."

It is also beneficial to leave it off, as it is useful to know whether the inconsistency between platforms starts to become a bigger problem. And if one particular platform started constantly failing, that would become noticeable quickly, rather than all the work units issued to that platform having to be reissued.

And what's the use of allowing possibly 100%-incorrect but 'valid' results, where the platform they were processed on gave them all the same error? With homogeneous redundancy enabled that would only give an incorrect picture of the simulation, which could lead to bigger problems and cost a lot of time and expense, and possibly a redesign of the LHC to compensate for a non-existent correction.

If there are differences between results from different platforms, then there is a problem in the maths somewhere. If not, then the calculations are fine and good data is being collected. And what better way to ensure the data is valid than to have matching cross-platform results?

Dave

ID: 4597

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4610 - Posted: 29 Oct 2004, 15:20:52 UTC

Off topic, but:

A rough calculation of the error could give a difference of about 0.000027 mm on a 1,000,000-turn calculation between the AMD and Intel FPUs, and an actual error on both of approximately the same size, due to the limits of 32-bit maths.

Dave

ID: 4610
Gaspode the UnDressed

Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 4618 - Posted: 29 Oct 2004, 17:48:30 UTC - in response to Message 4610.  

> Off topic, but:
>
> A rough calculation of the error could give a difference of about
> 0.000027 mm on a 1,000,000-turn calculation between the AMD and Intel
> FPUs, and an actual error on both of approximately the same size, due
> to the limits of 32-bit maths.
>
> Dave

Pentium 4 FPU registers are 128 bits wide. P3 is 64 bits wide, I think. As I said, this isn't my field.

Giskard - the first telepathic robot.


ID: 4618

Joined: 25 Oct 04
Posts: 27
Credit: 1,205
RAC: 0
Message 4620 - Posted: 29 Oct 2004, 18:19:45 UTC


> Pentium 4 FPU registers are 128 bits wide. P3 is 64 bits wide

Yes, but a float32 instruction yields a 32-bit result regardless of the capability of the FPU.

Returning a 64-bit one would require a rewrite of the code that was using it, and it would become a float64... the same situation, but losing those processors not able to handle 64-bit floats.
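A quick illustration of that rounding-on-store (the width of the FPU registers doesn't rescue the stored result):

#include <cstdio>

int main() {
    double d = 1.0 / 3.0;           // rounded to a 53-bit mantissa
    float  f = (float)(1.0 / 3.0);  // rounded again to 24 bits on store

    printf("double: %.17g\n", d);          // 0.33333333333333331
    printf("float:  %.17g\n", (double)f);  // 0.3333333432674408
    return 0;
}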


Dave


ID: 4620
