Message boards : Number crunching : Results discrepancies

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 25507 - Posted: 25 Mar 2013, 12:59:37 UTC

Well, here I am sitting by the pool, looking at the
same problem I found while sitting here 4 months
ago! Let me remind you I am/was running two very
dense intensity scans, each consisting of a dozen
or so studies, each with a different beam intensity
i.e. bunch charge, ranging from 0 up to a maximum of
400,000,000,000 protons. In turn each study had up to
100,000 different cases with different initial amplitudes,
angles, and magnet errors, each attempting one million turns.
The result of each case is a file fort.10 (gzipped)
containing one line of 60 double precision
numbers for each pair of particles being tracked
(typically 30 lines). The BOINC validation checks that
50 or so of these numbers are identical, excluding a few
values like the SixTrack Version number and the CPU time
(a sketch of this check is given below).
While re-running a few cases in a study to clear the tail
of incomplete cases I found I had two VALIDATED results for
one case which were DIFFERENT!!! For me this is a big problem
and I have been really worried trying to find an explanation.
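To make the check above concrete: it amounts to a bit-for-bit comparison of each of the 60 words of a line, with a handful of positions excluded. A minimal sketch in C, not the actual validator code, with made-up positions standing in for the excluded words (version number, CPU time):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define NWORDS 60   /* doubles per fort.10 line (one line per particle pair) */

    /* Hypothetical positions of the words allowed to differ, e.g. the
     * SixTrack version number and the CPU time; the real indices live
     * in the validator, not here. */
    static const int skip[] = { 51, 59 };

    static bool is_skipped(int w)
    {
        for (size_t k = 0; k < sizeof skip / sizeof skip[0]; ++k)
            if (skip[k] == w) return true;
        return false;
    }

    /* Two result lines match only if every non-excluded word is
     * bit-for-bit identical, i.e. a 0 ULP difference. */
    static bool lines_match(const double a[NWORDS], const double b[NWORDS])
    {
        for (int w = 0; w < NWORDS; ++w) {
            if (is_skipped(w)) continue;
            if (memcmp(&a[w], &b[w], sizeof(double)) != 0) return false;
        }
        return true;
    }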

The (very) good news is that the differences are small and that
the physics results are not seriously affected. The bad news is
that I have not (yet) found out how this can happen or the number
of cases affected. In this particular case each of the 30 lines
had 20 differences (the first 9 numbers on each line reflect the
input and prove that it was identical in all runs). The numbers
which differ still agree to roughly single precision, BUT I expect
and normally obtain a 0 ULP difference in all cases, i.e. NO
difference at all.
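For reference, a "ULP difference" can be measured by mapping each double's bit pattern onto a monotonic integer scale, so that adjacent representable values differ by exactly 1. An illustrative helper (not from SixTrack; assumes finite, non-NaN values of comparable magnitude):

    #include <stdint.h>
    #include <string.h>

    /* Map a double's bit pattern onto a monotonically increasing integer
     * scale (the branch handles negative values, including -0.0). */
    static int64_t ordered_bits(double x)
    {
        int64_t i;
        memcpy(&i, &x, sizeof i);
        return (i >= 0) ? i : INT64_MIN - i;
    }

    /* Distance in units in the last place: 0 means bit-identical, which
     * is what matching results normally show; agreement only at single
     * precision level corresponds to something like 2^29 ULPs. */
    static int64_t ulp_distance(double a, double b)
    {
        int64_t d = ordered_bits(a) - ordered_bits(b);
        return (d >= 0) ? d : -d;
    }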

I remind you we run each case twice and accept results when we have
two identical results, re-running as necessary, until that is true.
The odd man (men) out are rejected and are normally attributable to
hardware error e.g. an overclocked CPU. We assume that the probability
of getting the same wrong answer twice is very very small and can be
ignored.

I now reran this case (known as PROBLEM) on various systems at CERN and
with different compilers and always got the same result. I also reran
the case on BOINC several times and also got the same answer. I also ran
with my new SixTrack Version 4446 instead of 4441 and again got the same
answer. With some help I checked the BOINC validation logs and looked
for the two hosts which returned the same wrong result but they seemed
to be OK hosts (my detailed notes are in Geneva).

I then ran into problems with disk space and the power supply on my
desktop Linux box failed. However I had copied one scan to another Linux box
and continued that intensity scan from there (WLXSCAN0) and suspended activity
on WLSCAN2 with the PROBLEM.

This scan WLXSCAN0 is now pretty much complete after adjustment of the
initial amplitude ranges (lower bounds for high intensity studies,
higher bounds for low intensity studies). As a check I took a subrange of
initial amplitudes of one study and ran those cases again on BOINC.
(Now my other deskside Linux box is down so I can't check the number of
cases right now, but it is of the order of a few tens of thousands.)
Crosschecking now found 5 result discrepancies, which I call PROB1, PROB2,
..., to PROB5.

PROB1 has one line/pair different out of 30, words 12, 14, 19, 20 etc.
PROB2 gives an I/O error because two bytes on one line which should be
"00" are "o_". This has probably been overwritten subsequent to BOINC
validation and can be ignored as I reject it automatically.
PROB3 has one word different, Word 10, on one line.
PROB4 has ALL lines/pairs different from Word 10, 12, 13, 14 onwards.
PROB5 has one line with one word, Word 10, different.

So, all results apart from PROB2 are acceptable (and it is rejected so
no problem) because the lost turn number is the same. Hence physics should
be OK. However I wonder if there are other cases I haven't double checked.
The number of errors detected is (very) small but is still a big worry.
I have never seen this before. I have made a search for the hosts concerned
but so far cannot find a common factor. I CANNOT reproduce the errors at CERN or
on BOINC despite running these 5 cases many times. I (strongly) suspect an
error in the SixTrack post-processing which has been changed recently. I also
have ideas about the results depending on the date/time!!! Variable length strings
might cause alignment problems leading to a different code path. Strange,
but it has happened before. I cannot reproduce the problems even with three other
compilers here at CERN. All results are correct and identical down to the
last bit.

I am reluctant to rerun an entire study as I feel it is a "waste" of your
computer time, but it may come to that in the end. I also get correct
results with the latest SixTrack version 4446 which I may therefore install
anyway.

All this to keep you informed, if you are interested, and to explain my
reluctance to finish off the intensity scans right now. So back to desk
checking the post-processing and back to CERN on Wednesday.

Finally, whether you have read this far or skipped to here, I quote from the latest press
release from the DG:

From rolf.heuer@cern.ch Thu Mar 14 10:30:39 2013
New results indicate that particle discovered at CERN is a Higgs boson.
At the Moriond Conference today, the ATLAS and CMS
collaborations at CERN's Large Hadron Collider (LHC) presented preliminary new
results that further elucidate the particle discovered last year. Having
analysed two and a half times more data than was available for the discovery
announcement in July, they find that the new particle is looking more and more
like a Higgs boson, the particle linked to the mechanism that gives mass to
elementary particles. It remains an open question, however, whether this is the
Higgs boson of the Standard Model of particle physics, or possibly the lightest
of several bosons predicted in some theories that go beyond the Standard Model.
Finding the answer to this question will take time.
etc etc etc.
The detection of the boson is a very rare event - it takes around 1 trillion
(10^12) proton-proton collisions for each observed event. To characterize all
of the decay modes will require much more data from the LHC.
ID: 25507
Ravens

Joined: 14 Dec 06
Posts: 29
Credit: 128,225
RAC: 0
Message 25508 - Posted: 25 Mar 2013, 13:28:42 UTC - in response to Message 25507.  

If re-running the study will improve the science, I don't consider it 'wasting' my CPU time. I'm in!
ID: 25508
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 850
Credit: 692,713,859
RAC: 95,524
Message 25509 - Posted: 25 Mar 2013, 23:05:01 UTC

Is there a way to improve the error checking to ensure that any results that don't match are caught?

My thought is that the root cause is a PC error? If it's a slow WU then the probability of something going wrong is bigger, and hence it shows up in the tail?
ID: 25509
carpe noctem...

Joined: 19 Apr 10
Posts: 2
Credit: 2,394,868
RAC: 0
Message 25510 - Posted: 26 Mar 2013, 2:31:22 UTC

"We assume that the probability of getting the same wrong answer twice is very very small and can be ignored."

What about results from CPU releases with known design errors, which were resolved in a later revision but are still in use and produce errors in just a few rare situations?

Voltage peaks at the same time in two different countries are possible, too.
Most personal computers don't have redundant components like ECC RAM, a UPS, etc.

If you need reliable results, I guess it is necessary to send tasks out to more than only two hosts.
There are other projects in the distributed computing world which also need reliable results, and the scientists there found it necessary to send one task to up to eighteen (!) hosts at the same time.

Sorry for my bad English. I hope you don't mind; it is not my native language.
ID: 25510
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 25511 - Posted: 26 Mar 2013, 17:59:05 UTC - in response to Message 25510.  

You are correct and I am aware of these possibilities.....
Running 3 times seems a bit much though. Eric.
ID: 25511
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 850
Credit: 692,713,859
RAC: 95,524
Message 25512 - Posted: 27 Mar 2013, 0:00:55 UTC - in response to Message 25511.  

Surely, running 3 times to ensure that science has the correct results is worthwhile?

Even if it's a temporary measure until you understand why there were some issues.

I assume you can add some additional logging to understand the issues?
ID: 25512
Ravens

Joined: 14 Dec 06
Posts: 29
Credit: 128,225
RAC: 0
Message 25513 - Posted: 27 Mar 2013, 16:34:23 UTC - in response to Message 25512.  

I concur Toby - it's worth it.
ID: 25513
Christoph

Joined: 25 Aug 05
Posts: 69
Credit: 306,627
RAC: 0
Message 25514 - Posted: 29 Mar 2013, 0:57:39 UTC

I will think about this before giving my opinion... I started to write something and then stopped.
Better to sleep on it first before speaking up this late at night.
Christoph
ID: 25514
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 25521 - Posted: 21 Apr 2013, 11:50:32 UTC

Just a brief update; I took advantage of the Easter break
to rerun part of a study with a replication factor of
three and requesting that three results match. Still found
five result discrepancies out of 30,000! I must have a
problem but I cannot replicate it at CERN. I am still
analysing the database and looking at hosts and also for
rejected results. I am also trying to figure out how to use
some volunteers to help me test. Sorry for delay.
(May start some production to finish an intensity scan as
the anomalies are so few in number.) Eric.
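For a rough feel for what 5 discrepancies out of 30,000 implies about the underlying rate, a simple normal-approximation interval can be computed from just those two numbers; this is illustrative arithmetic only, not project code:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double k = 5.0, n = 30000.0;   /* discrepancies / cases, as quoted */
        const double p = k / n;              /* observed rate, about 1.7e-4 */
        const double se = sqrt(p * (1.0 - p) / n);
        /* Rough 95% interval; for such rare events an exact binomial or
         * Poisson interval would be the more careful choice. */
        printf("rate = %.2e, 95%% CI approx [%.2e, %.2e]\n",
               p, p - 1.96 * se, p + 1.96 * se);
        return 0;
    }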
ID: 25521
carpe noctem...

Joined: 19 Apr 10
Posts: 2
Credit: 2,394,868
RAC: 0
Message 25522 - Posted: 21 Apr 2013, 14:31:38 UTC

I'm keeping my fingers crossed tightly that a solution will be found very soon!
Thank you for keeping us up-to-date.
ID: 25522
Robert Pick

Joined: 1 Dec 05
Posts: 62
Credit: 11,441,610
RAC: 0
Message 25523 - Posted: 23 Apr 2013, 21:06:20 UTC - in response to Message 25521.  

Just a question. I checked the server status this morning and it said 307 tasks in progress and (0) users in the past 24 hrs. It might just take forever to finish the last 307 WUs. Can this be true? Pick
ID: 25523
jay

Joined: 10 Aug 07
Posts: 56
Credit: 831,474
RAC: 0
Message 25524 - Posted: 24 Apr 2013, 16:10:07 UTC

Greetings,
Is there a way to replicate a/the problem WU several thousand times, send them out for processing, and then check the results against similarities in computing platforms or floating point options?
I think it is useful to run as many tests as needed to rule out possibilities.

Wild thought: Is it possible to create a 'double' WU where a task runs the same WU data on the same platform twice? This *might* rule out a dependency on time.

Ask 10 people, get 20 answers. I'm sure you are swamped with well-meaning advice.

Best wishes.

Jay
ID: 25524
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 25526 - Posted: 26 Apr 2013, 5:22:10 UTC - in response to Message 25523.  

Me too; and I have noticed this before as well.
I put in 100 cases yesterday so there are some which
should finish over the weekend. Eric.
ID: 25526
jay

Joined: 10 Aug 07
Posts: 56
Credit: 831,474
RAC: 0
Message 25538 - Posted: 5 May 2013, 21:13:56 UTC - in response to Message 25526.  

Another wild thought.
The other admins of BOINC projects may have encountered this problem.
Do they have suggestions or 'lessons-learned'?

Thanks for all of your work !!
Jay
ID: 25538
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 25539 - Posted: 8 May 2013, 17:53:25 UTC

Thanks for all comments and advice. Maybe this goal is impossible
after all :-). I am hanging in there though. I suspect a bug or an
obscure problem with data alignment and the ifort compiler.
In the meantime I am pressing on to complete an intensity scan,
especially as most physicists are heading off to ICAP in Shanghai.
Eric.
ID: 25539
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 25602 - Posted: 25 May 2013, 10:52:25 UTC

Thanks for all the comments. I am running many many times the
failing cases on a machine at CERN to see what happens (an experiment!).
After that, and after a check of current BOINC statistics on failures,
I plan to submit 1 (maybe more) failing case to you 100,000 times
with a redundancy of 1 (so you get credit) and also so I can validate
results (rather than BOINC) and get more statistics. Remember even
a failed result with these cases has been replicated three times!
Failures appear random, but not totally. I feel a statistical analysis
is required. In the meantime I am completing an intensity scan (once
the server is back... :-( ). I have prepared an outline for a paper on
all this replication business which I hope to publish after yet more verification
of my results. Thanks for your help, patience and understanding. Eric.
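As an illustration of the bookkeeping such an experiment needs: if each returned result is reduced to a checksum over the words the validator compares (i.e. excluding CPU time and the like), a small helper can report how many distinct answers came back and how often each occurred. A hypothetical sketch, not project code, reading one checksum per line on stdin:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXLEN 64        /* enough for an md5/sha1 hex digest */
    #define MAXRES 200000    /* comfortably above the 100,000 replicas planned */

    static int cmp(const void *a, const void *b)
    {
        return strcmp((const char *)a, (const char *)b);
    }

    int main(void)
    {
        static char sum[MAXRES][MAXLEN];
        char line[512];
        int n = 0;

        /* One checksum per input line, e.g. the first field of md5sum output. */
        while (n < MAXRES && fgets(line, sizeof line, stdin))
            if (sscanf(line, "%63s", sum[n]) == 1) ++n;

        qsort(sum, n, MAXLEN, cmp);

        /* Group identical checksums: ideally one group of size n; every
         * extra group is a distinct (possibly anomalous) answer. */
        for (int i = 0; i < n; ) {
            int j = i;
            while (j < n && strcmp(sum[i], sum[j]) == 0) ++j;
            printf("%6d x %s\n", j - i, sum[i]);
            i = j;
        }
        return 0;
    }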

ID: 25602
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 25625 - Posted: 31 May 2013, 7:58:40 UTC

Well, after the last round of tests, desk checking, and a first look
at the statistics, I found a BUG in SixTrack, Mea Culpa. Clearly it does
not affect many cases but it could explain everything or nothing!
(For the record a Fortran routine was calling a C routine with 2
parameters when 3 were required. The normal check I do of matching
actual and dummy parameters does not cover calling C from Fortran.)
I am praying that this will explain the anomalous results. Keeping my
fingers crossed and thanks for your help and support. Eric.
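To illustrate why this class of bug slips through and then behaves erratically, here is a contrived C analogue with hypothetical names (not the actual SixTrack or post-processing routines). When the caller's declaration and the callee's definition live in separate files and disagree about the parameter count, each file compiles cleanly and the link succeeds, but the missing argument is read from whatever happens to be in that register or stack slot:

    /* callee.c -- the routine really takes three parameters. */
    double post_scale(double amp, double angle, int nturns)
    {
        return amp * angle / (double)nturns;
    }

    /* caller.c -- a stale two-parameter declaration.  This translation
     * unit compiles on its own, the link succeeds, and 'nturns' ends up
     * holding garbage that can depend on the surrounding code path, so
     * otherwise identical runs can give different answers. */
    double post_scale(double amp, double angle);

    double run_postprocessing(void)
    {
        return post_scale(2.0, 0.5);
    }

A Fortran caller with no explicit interface for a C routine gets no argument-count checking at all, which is exactly the gap described above; an explicit interface (for example via ISO_C_BINDING) lets the compiler catch the mismatch.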
ID: 25625
