Message boards : Number crunching : Results discrepancies
Author | Message |
---|---|
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Well, here I am sitting by the pool, looking at the same problem I found while sitting here 4 months ago! Let me remind you I am/was running two very dense intensity scans, each consisting of a dozen or so studies, each with a different beam intensity, i.e. a bunch charge ranging from 0 up to that of 400,000,000,000 protons. In turn each study had up to 100,000 different cases with different initial amplitudes, angles, and magnet errors, each attempting one million turns. The result of each case is a file fort.10 (gzipped) containing one line of 60 double precision numbers for each pair of particles being tracked (typically 30 lines). The BOINC validation checks that 50 or so of these numbers are identical, excluding a few values like the SixTrack version number and the CPU time.

While re-running a few cases in a study to clear the tail of incomplete cases I found I had two VALIDATED results for one case which were DIFFERENT!!! For me this is a big problem and I have been really worried trying to find an explanation. The (very) good news is that the differences are small and that the physics results are not seriously affected. The bad news is that I have not (yet) found out how this can happen or the number of cases affected. In this particular case each of the 30 lines had 20 differences (the first 9 numbers on each line reflect the input and prove that it was identical in all runs). The numbers which are different agree only at roughly the level of single precision, BUT I expect and normally obtain 0 ULP difference in all cases, i.e. NO difference at all.

I remind you we run each case twice and accept results when we have two identical results, re-running as necessary until that is true. The odd man (men) out are rejected and are normally attributable to hardware error, e.g. an overclocked CPU. We assume that the probability of getting the same wrong answer twice is very very small and can be ignored. I now reran this case (known as PROBLEM) on various systems at CERN and with different compilers and always got the same result. I also reran the case on BOINC several times and also got the same answer. I also ran with my new SixTrack Version 4446 instead of 4441 and again got the same answer. With some help I checked the BOINC validation logs and looked for the two hosts which returned the same wrong result, but they seemed to be OK hosts (my detailed notes are in Geneva).

I then ran into problems with disk space, and the power supply on my desktop Linux box failed. However I had copied one scan to another Linux box and continued that intensity scan from there (WLXSCAN0) and suspended activity on WLSCAN2 with the PROBLEM. This scan WLXSCAN0 is now pretty much complete after adjustment of the initial amplitude ranges: lower bounds for high intensity studies, higher bounds for low intensity studies. As a check I took a subrange of initial amplitudes of one study and ran those cases again on BOINC. (Now my other deskside Linux box is down so I can't check the number of cases right now, but it is of the order of a few tens of thousands.)

Crosschecking now found 5 result discrepancies, which I call PROB1, PROB2, ..., PROB5. PROB1 has one line/pair different out of 30: words 12, 14, 19, 20, etc. PROB2 gives an I/O error because two bytes on one line which should be "00" are "o_"; this has probably been overwritten subsequent to BOINC validation and can be ignored as I reject it automatically. PROB3 has one word different, Word 10, on one line. PROB4 has ALL lines/pairs different from Word 10, 12, 13, 14 onwards. PROB5 has one line with one word, Word 10, different. So, all results apart from PROB2 are acceptable (and it is rejected so no problem) because the lost turn number is the same. Hence the physics should be OK. However I wonder if there are other cases I haven't double checked.

The number of errors detected is (very) small but is still a big worry. I have never seen this before. I have made a search for the hosts concerned but so far cannot find a common factor. I CANNOT reproduce the errors at CERN or on BOINC despite running these 5 cases many times. I (strongly) suspect an error in the SixTrack post-processing which has been changed recently. I also have ideas about the results depending on the date/time!!! Variable length strings might cause alignment problems leading to a different code path. Strange, but it has happened. I cannot reproduce the problems even with three other compilers here at CERN; all results are correct and identical down to the last bit. I am reluctant to rerun an entire study as I feel it is a "waste" of your computer time, but it may come to that in the end. I also get correct results with the latest SixTrack version 4446 which I may therefore install anyway. All this to keep you informed, if you are interested, and to explain my reluctance to finish off the intensity scans right now. So back to desk checking the post-processing and back to CERN on Wednesday.

Finally, if you have read or skipped to here, I quote from the latest press release from the DG: From rolf.heuer@cern.ch Thu Mar 14 10:30:39 2013. New results indicate that particle discovered at CERN is a Higgs boson. At the Moriond Conference today, the ATLAS and CMS collaborations at CERN's Large Hadron Collider (LHC) presented preliminary new results that further elucidate the particle discovered last year. Having analysed two and a half times more data than was available for the discovery announcement in July, they find that the new particle is looking more and more like a Higgs boson, the particle linked to the mechanism that gives mass to elementary particles. It remains an open question, however, whether this is the Higgs boson of the Standard Model of particle physics, or possibly the lightest of several bosons predicted in some theories that go beyond the Standard Model. Finding the answer to this question will take time. etc etc etc. The detection of the boson is a very rare event - it takes around 1 trillion (10^12) proton-proton collisions for each observed event. To characterize all of the decay modes will require much more data from the LHC. |
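A minimal sketch, in Fortran, of the kind of bit-for-bit comparison described in the post above: read the same fort.10 line from two results and demand 0 ULP difference on the compared words. This is not the actual BOINC validator; the word positions excluded from the comparison (standing in for the SixTrack version number and CPU time) and the file names are invented here purely for illustration.

```fortran
program compare_fort10
  implicit none
  integer, parameter :: dp = selected_real_kind(15)
  integer, parameter :: i8 = selected_int_kind(18)
  ! Words excluded from the comparison (e.g. version number, CPU time);
  ! these indices are a guess, for illustration only.
  integer, parameter :: iskip(2) = (/ 52, 60 /)
  real(dp)    :: a(60), b(60)
  integer(i8) :: ia, ib
  integer     :: i, ndiff

  ! One line of 60 double precision words per particle pair, one file per result
  open(10, file='fort.10.first',  status='old')
  open(11, file='fort.10.second', status='old')
  read(10, *) a
  read(11, *) b

  ndiff = 0
  do i = 1, 60
     if (any(iskip == i)) cycle
     ia = transfer(a(i), ia)   ! compare the raw bit patterns:
     ib = transfer(b(i), ib)   ! 0 ULP difference means identical integers
     if (ia /= ib) then
        ndiff = ndiff + 1
        write(*, '(a,i3)') ' mismatch at word ', i
     end if
  end do
  if (ndiff == 0) write(*, *) 'line identical to the last bit'
end program compare_fort10
```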
Joined: 14 Dec 06 Posts: 29 Credit: 128,225 RAC: 0 |
If re-running the study will improve the science, I don't consider it 'wasting' my CPU time. I'm in! |
Joined: 27 Sep 08 Posts: 850 Credit: 692,713,859 RAC: 95,524 |
Is there a way to improve the error checking to ensure that any results that don't match are caught? My thought is that the root cause is a PC error. If it's a slow WU then the probability of something going wrong is bigger, and hence it shows up in the tail? |
Joined: 19 Apr 10 Posts: 2 Credit: 2,394,868 RAC: 0 |
"We assume that the probability of getting the same wrong answer twice is very very small and can be ignored." What about the results from cpu releases with known design errors, which were resolved in a next step, but are still in use and produces errors in just a few small situations? Voltage Peaks at the same time in two different countries are possible, too. Most of the personal computers doesn´t have redundant components like ecc ram, ups etc. If you need reliable results, i guess it is necessary to send tasks out to more than only two hosts. There also exists other projects in the distributing computing world which also need reliable results and the scientists there discover the necessarity to send one task up to eighteen (!) hosts at the same time. Sorry for my bad english.I hope you don´t mind it. It is not my native language. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
You are correct and I am aware of these possibilities..... Running 3 times seems a bit much though. Eric. |
Joined: 27 Sep 08 Posts: 850 Credit: 692,713,859 RAC: 95,524 |
Surely, running 3 times to ensure that the science has the correct results is worthwhile? Even if it's a temporary measure until you understand why there were some issues. I assume you can add some additional logging to understand the issues? |
Joined: 14 Dec 06 Posts: 29 Credit: 128,225 RAC: 0 |
I concur Toby - it's worth it. |
Joined: 25 Aug 05 Posts: 69 Credit: 306,627 RAC: 0 |
I will think about this before giving my opinion... I started to write something and then stopped. Better to sleep on it first this late at night before speaking up. Christoph |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Just a brief update; I took advantage of the Easter break to rerun part of a study with a replication factor of three, requesting that three results match. Still found five result discrepancies, out of 30,000! I must have a problem but I cannot replicate it at CERN. I am still analysing the database and looking at hosts and also for rejected results. I am also trying to figure out how to use some volunteers to help me test. Sorry for the delay. (May start some production to finish an intensity scan as the anomalies are so few in number.) Eric. |
Joined: 19 Apr 10 Posts: 2 Credit: 2,394,868 RAC: 0 |
I'm keeping my fingers crossed tightly that a solution will be found very soon! Thank you for keeping us up to date. |
Joined: 1 Dec 05 Posts: 62 Credit: 11,441,610 RAC: 0 |
Just a question. I checked the server status this morning and it said 307 tasks in progress and (0) users in the past 24 hrs. It just might take forever to finish the last 307 WUs. Can this be true? Pick |
Joined: 10 Aug 07 Posts: 56 Credit: 831,474 RAC: 0 |
Greetings, Is there a way to replicate a/the problem WU several thousand times, send them out for processing, and then check results against similarities in computing platforms or floating point options? I think it is useful to run as many tests as needed to rule out possibilities. Wild thought: is it possible to create a 'double' WU where a task runs the same WU data on the same platform twice? This *might* rule out a dependency on time. Ask 10 people, get 20 answers. I'm sure you are swamped with well-meaning advice. Best wishes. Jay |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Me too; and I have noticed this before as well. I put in 100 cases yesterday so there are some which should finish over the weekend. Eric. |
Joined: 10 Aug 07 Posts: 56 Credit: 831,474 RAC: 0 |
Another wild thought. The other admins of BOINC projects may have encountered this problem. Do they have suggestions or 'lessons-learned'? Thanks for all of your work !! Jay |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Thanks for all comments and advice. Maybe this goal is impossible after all :-). I am hanging in there though. I suspect a bug or an obscure problem with data alignment and the ifort compiler. In the meantime I am pressing on to complete an intensity scan, especially as most physicists are heading off to ICAP in Shanghai. Eric. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Thanks for all the comments. I am running the failing cases many, many times on a machine at CERN to see what happens (an experiment!). After that, and after a check of the current BOINC statistics on failures, I plan to submit one (maybe more) of the failing cases to you 100,000 times with a redundancy of 1 (so you get credit), and also so that I can validate the results myself (rather than BOINC) and get more statistics. Remember, even a failed result with these cases has been replicated three times! Failures appear random, but not totally. I feel a statistical analysis is required. In the meantime I am completing an intensity scan (once the server is back... :-(). I have prepared an outline for a paper on all this replication business which I hope to publish after yet more verification of my results. Thanks for your help, patience and understanding. Eric. |
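As a rough illustration of the statistics such a 100,000-replica experiment could feed, here is a minimal sketch that estimates the per-run failure probability from the number of anomalous results in a batch of identical replicas. The counts below are placeholders, not real data, and the normal approximation is only an order-of-magnitude check for counts this small.

```fortran
program failure_rate
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Placeholder counts, not real data: n replicas of one failing case,
  ! k anomalous results returned.
  integer, parameter :: n = 100000
  integer, parameter :: k = 5
  real(dp) :: p, half_width

  p = real(k, dp) / real(n, dp)
  ! Normal-approximation 95% interval; for counts this small a Poisson or
  ! exact binomial interval would be more honest, this is only a rough check.
  half_width = 1.96d0 * sqrt(p * (1.0d0 - p) / real(n, dp))
  write(*, '(a,es10.3,a,es10.3)') ' estimated per-run failure rate ', p, ' +/- ', half_width
end program failure_rate
```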
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Well, after the last round of tests, desk checking, and a first look at the statistics, I found a BUG in SixTrack, Mea Culpa. Clearly it does not affect many cases but it could explain everything or nothing! (For the record a Fortran routine was calling a C routine with 2 parameters when 3 were required. The normal check I do of matching actual and dummy parameters does not cover calling C from Fortran.) I am praying that this will explain the anomalous results. Keeping my fingers crossed and thanks for your help and support. Eric. |
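For anyone curious how such a call slips through, the sketch below (with invented routine and argument names, not the actual SixTrack code) shows why a Fortran call to a C routine with a missing argument is invisible to the compiler when the routine is only declared EXTERNAL, and how an explicit BIND(C) interface turns the mismatch into a compile-time error.

```fortran
! Invented names, for illustration only. With just
!     external postpr_helper
! a call such as
!     call postpr_helper(n, x)        ! third argument missing
! compiles silently, and the C routine reads an undefined value for its
! third parameter. An explicit interface lets the compiler check the call:
module c_interfaces
  use iso_c_binding
  implicit none
  interface
     subroutine postpr_helper(n, x, ierr) bind(c, name='postpr_helper')
       import :: c_int, c_double
       integer(c_int), value :: n
       real(c_double)        :: x(*)
       integer(c_int)        :: ierr
     end subroutine postpr_helper
  end interface
end module c_interfaces
! A caller that does "use c_interfaces" and omits ierr now gets a
! compile-time error instead of a silently corrupted result.
```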