Message boards : Number crunching : Invalid tasks
Joined: 28 Aug 12 Posts: 15 Credit: 500,336 RAC: 0
Hi, I have two invalid tasks and one inconclusive that will probably turn out to be invalid as well. They were downloaded at the same time onto two different computers that are connected to the same modem, and were returned at different times. Both computers have valid tasks before and after these, as well as valid non-LHC tasks.

Invalid:
http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=28733989
http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=28733914

Inconclusive:
http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=28733987

Could this be a download error? Is there any validation of a file after it has been downloaded?

Tom

Edit: I just realized that all three were calculated with v446.03, while the other computers used v446.05. However, I also have some workunits that show no discrepancy between v446.03 and v446.05.
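On the download question: as far as I know the BOINC client already checks each downloaded input file against the size and MD5 checksum the server advertises for it, so a corrupted download would normally be caught and re-fetched rather than run. A minimal sketch of that kind of check (the file name and the expected checksum below are placeholders, not real SixTrack values):

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 of a file in 1 MiB chunks so large inputs need not fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical example: placeholder file name and checksum.
expected = "0123456789abcdef0123456789abcdef"
if md5_of("fort.zip") != expected:
    print("checksum mismatch -- the download is corrupted and should be re-fetched")
```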
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Should be OK; the difference was just in the BOINC API, which should not affect the numerics. Looking into it. Eric.
Joined: 28 Aug 12 Posts: 15 Credit: 500,336 RAC: 0
Hi Eric, I got another one. Here, two v446.03 results kicked out the v446.05 one. http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13506238 Ha! I got 4 credits for that one as indemnification. Greetings, Tom
Joined: 28 Aug 12 Posts: 15 Credit: 500,336 RAC: 0
Hi Eric, here are two more:
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13524328
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13521217
It looks like there is a discrepancy between v446.03 and v446.05. Tom
Joined: 17 Jul 05 Posts: 102 Credit: 542,016 RAC: 0
... Looks like there is a discrepancy between v446.03 and v446.05.

From what I can see that is often (though not always) the case.

Edit: about the same rate as reported here, currently ~2% of my results. The figure isn't very exact, as results are purged quite soon after a workunit is completed.

Edit 2: I checked a few of my valid results and found a bunch of workunits where a pni/Windows result validated fine against my sse3/Windows one, but a Linux result was invalid. So it really must be a Linux vs. Windows issue.
Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0
I've rarely had any invalids, but I spotted this one where two different flavours of Linux (and hence 446.05) validated against each other but not against my Windows result (446.03). Could it be that when the first two returns are one Linux and one Windows, the workunit validates against whichever OS the third return comes from, leaving the odd one out as invalid? With more hosts running Windows, the Linux hosts would more often be the odd one out and therefore show more invalids. I currently don't have any inconclusives to test that, but I'll keep a look out; it could be pure coincidence.

Two WUs I have running that I'm watching:
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13720868
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=14175670

Both are inconclusive just now, as the returned results are one Linux 446.05 and one Windows 446.03. I will finish one in a couple of hours, but the other will be tomorrow. If both validate against the earlier Windows return and leave the Linux return invalid, then there is evidence to support what I said above.
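The odd-one-out idea is easy to check with a toy model. The sketch below assumes (pessimistically) that for an affected case Windows and Linux results never match each other but always match results from the same OS, and it assumes a Windows host fraction of 85%; both numbers are illustrative guesses, not project statistics.

```python
import random

def simulate(p_windows=0.85, n_workunits=100_000, seed=0):
    """Toy model: each workunit sends two copies; if their OSes differ, a third
    tie-breaking copy is sent and the minority-OS result is marked invalid."""
    rng = random.Random(seed)

    def pick_os():
        return "windows" if rng.random() < p_windows else "linux"

    invalid = {"windows": 0, "linux": 0}
    for _ in range(n_workunits):
        first, second = pick_os(), pick_os()
        if first == second:
            continue  # the initial quorum of two agrees; nothing is invalidated
        third = pick_os()
        loser = "linux" if third == "windows" else "windows"
        invalid[loser] += 1
    return invalid

# With ~85% Windows hosts, most of the invalids land on the Linux side.
print(simulate())
```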
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Thanks for the additional feedback. I am working on this, and have been for some time (I'm on vacation for two weeks right now and can't access Windows, though). The source code of these two versions is identical, so I have to believe this is an ifort Windows compiler issue. Further, the failure rate is low but significant, so it is likely a data-dependent compatibility issue; all my tests at CERN work fine, though. When I get back I shall pull the relevant executables from the BOINC server and rerun these cases again and again. My suspicion is definitely, as you suggest, a Linux/Windows difference.

Right now I am searching the executables to see if I can find the relevant compiler versions. I used to build the executables myself, but since my Windows XP machine died I have been forced onto Windows 7, and my colleagues build the Windows executables, which I then test. I would introduce new executables (Version 4508), but right now we have major problems with the Windows executables crashing with apparently the "same" BOINC API. For about two years I could not upgrade the ifort compiler due to result differences; total numeric compatibility is tough! I shall of course add one or more of these cases to my tests when the issue is resolved. (I am really motivated on this, as I cannot publish while such unexplained differences exist.)

My tests verify that results are identical for ifort O0, O1, O2 sse2, O2 sse3 and ia32, on Linux, Windows (and Mac). Sadly, the ifort versions on Linux and Windows are often significantly different. I don't want to change from ifort right now, as it would require much more testing of the API and investigation of performance. At least my current test suite finds the majority of problems. Down but not out. Eric.
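For what it's worth, the cross-build check Eric describes can be as simple as demanding bit-for-bit identity between the result files of every build of the same test case. A minimal sketch, with placeholder file names rather than the real test-suite layout:

```python
import filecmp
import itertools

# Hypothetical result files from the same input case under different builds.
results = {
    "ifort O0":      "res_O0.dat",
    "ifort O1":      "res_O1.dat",
    "ifort O2 sse2": "res_O2_sse2.dat",
    "ifort O2 sse3": "res_O2_sse3.dat",
    "ifort ia32":    "res_ia32.dat",
}

# Demand bit-for-bit identity between every pair of builds.
for (name_a, file_a), (name_b, file_b) in itertools.combinations(results.items(), 2):
    identical = filecmp.cmp(file_a, file_b, shallow=False)
    print(f"{name_a:14s} vs {name_b:14s}: {'identical' if identical else 'DIFFER'}")
```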
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Thanks Ray; my thoughts exactly. Eric.
Joined: 28 Aug 12 Posts: 15 Credit: 500,336 RAC: 0
Enjoy your vacation, you deserve it. The project has produced 2 billion credits since August 2012, 10 times more than in all the years before. Coincidentally, I joined in August 2012. I should have some vacation too. :-)
Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0
Thanks for that update, Eric. Enjoy your break. All my 10^5-turn and early-finishing tasks validate regardless of wingman OS; the problem only seems to show up with the longer 10^6-turn tasks that run to full term. As expected, both examples above validated Windows-Windows, leaving an invalid Linux result. I have two waiting for Linux wingmen which I expect to be inconclusive, and I would ideally like the third sending to go to another Linux host, leaving mine as the invalid one, to test this odd-one-out theory. I'm already the third sending for a couple of Win-Lin tasks, which will likely validate with the other Windows wingman, but it would be useful to see other examples where Lin-Lin-Win gives the invalid in the same way as Win-Win-Lin. I do, however, have a Windows-Linux validation here.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Thanks Tom; that is encouraging indeed. I am NOT giving up on this; there is always an explanation, but it is sometimes difficult to find! Eric.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Useful, Ray; that particular valid case appears to be nobb, i.e. no beam-beam interaction, which might be a clue. Eric.
Joined: 27 Sep 08 Posts: 830 Credit: 687,973,588 RAC: 173,817
I think I'm seeing the same as tgoti: I have invalids against Linux hosts and valid results against other Windows hosts. Check my computers for a whole bunch of such tasks.
Joined: 28 Aug 12 Posts: 15 Credit: 500,336 RAC: 0
Hi Eric, hi Ray, I had one or two 10^5-turn WUs that were ruled out, but the error might be 10 times more probable with WUs that run 10 times longer. I also had a bunch of tasks with a Win/Linux wingmanship that were not inconclusive at all. Would it be possible to take some of the "partially inconclusive" tasks and rerun them with fewer and fewer turns (a bisection method) to narrow down the error? It looks reproducible to me. Greetings, Tom. PS: At least we have vacation-like weather here.
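A sketch of how that bisection could be driven. The `run_case(turns, platform)` helper is hypothetical (a placeholder for whatever rerun mechanism is actually available), and the sketch assumes that once the results diverge they stay divergent at higher turn counts, which is what makes bisection applicable:

```python
def first_divergent_turn(run_case, lo=1, hi=1_000_000):
    """Binary search for the smallest number of turns at which two platforms disagree.

    `run_case(turns, platform)` is assumed to rerun the same input for `turns`
    turns on the given platform and return the raw result bytes.
    """
    def differs(turns):
        return run_case(turns, "linux") != run_case(turns, "windows")

    if not differs(hi):
        return None            # no divergence within the tested range
    while lo < hi:
        mid = (lo + hi) // 2
        if differs(mid):
            hi = mid           # divergence already present at `mid` turns
        else:
            lo = mid + 1       # still identical at `mid`, look higher
    return lo                  # smallest turn count showing a difference
```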
Joined: 9 Oct 10 Posts: 77 Credit: 3,671,357 RAC: 0
Hi Eric, I'm seeing the same behaviour as described in this thread. The error rate across my machines is quite low anyway (about 1%). After your holidays, I hope you'll be able to figure out what's happening :)
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Posting to the thread to inform everyone. I have managed to grab the input for a couple of the relevant cases and am running them right now on Linux at CERN. Sadly I cannot test Windows while on vacation. I am keeping an open mind, but I now reckon it is an ifort problem; we have had them before. We shall see. It is really strange that the version numbers are different between Linux and Windows, but that may well have been a compiler issue indeed. (By the way, I cannot understand why Intel has different code generation between Linux and Windows. A different OS interface, OK, but why different code generation, which is surely just hardware dependent!) I have appended some notes for your edification/amusement.

If it is not the compiler, I have a procedure whereby I identify the turn where the first difference arises, then the element, and then each step processing that element. All in hex, as the first difference is often only 1 ULP. Thanks for all the feedback. Eric.

Extract from notes:
SixTrack Version: 4.4.67 Eric -- Just testing gfortran O4 McIntosh 26th September, 2013
SixTrack Version: 4.4.66 Eric -- Just testing ifort 2013 with O1 (ia32, sse2 and sse3) McIntosh 20th September, 2013
SixTrack Version: 4.4.65 Eric -- Re-build with new boinc libs from server_stable (Riccardo) McIntosh 18th September, 2013
SixTrack Version: 4.4.64 Eric -- Just testing new ifort 2013 (usevnewifort) McIntosh 16th September, 2013
SixTrack Version: 4.4.63 Eric -- Just added a call to boincrf for fort.zip for Windows. McIntosh 31st August, 2013
SixTrack Version: 4.4.62 Eric -- Fixed nasty bug in daten iclr.eq.2 concerning exz in tracking input -- Fixing all "formatted read" for fio and _xp stuff. -- dabnew.s and sixtrack.s. if fio (but NOT Lahey lf95) then enable/disable _xp before/after the READ "NEAREST" fio overrules crlibm for formatted input McIntosh 29th September, 2013
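The "1 ULP in hex" part of that procedure is easy to illustrate. A minimal sketch that, given two sequences of doubles (say, the values dumped turn by turn on each platform; how the values are extracted is left open), reports the first position where they differ, the raw bit patterns in hex and the distance in units in the last place. The example values at the bottom are made up:

```python
import struct

def ordered_bits(x):
    """Map a finite double to an integer so that the mapping is monotonic in x;
    adjacent doubles map to adjacent integers."""
    u = struct.unpack("<Q", struct.pack("<d", x))[0]
    return u if u < 1 << 63 else (1 << 63) - u

def ulp_distance(a, b):
    """Number of representable doubles stepped through to get from a to b."""
    return abs(ordered_bits(a) - ordered_bits(b))

def first_difference(xs, ys):
    """Return (index, hex of a, hex of b, ULP distance) for the first position
    where the two sequences of doubles differ, or None if they agree."""
    for i, (a, b) in enumerate(zip(xs, ys)):
        if a != b:
            return i, struct.pack(">d", a).hex(), struct.pack(">d", b).hex(), ulp_distance(a, b)
    return None

# Made-up values: the second entries differ by exactly 1 ULP.
linux_vals   = [0.1, 0.30000000000000004, 2.5]
windows_vals = [0.1, 0.3,                 2.5]
print(first_difference(linux_vals, windows_vals))
```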
Joined: 15 Sep 13 Posts: 73 Credit: 5,763 RAC: 0
... (By the way, I cannot understand why Intel has different ...

Why not? Who in their right mind compares results from Linux with results from Windows to 6 decimal places of precision anyway? Well, OK, BOINCers do, but those who find the results differ between OSes just do the smart thing and run the app in a virtual machine. Your lifelong dream might have had some relevance 20 years ago, but nowadays it's about as relevant as building a better steam locomotive: nobody needs one, nobody wants one, nobody will use it in a serious application even if you give them one for free. Do the sensible thing, which is to merge this project with T4T. Let them issue the Linux app to all hosts, then find something useful to do with the rest of your life instead of wasting your talent on an anachronism. 6 bangers are for wussies.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Hi Henry; you have said this before and I am not going to spend time on a flame war. I'll just try to answer your questions when and if I manage to publish. You might read some of Prof. Kahan's papers. Why do we have standards? Why have different code generation? Intel wants to make a profit; what a waste of effort, and all the effort lost by users debugging. The 15, yes fifteen, decimal digits are required. We are very concerned about the accumulation of floating-point error and hope to study it very soon in SixTrack. Incidentally, games programmers are very interested in getting identical results across platforms too, though in fact many BOINC applications don't need it. Just to say I take your points and shall have to try to answer them. As I said before, virtual machines would help ME but do NOT solve the underlying hardware issues. Eric.
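A tiny illustration of why bit-for-bit identity is so fragile and why the last of those 15-16 significant digits matter here: merely summing the same numbers in a different order, which is exactly what happens when a compiler reassociates or vectorises a reduction differently on two platforms, already changes the final bits, and over millions of turns such differences can grow. This is a generic demonstration, not SixTrack code:

```python
import math
import random

random.seed(1)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

s_forward = sum(xs)            # plain left-to-right summation
s_reverse = sum(reversed(xs))  # same numbers, opposite order
s_exact   = math.fsum(xs)      # correctly rounded reference

print(f"forward : {s_forward:.17g}")
print(f"reverse : {s_reverse:.17g}")
print(f"fsum    : {s_exact:.17g}")
# The two orderings typically agree to only ~12-15 significant digits; a build
# that evaluates the reduction in a different order breaks bit-for-bit
# identity even though both answers are acceptably "correct".
```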