Thread 'Invalid tasks'

Author	Message
tom310 Send message Joined: 28 Aug 12 Posts: 15 Credit: 500,336 RAC: 0	Message 26219 - Posted: 26 Feb 2014, 19:21:36 UTC Last modified: 26 Feb 2014, 20:16:32 UTC Hi, I have two invalid tasks and one inconclusive, that probably turns out to be invalid. They were downloaded at the same time on two different computers, that are connected to the same modem. The tasks were returned at different times. Both computers have valid tasks before and after that as well as valid non-LHC tasks. invalid http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=28733989 http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=28733914 inconclusive http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=28733987 Could this be a download error? Is there any validation of a file after it was downloaded? Tom Edit: I just realized that all three are calculated using v466.03, while the other computers used v466.05. But also I have some workunits that show no discrepancy between v466.03 and v466.05. ID: 26219 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26222 - Posted: 27 Feb 2014, 16:55:11 UTC - in response to Message 26219. Should be OK; difference was just in BOINC API which should not affect the numerics. Looking into it. Eric. ID: 26222 · Reply Quote

tom310 Send message Joined: 28 Aug 12 Posts: 15 Credit: 500,336 RAC: 0	Message 26224 - Posted: 27 Feb 2014, 20:08:45 UTC - in response to Message 26222. Hi Eric, I got another one. Here two v446.03 kicked out the v446.05. http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13506238 Ha! I got 4 credits for that one as indemnification. Greetings Tom ID: 26224 · Reply Quote

tom310 Send message Joined: 28 Aug 12 Posts: 15 Credit: 500,336 RAC: 0	Message 26225 - Posted: 28 Feb 2014, 19:03:59 UTC - in response to Message 26222. Last modified: 28 Feb 2014, 19:04:21 UTC Hi Eric, here are two more http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13524328 http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13521217 Looks like there is a discrepancy between v466.03 and v466.05. Tom ID: 26225 · Reply Quote

Ananas Send message Joined: 17 Jul 05 Posts: 102 Credit: 542,016 RAC: 0	Message 26226 - Posted: 4 Mar 2014, 5:38:42 UTC - in response to Message 26225. Last modified: 4 Mar 2014, 6:02:38 UTC ... Looks like there is a discrepancy between v466.03 and v466.05. Tom From what I can see it's often (not always) ~~pni vs. sse3~~ Linux vs. Windows, even if both have the same minor release number. Both x86 and x64 are affected, the result duration plays no role, it happens with very short and with longer results. edit : about the same rate as reported here, currently ~2% for my results. The value isn't very exact as the results are purged quite soon after the workunit is completed. edit2 : I checked a few of my valid results and found a bunch of workunits where a pni/windows validated fine against my sse3/windows but a linux result was invalid. So it must really be a Linux vs. Windows issue. ID: 26226 · Reply Quote

Ray Murray Volunteer moderator Send message Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 26236 - Posted: 7 Mar 2014, 19:50:17 UTC Last modified: 7 Mar 2014, 20:30:35 UTC I've rarely had any Invalids but spotted this one where 2 different flavours of Linux (and hence 446.05) validated against each other but not against my Windows (446.03). Could it be that where the first 2 returns are one Linux and one Windows, the wu will validate against whichever OS the third return is, leaving the odd one as Invalid? Possibly with more hosts being Windows, the Linux hosts will more often be the odd-one-out and therefore show more Invalids. I currently don't have any Inconclusives to test that but I'll keep a look out. Could be pure coincidence. 2 WUs I have running for me to watch: http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13720868 http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=14175670 Both Inconclusive just now as returned results are 1 Linux 446.05 and 1 Windows 446.03. I will finish 1 in a couple of hours but the other will be 2moro. If both validate against the earlier Windows return and leave the Linux return Invalid then there's evidence to support what I said above. ID: 26236 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26238 - Posted: 8 Mar 2014, 3:25:16 UTC Thanks for the additional feedback. I am working on this, and have been for some time. (On vacation for two weeks right now and can't access Windows though.) The source code of these two versions is identical.....i have to believe that this is an ifort Windows compiler issue. Further the failure rate is low, but significant, so it is likely a data dependent compatibility issue. All my tests at CERN work fine though. When I get back I shall pull the relevant executables from the BOINC server and rerun these cases again and again. My suspicion is definitely, as you suggest, a Linux/Windows difference. Right now I am searching the executables to see if I can find the relevant compiler versions. I used to build the executables myself but since my Windows XP machine died I have been forced to Windows 7 and my colleagues build the Windows executables which I then test. I would introduce new executables Version 4508 but right now we have major problems with the Windows executables crashing with apparently the "same" BOINC API. For about two years i could not upgrade the ifort compiler due to results differences.....total numeric compatibility is tough! I shall of course add one or more of these cases to my tests when the issue is resolved. (I am really motivated on this as I cannot publish while such unexplained differences exist.) My tests verify that results are identical for ifort O0, O1, O2 sse2, and O2 sse3 and ia32 on Linux, Windows (and MAC). Sadly ifort on Linux and Windows versions are often significantly different. I don't want to change from ifort right now as it would require much more testing of the API and investigation of performance. At least my current test suite finds the majority of problems. Down but not out. Eric. ID: 26238 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26239 - Posted: 8 Mar 2014, 3:25:53 UTC - in response to Message 26236. Thanks Ray; my thoughts exactly. Eric. ID: 26239 · Reply Quote

tom310 Send message Joined: 28 Aug 12 Posts: 15 Credit: 500,336 RAC: 0	Message 26240 - Posted: 8 Mar 2014, 4:14:26 UTC - in response to Message 26238. Enjoy your vacation, you deserve it. The project has produced 2 billion credits since August 2012, 10 times more than in all the years before. Coincidently, I joined in August 2012. I should have some vacation too. :-) ID: 26240 · Reply Quote

Ray Murray Volunteer moderator Send message Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 26241 - Posted: 8 Mar 2014, 8:56:55 UTC - in response to Message 26238. Last modified: 8 Mar 2014, 9:10:50 UTC Thanks for that update, Eric. Enjoy your break. All my 10^5 turn, and early finishing, tasks validate regardless of wingman OS and the problem only seems to show up with the longer 10^6 turn tasks that run to full term. As expected, both examples above validated Windows-Windows leaving an Invalid Linux. I have 2 waiting for Linux wingmen which I expect to be inconclusive and would ideally like the 3rd sending to go to another Linux host to leave mine as the Invalid one to test this odd-one-out theory. I'm already the 3rd sending for a couple of Win-Lin tasks which will likely validate with the other Windows wingman but it would be useful to see other examples where Lin-Lin-Win gives the invalid in the same way as Win-Win-Lin. I do, however have a Windows-Linux validation here. ID: 26241 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26242 - Posted: 8 Mar 2014, 9:13:39 UTC - in response to Message 26240. Thanks Tom; that is encouraging indeed. I am NOT giving up on this; there is always an explication but sometimes difficult to find! Eric. ID: 26242 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26243 - Posted: 8 Mar 2014, 9:15:26 UTC - in response to Message 26241. Useful Ray; that particular valid case appears to be nobb i.e.no beam-beam interaction which might be a clue. Eric. ID: 26243 · Reply Quote

Ray Murray Volunteer moderator Send message Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 26244 - Posted: 8 Mar 2014, 10:01:21 UTC - in response to Message 26243. Last modified: 8 Mar 2014, 10:06:03 UTC Just so you don't go looking down the wrong track, Eric; here, and here are just a couple where I'm the 3rd sending for nobb jobs that failed to validate Win-Lin. Anyway, put that keyboard away. You're supposed to be on holiday 8Â¬) [corrected links] ID: 26244 · Reply Quote

tgoti Send message Joined: 29 Dec 11 Posts: 1 Credit: 1,891,171 RAC: 0	Message 26245 - Posted: 8 Mar 2014, 11:20:47 UTC - in response to Message 26244. Just to add to this thread, I have two which where flagged invalid because the Linux where winning. (here and here). And two more coming up where one will be probably correct and the other again invalid since it is checked against a Linux system. ID: 26245 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 899 Credit: 773,037,495 RAC: 211,323	Message 26249 - Posted: 9 Mar 2014, 1:03:17 UTC I think I'm seeing the same as tgoti, I have invalids vs Linux host and OK against other windows hosts. Check my computers for a whole bunch of tasks. ID: 26249 · Reply Quote

tom310 Send message Joined: 28 Aug 12 Posts: 15 Credit: 500,336 RAC: 0	Message 26259 - Posted: 10 Mar 2014, 22:08:45 UTC - in response to Message 26241. Hi Eric, hi Ray, I had one or two 10^5 WUs that were ruled out, but the error might have a 10 times higher probability with 10 times longer WUs. I also had a bunch of non-inconclusive tasks with a Win/Linux wingmanship. Is it possible to take some of the "partially inconclusive" tasks and rerun them with fewer and fewer turns (bisection method) to narrow down the error? It looks reproducible to me. Greetings Tom PS: At least we have vacation-like weather here. ID: 26259 · Reply Quote

[AF>FAH-Addict.net]toTOW Send message Joined: 9 Oct 10 Posts: 77 Credit: 3,727,865 RAC: 0	Message 26263 - Posted: 11 Mar 2014, 9:38:04 UTC Hi Eric, I'm seeing the same behaviour as described in this thread. The error rate accross my machines is quite low anyway (about 1%). After your holidays, I hope you'll be able to figure out what happens :) ID: 26263 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26264 - Posted: 12 Mar 2014, 4:34:40 UTC Posting to thread to inform everyone. I have managed to grab the input for a couple of the relevant cases. I am running them right now on Linux at CERN. Sadly I cannot test Windows while on vacation. I am keeping an open mind but I now reckon an ifort problem. Had them before. We shall see. It is really strange that the version numbers are different between Linux and Windows, but that may well have been a compiler issue indeed. (By the way I cannot undrestand why Intel has different Code generation between Linux/Windows. A different OS interface OK....but why code generation which is surely just hardware dependent!) I have appended some notes for your edification/amusement. If it is not the compiler I have a procedure whereby I identify the turn where the first diff arises, then the element, and then each step processing that element. All in HEX as the first difference is often only 1 ULP. Thanks for all the feedback. Eric. Extract from notes: SixTrack Version: 4.4.67 Eric -- Just testing gfortran O4 McIntosh 26th September, 2013 SixTrack Version: 4.4.66 Eric -- Just testing ifort 2013 with O1 (ia32, sse2 and sse3) McIntosh 20th September, 2013 SixTrack Version: 4.4.65 Eric -- Re-build with new boinc libs from server_stable (Riccardo) McIntosh 18th September, 2013 SixTrack Version: 4.4.64 Eric -- Just testing new ifort 2013 (usevnewifort) McIntosh 16th September, 2013 SixTrack Version: 4.4.63 Eric -- Just added a call to boincrf for fort.zip for Windows. McIntosh 31st August, 2013 SixTrack Version: 4.4.62 Eric -- Fixed nasty bug in daten iclr.eq.2 concerning exz in tracking input -- Fixing all "formatted read" for fio and _xp stuff. -- dabnew.s and sixtrack.s. if fio (but NOT Lahey lf95) then enable/disable _xp before/after the READ "NEAREST" fio overrules crlibm for formatted input McIntosh 29th September, 2013 ID: 26264 · Reply Quote

henry Send message Joined: 15 Sep 13 Posts: 73 Credit: 5,763 RAC: 0	Message 26267 - Posted: 12 Mar 2014, 11:56:33 UTC - in response to Message 26264. Last modified: 12 Mar 2014, 11:57:26 UTC (By the way I cannot undrestand why Intel has different Code generation between Linux/Windows.) Why not? Who in their right mind compares results from Linux with results from Windows to 6 decimal places of precision anyway? Well, OK, BOINCers do but those who find the results differ between OSs just do the smart thing and run the app in a virtual machine. Your lifelong dream might have had some relevance 20 years ago but nowadays it's about as relevant as building a better steam locomotive... nobody needs one, nobody wants one, nobody will use it in a serious application even if you give them one for free. Do the sensible thing which is to merge this project with T4T. Let them issue the Linux app to all hosts then find something useful to do with the rest of your life instead of wasting your talent on an anachronism. 6 bangers are for wussies. ID: 26267 · Reply Quote

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26268 - Posted: 12 Mar 2014, 12:39:43 UTC - in response to Message 26267. Hi Henry; you have said this before and I am not going to spend time in a flame war. I'll just try and answer your questions when and if I manage to publish. You might read some of Prof. Kahan's papers. Why do we have standards? Why have different code generation? Intel wants to make a profit; what a waste of effort. And all the effort lost by users debugging. The 15, yes fifteen, decimal digits are required. We are very concerned about the accumulation of floating-point error and hope to study it very soon in SixTrack. Incidentally games programmers are very interested in getting identical results across platforms too, but in fact many BOINC applications don't need it. Just to say I take your points and I shall have to try and answer them. As I said before virtual machines would help ME but do NOT solve underlying hardware issues. Eric. ID: 26268 · Reply Quote