Message boards : Number crunching : Invalid tasks

tom310

Joined: 28 Aug 12
Posts: 15
Credit: 500,336
RAC: 0
Message 26219 - Posted: 26 Feb 2014, 19:21:36 UTC
Last modified: 26 Feb 2014, 20:16:32 UTC

Hi,

I have two invalid tasks and one inconclusive task that will probably turn out to be invalid. They were downloaded at the same time on two different computers that are connected to the same modem. The tasks were returned at different times. Both computers have valid tasks before and after that, as well as valid non-LHC tasks.

invalid
http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=28733989
http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=28733914

inconclusive
http://lhcathomeclassic.cern.ch/sixtrack/result.php?resultid=28733987

Could this be a download error? Is a file validated in any way after it has been downloaded?

Tom

Edit: I just realized that all three were calculated using v446.03, while the other computers used v446.05. However, I also have some workunits that show no discrepancy between v446.03 and v446.05.
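
On the question of validating a file after download: a generic way to rule out a corrupted download is to compare a checksum of the local file with a value published by the server. The sketch below only illustrates that idea and is not the BOINC client's actual code; the function names `file_sha256` and `download_is_intact` and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large input files need little memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path, expected_digest):
    """True if the local file matches the digest published by the server.

    `expected_digest` is a hypothetical value supplied alongside the file.
    """
    return file_sha256(path) == expected_digest.lower()
```

If two computers both passed such a check on the same input files, a download error would be an unlikely explanation for the invalid results.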
ID: 26219
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26222 - Posted: 27 Feb 2014, 16:55:11 UTC - in response to Message 26219.  

Should be OK; the difference was just in the BOINC API, which should
not affect the numerics. Looking into it. Eric.
ID: 26222
tom310

Joined: 28 Aug 12
Posts: 15
Credit: 500,336
RAC: 0
Message 26224 - Posted: 27 Feb 2014, 20:08:45 UTC - in response to Message 26222.  

Hi Eric,

I got another one. Here two v446.03 kicked out the v446.05.

http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13506238

Ha! I got 4 credits for that one as indemnification.

Greetings
Tom
ID: 26224
tom310

Joined: 28 Aug 12
Posts: 15
Credit: 500,336
RAC: 0
Message 26225 - Posted: 28 Feb 2014, 19:03:59 UTC - in response to Message 26222.  
Last modified: 28 Feb 2014, 19:04:21 UTC

Hi Eric, here are two more

http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13524328
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13521217

Looks like there is a discrepancy between v446.03 and v446.05.

Tom
ID: 26225
Ananas

Joined: 17 Jul 05
Posts: 102
Credit: 542,016
RAC: 0
Message 26226 - Posted: 4 Mar 2014, 5:38:42 UTC - in response to Message 26225.  
Last modified: 4 Mar 2014, 6:02:38 UTC

... Looks like there is a discrepancy between v446.03 and v446.05.

Tom

From what I can see, it is often (not always) pni vs. sse3, Linux vs. Windows, even if both have the same minor release number. Both x86 and x64 are affected. The result duration plays no role: it happens with very short and with longer results.

edit: about the same rate as reported here, currently ~2% for my results. The value isn't very exact, as the results are purged quite soon after the workunit is completed.

edit2 : I checked a few of my valid results and found a bunch of workunits where a pni/windows validated fine against my sse3/windows but a linux result was invalid. So it must really be a Linux vs. Windows issue.
ID: 26226
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 26236 - Posted: 7 Mar 2014, 19:50:17 UTC
Last modified: 7 Mar 2014, 20:30:35 UTC

I've rarely had any Invalids but spotted this one where 2 different flavours of Linux (and hence 446.05) validated against each other but not against my Windows (446.03). Could it be that where the first 2 returns are one Linux and one Windows, the wu will validate against whichever OS the third return is, leaving the odd one as Invalid? Possibly with more hosts being Windows, the Linux hosts will more often be the odd-one-out and therefore show more Invalids.
I currently don't have any Inconclusives to test that but I'll keep a look out. Could be pure coincidence.

2 WUs I have running for me to watch:
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=13720868
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=14175670
Both are Inconclusive just now, as the returned results are 1 Linux 446.05 and 1 Windows 446.03. I will finish one in a couple of hours but the other will be tomorrow. If both validate against the earlier Windows return and leave the Linux return Invalid, then there's evidence to support what I said above.
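
The odd-one-out idea can be sketched as a toy model. It assumes, hypothetically, that results from the same OS always agree, that Windows and Linux results always disagree on an affected workunit, and that the third task goes to a Windows host with probability `p_windows`. None of these numbers come from the project.

```python
import random


def invalid_os(first, second, third):
    """Return the OS of the result left Invalid, or None if there is none.

    Toy assumption: two results match only if they come from the same OS.
    """
    results = [first, second, third]
    odd = [name for name in set(results) if results.count(name) == 1]
    # Exactly one singleton means a 2-against-1 vote; anything else is
    # either full agreement or no majority at all.
    return odd[0] if len(odd) == 1 else None


def linux_invalid_rate(p_windows, trials=100_000, seed=1):
    """Estimate how often Linux is the odd one out after a Win/Lin split."""
    rng = random.Random(seed)
    linux_invalid = 0
    for _ in range(trials):
        third = "Windows" if rng.random() < p_windows else "Linux"
        if invalid_os("Windows", "Linux", third) == "Linux":
            linux_invalid += 1
    return linux_invalid / trials
```

Under these assumptions, if 80% of third sendings go to Windows hosts, the Linux result is the Invalid one in roughly 80% of the split cases, simply because Linux hosts are outnumbered.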
ID: 26236
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26238 - Posted: 8 Mar 2014, 3:25:16 UTC

Thanks for the additional feedback. I am working on this,
and have been for some time. (I am on vacation for two weeks
right now and can't access Windows, though.) The source code
of these two versions is identical, so I have to believe that this
is an ifort Windows compiler issue. Further, the failure rate is
low but significant, so it is likely a data-dependent compatibility issue.
All my tests at CERN work fine, though. When I get back I shall
pull the relevant executables from the BOINC server and rerun
these cases again and again. My suspicion is definitely, as you
suggest, a Linux/Windows difference. Right now I am searching the
executables to see if I can find the relevant compiler versions.

I used to build the executables myself, but since my Windows XP
machine died I have been forced onto Windows 7, and my colleagues
build the Windows executables, which I then test. I would introduce
new executables (Version 4508), but right now we have major problems
with the Windows executables crashing with apparently the "same"
BOINC API. For about two years I could not upgrade the ifort compiler
due to differences in results... total numeric compatibility is tough!

I shall of course add one or more of these cases to my tests when
the issue is resolved. (I am really motivated on this, as I cannot publish
while such unexplained differences exist.) My tests verify that results
are identical for ifort O0, O1, O2 sse2, and O2 sse3 and ia32 on Linux,
Windows (and Mac). Sadly, the ifort versions on Linux and Windows are
often significantly different. I don't want to change from ifort right now,
as it would require much more testing of the API and investigation of
performance. At least my current test suite finds the majority of
problems. Down but not out. Eric.


ID: 26238
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26239 - Posted: 8 Mar 2014, 3:25:53 UTC - in response to Message 26236.  

Thanks Ray; my thoughts exactly. Eric.
ID: 26239
tom310

Joined: 28 Aug 12
Posts: 15
Credit: 500,336
RAC: 0
Message 26240 - Posted: 8 Mar 2014, 4:14:26 UTC - in response to Message 26238.  

Enjoy your vacation, you deserve it. The project has produced 2 billion credits since August 2012, 10 times more than in all the years before. Coincidentally, I joined in August 2012. I should have some vacation too.
:-)
ID: 26240
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 26241 - Posted: 8 Mar 2014, 8:56:55 UTC - in response to Message 26238.  
Last modified: 8 Mar 2014, 9:10:50 UTC

Thanks for that update, Eric. Enjoy your break.

All my 10^5 turn, and early finishing, tasks validate regardless of wingman OS and the problem only seems to show up with the longer 10^6 turn tasks that run to full term.
As expected, both examples above validated Windows-Windows leaving an Invalid Linux.
I have 2 waiting for Linux wingmen which I expect to be inconclusive and would ideally like the 3rd sending to go to another Linux host to leave mine as the Invalid one to test this odd-one-out theory. I'm already the 3rd sending for a couple of Win-Lin tasks which will likely validate with the other Windows wingman but it would be useful to see other examples where Lin-Lin-Win gives the invalid in the same way as Win-Win-Lin.

I do, however, have a Windows-Linux validation here.
ID: 26241
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26242 - Posted: 8 Mar 2014, 9:13:39 UTC - in response to Message 26240.  

Thanks Tom; that is encouraging indeed.
I am NOT giving up on this; there is always an
explanation, but it is sometimes difficult to find!

Eric.
ID: 26242
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26243 - Posted: 8 Mar 2014, 9:15:26 UTC - in response to Message 26241.  

Useful, Ray; that particular valid case appears to be nobb, i.e. no
beam-beam interaction, which might be a clue. Eric.
ID: 26243
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 26244 - Posted: 8 Mar 2014, 10:01:21 UTC - in response to Message 26243.  
Last modified: 8 Mar 2014, 10:06:03 UTC

Just so you don't go looking down the wrong track, Eric: here and here are just a couple where I'm the 3rd sending for nobb jobs that failed to validate Win-Lin.

Anyway, put that keyboard away. You're supposed to be on holiday 8¬)


[corrected links]
ID: 26244
tgoti

Joined: 29 Dec 11
Posts: 1
Credit: 1,891,171
RAC: 0
Message 26245 - Posted: 8 Mar 2014, 11:20:47 UTC - in response to Message 26244.  

Just to add to this thread, I have two which were flagged invalid because the Linux hosts were winning (here and here).

And two more are coming up where one will probably be correct and the other again invalid, since it is checked against a Linux system.
ID: 26245
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 847
Credit: 691,233,003
RAC: 105,610
Message 26249 - Posted: 9 Mar 2014, 1:03:17 UTC

I think I'm seeing the same as tgoti: I have invalids against Linux hosts and OK results against other Windows hosts.

Check my computers for a whole bunch of tasks.
ID: 26249
tom310

Joined: 28 Aug 12
Posts: 15
Credit: 500,336
RAC: 0
Message 26259 - Posted: 10 Mar 2014, 22:08:45 UTC - in response to Message 26241.  

Hi Eric, hi Ray,

I had one or two 10^5 WUs that were ruled out, but the error might have a 10 times higher probability with 10 times longer WUs. I also had a bunch of non-inconclusive tasks with a Win/Linux wingmanship.
Is it possible to take some of the "partially inconclusive" tasks and rerun them with fewer and fewer turns (bisection method) to narrow down the error? It looks reproducible to me.
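
The bisection idea could look roughly like this. It assumes a hypothetical callback `results_match(turns)` that reruns the workunit for the given number of turns on both platforms and reports whether the outputs are identical. It also assumes that once the results diverge they stay different for every larger turn count.

```python
def first_diverging_turn(results_match, max_turns):
    """Smallest turn count at which the two platforms disagree, or None.

    `results_match(turns)` is a hypothetical callback: it reruns the case for
    `turns` turns on both platforms and returns True if the outputs agree.
    Assumes divergence is permanent once it has appeared.
    """
    if results_match(max_turns):
        return None  # no discrepancy within the tested range
    # Invariant: results match at `low` turns and differ at `high` turns
    # (zero turns is taken to match trivially).
    low, high = 0, max_turns
    while high - low > 1:
        mid = (low + high) // 2
        if results_match(mid):
            low = mid
        else:
            high = mid
    return high
```

For a 10^6 turn task this needs about 20 reruns rather than a turn-by-turn comparison of the whole run.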

Greetings
Tom

PS: At least we have vacation-like weather here.
ID: 26259
[AF>FAH-Addict.net]toTOW

Joined: 9 Oct 10
Posts: 77
Credit: 3,671,357
RAC: 0
Message 26263 - Posted: 11 Mar 2014, 9:38:04 UTC

Hi Eric,

I'm seeing the same behaviour as described in this thread. The error rate across my machines is quite low anyway (about 1%).

After your holidays, I hope you'll be able to figure out what happens :)
ID: 26263
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26264 - Posted: 12 Mar 2014, 4:34:40 UTC

Posting to thread to inform everyone.
I have managed to grab the input for a couple
of the relevant cases. I am running them right
now on Linux at CERN. Sadly I cannot test Windows
while on vacation. I am keeping an open mind but
I now reckon an ifort problem. Had them before.
We shall see. It is really strange that the version
numbers are different between Linux and Windows,
but that may well have been a compiler issue indeed.
(By the way, I cannot understand why Intel has different
code generation between Linux and Windows. A different OS
interface, OK... but why the code generation, which is surely
just hardware dependent?)

I have appended some notes for your edification/amusement.
If it is not the compiler I have a procedure whereby I identify
the turn where the first diff arises, then the element, and then
each step processing that element. All in HEX as the first
difference is often only 1 ULP. Thanks for all the feedback.
Eric.
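
A minimal sketch of that kind of comparison, assuming the two runs have each been dumped as a list of finite double-precision values. The names `ordered_bits`, `ulp_distance` and `first_difference` are made up for this example and are not part of SixTrack.

```python
import struct


def ordered_bits(value):
    """Map a finite double to an integer that sorts like the float does."""
    bits = struct.unpack("<q", struct.pack("<d", value))[0]
    # Negative doubles have the sign bit set; mirror them below zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a, b):
    """Distance between two finite doubles in units in the last place."""
    return abs(ordered_bits(a) - ordered_bits(b))


def first_difference(run_a, run_b):
    """Return (index, hex_a, hex_b, ulps) for the first mismatch, else None."""
    for index, (a, b) in enumerate(zip(run_a, run_b)):
        # Compare the raw bytes so that even a 1 ULP difference is caught.
        if struct.pack("<d", a) != struct.pack("<d", b):
            return index, a.hex(), b.hex(), ulp_distance(a, b)
    return None
```

Printing the values with `float.hex()` rather than in decimal avoids any rounding in the output itself, which matters when the first difference is a single bit.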

Extract from notes:

SixTrack Version: 4.4.67, Eric McIntosh, 26th September, 2013
-- Just testing gfortran O4

SixTrack Version: 4.4.66, Eric McIntosh, 20th September, 2013
-- Just testing ifort 2013 with O1 (ia32, sse2 and sse3)

SixTrack Version: 4.4.65, Eric McIntosh, 18th September, 2013
-- Re-build with new boinc libs from server_stable (Riccardo)

SixTrack Version: 4.4.64, Eric McIntosh, 16th September, 2013
-- Just testing new ifort 2013 (usevnewifort)

SixTrack Version: 4.4.63, Eric McIntosh, 31st August, 2013
-- Just added a call to boincrf for fort.zip for Windows.

SixTrack Version: 4.4.62, Eric McIntosh, 29th September, 2013
-- Fixed nasty bug in daten iclr.eq.2 concerning exz
   in tracking input
-- Fixing all "formatted read" for fio and _xp stuff.
-- dabnew.s and sixtrack.s. if fio (but NOT Lahey lf95)
   then enable/disable _xp before/after the READ "NEAREST"
   fio overrules crlibm for formatted input



ID: 26264
henry

Joined: 15 Sep 13
Posts: 73
Credit: 5,763
RAC: 0
Message 26267 - Posted: 12 Mar 2014, 11:56:33 UTC - in response to Message 26264.  
Last modified: 12 Mar 2014, 11:57:26 UTC

(By the way I cannot understand why Intel has different
code generation between Linux/Windows.)


Why not? Who in their right mind compares results from Linux with results from Windows to 6 decimal places of precision anyway? Well, OK, BOINCers do but those who find the results differ between OSs just do the smart thing and run the app in a virtual machine.

Your lifelong dream might have had some relevance 20 years ago but nowadays it's about as relevant as building a better steam locomotive... nobody needs one, nobody wants one, nobody will use it in a serious application even if you give them one for free.

Do the sensible thing which is to merge this project with T4T. Let them issue the Linux app to all hosts then find something useful to do with the rest of your life instead of wasting your talent on an anachronism.
6 bangers are for wussies.
ID: 26267
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26268 - Posted: 12 Mar 2014, 12:39:43 UTC - in response to Message 26267.  

Hi Henry; you have said this before and I am not going to spend
time in a flame war. I'll just try and answer your questions when
and if I manage to publish. You might read some of Prof. Kahan's
papers. Why do we have standards? Why have different code generation?
Intel wants to make a profit; what a waste of effort. And all the effort lost
by users debugging. The 15, yes fifteen, decimal digits are required.
We are very concerned about the accumulation of floating-point
error and hope to study it very soon in SixTrack. Incidentally games
programmers are very interested in getting identical results across
platforms too, but in fact many BOINC applications don't need it.
Just to say I take your points and I shall have to try and answer them.
As I said before virtual machines would help ME but do NOT solve
underlying hardware issues. Eric.
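
One well-known way that identical source code can give different last digits is that floating-point addition is not associative, so a compiler that evaluates a sum in a different order can change the result by one unit in the last place. This is offered only as a general illustration, not as a diagnosis of this particular problem; the helper name `sum_left_to_right` is made up for the example.

```python
def sum_left_to_right(values):
    """Add the values strictly in the given order."""
    total = 0.0
    for value in values:
        total += value
    return total


values = [0.1, 0.2, 0.3]
forward = sum_left_to_right(values)         # (0.1 + 0.2) + 0.3
backward = sum_left_to_right(values[::-1])  # (0.3 + 0.2) + 0.1

# Hex output shows the exact bit patterns of the two sums.
print(forward.hex(), backward.hex())
print(forward == backward)
```

The two sums differ in the last bit even though they add the same three numbers, which is the kind of discrepancy that can then be carried forward through many turns of tracking.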
ID: 26268


©2024 CERN