SixTrack: Cross-OS validation = inconclusive results
Author | Message |
---|---|
Joined: 18 Sep 04 Posts: 30 Credit: 5,100,929 RAC: 0 |
Good morning! :) Please check out this inconclusive SixTrack result, where the server tried to compare a Windows result with a Linux result: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=69568938 Isn't it correct that different operating systems may yield slightly different numerical results for the same calculation, and that it is therefore advisable to cross-validate only Windows vs. Windows and Linux vs. Linux (just as AMD results are validated only against AMD and Intel only against Intel, because of differences in floating-point rounding)? If so, WUs that are validated by redundant computation should be handed out only to hosts running the same OS; otherwise computational resources are going to be wasted. Michael. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Thank you Michael, this case is very interesting to me. SixTrack DOES provide complete numeric portability, 0 ULP difference, across operating systems and with different compilers and options. (I hope to publish in the next few months.) However, invalid results can still be produced by, say, an overclocked machine or a hardware error. I am also trying to study the very small percentage of results with numerical errors, where I suspect very fast, large-memory systems. More news when I have had a look. Eric. |
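For readers unfamiliar with the term, "0 ULP difference" means that two results agree down to the last bit of each double-precision number. A minimal Python sketch of how such a distance can be measured follows; it is only an illustration, not SixTrack's or the BOINC validator's code, and the helper names `ordered_bits` and `ulp_distance` are made up for this example.

```python
import struct


def ordered_bits(x: float) -> int:
    """Map a double to an integer whose ordering matches the float ordering."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # Negative doubles have the sign bit set, so the signed integer is negative.
    # Mirror them around zero so that -0.0 and +0.0 both map to 0.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Count the representable doubles separating a and b.

    0 means bit-identical (apart from the sign of zero).
    NaN inputs are not handled in this sketch.
    """
    return abs(ordered_bits(a) - ordered_bits(b))


print(ulp_distance(1.0, 1.0))        # 0
print(ulp_distance(0.1 + 0.2, 0.3))  # 1
```

A portable code in Eric's sense would give a distance of 0 for every number in the result, whichever OS or compiler produced it.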
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Computer ID 10481733 does not have any valid results. All are invalid or inconclusive. I am now trying to stop sending work to it and to contact the owner. Eric. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
OK, I have stopped sending this host 10481733 new work and I have e-mailed the user. Another task is already running the same case, so this should be sorted out very soon. The task seems to run for about 4 hours, BUT the invalid one ran for only 20 seconds! This is very strange and is not a small numeric difference. We shall see. (For the record, I am now 76, can't remember all the details, have just moved office and can't find anything, but managed to find the method of banning, which is to set max work units to -1, as I was told some years ago by Toby Broom, I think :-) Eric |
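The 4 hours versus 20 seconds observation suggests a simple plausibility check on run times. The Python sketch below flags any result in a workunit whose run time is far below the typical one; the function name `flag_short_runs`, the result labels, and the 1% threshold are all assumptions made for illustration, not anything the project is known to use.

```python
from statistics import median
from typing import Dict, List


def flag_short_runs(runtimes_s: Dict[str, float], min_ratio: float = 0.01) -> List[str]:
    """Return the results whose run time is below min_ratio times the median.

    runtimes_s maps a (hypothetical) result label to its run time in seconds.
    """
    typical = median(runtimes_s.values())
    return [name for name, t in runtimes_s.items() if t < min_ratio * typical]


# The case described above: one copy ran about 4 hours, the other 20 seconds.
print(flag_short_runs({"result_a": 4 * 3600, "result_b": 20}))  # ['result_b']
```

With only two copies the median sits halfway between them, so the check is crude; it becomes more useful once a third copy of the task has reported.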
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Correction: the host has 14 valid and 53 invalid results. Eric. |
Joined: 18 Sep 04 Posts: 30 Credit: 5,100,929 RAC: 0 |
Thank you Eric for looking into it in such great detail. :) Michael. P.S.: Generally, none of my machines is overclocked. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Well, I am anyway trying to make an analysis of the work done from 19th May until today, which I shall publish here of course. I think your machine is fine Michael. Eric. |
Joined: 26 Nov 09 Posts: 1 Credit: 78,692 RAC: 0 |
Hello, owner of host 10481733 here. I am in touch with Eric via email and we are working on figuring out what is wrong with my machine. I do not overclock any of my hardware, and everything is running on an SSD, so we can rule out the overclocking hypothesis for my particular machine. However, that still does not rule out the possibility of a hardware issue in the calculation. If there is a preferred test that can be done on Linux to check the reliability of the calculations, please let me know and I will perform it. I do have completed and validated work units on several other projects. I know that is not an apples-to-apples comparison, but it still stands to reason that if the issue were hardware related, I would likely have several other failed results in other projects as well. Any ideas are greatly appreciated. Thank you, -Kevin |
Joined: 18 Sep 04 Posts: 30 Credit: 5,100,929 RAC: 0 |
Kevin wrote: "I do have completed and validated work units on several other projects. I know that is not an apples-to-apples comparison, but it still stands to reason that if the issue were hardware related, I would likely have several other failed results in other projects as well." That is not necessarily the case. I have two GPUs that work error-free with all tested DC projects except one. AND: on one of these cards, that project incompatibility was not even there from the very beginning. It started after about half a year of constant computation. Hardware-wise, still no flaws can be detected. But something on that equipment must have changed over time, and it did not appear to be just dust accumulation causing overheating, judging by the chip temperature sensors. Vice versa, there are hardware testing programs that may report potential failures (e.g. in GPU RAM), but these appear to be false positives because, despite the reported issue, the DC results produced with those boards are successfully cross-validated. However, it might also be, for example, that some projects use more RAM than others, such that a faulty area on the memory chips is only used by some DC apps. Another possibility: a very popular Linux distribution shipped a faulty CPU-RAM test for quite a while, and so on... All possible... I have used many machines, CPUs and GPUs, for more than 15 years in distributed computing on a 24/7 basis. So I have seen quite a few apparently strange things in this respect. One of the problems with an accurate diagnosis on a scientific basis is that many DC projects are often not very helpful, because they do not care about a single user's reported issue. One example: I currently have a machine that contributed to a DC project for more than a year using its GTX 770 card. From one day to the next, that machine ceased receiving work for no obvious reason. To date, the project leader board has not managed to help solve the problem. Well, it's their project, and they now have one constantly contributing machine less. Michael. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Well, this is a rather complex topic. I am working on an analysis of the recent Pentathlon results and hope I will end up with, amongst other things, a procedure to run regularly to identify "bad" SixTrack hosts. In the meantime here are some numbers from 2015 showing some data for (perhaps not all) hosts with at least one invalid. I used this at the time to "ban" "bad" hosts. Note that SixTrack is extremely Fl. Pt intensive, and may cause overheating especially when running multiple copies. There is also the question of random memory failures: do all mother boards have SECDED?, cosmic rays!, etc etc Eric. Results based on database 31st March, 2015 but only looking at hosts with previous invalids in my log. Totals 153287 2504 134690 16093 HostID Total Invalid Valid Don't know 9996388 45057 870 44185 2 10334649 291 210 48 33 10353795 75 56 5 14 10338771 55 47 7 1 10151727 170 43 78 49 10352446 43 42 0 1 10352445 40 37 2 1 10348627 507 30 412 65 10332524 65 30 22 13 10325398 37 30 4 3 8424458 234 29 156 49 10332528 59 29 15 15 10282106 176 24 123 29 10340508 42 21 20 1 10311622 53 20 0 33 10346343 142 19 116 7 10347301 110 18 88 4 10352443 21 17 3 1 10350925 113 16 80 17 10322435 37 16 3 18 9966001 48 15 16 17 9950917 230 15 166 49 10336881 196 15 164 17 10280835 34 15 18 1 9971391 50 14 6 30 10344553 34 14 8 12 10320266 19 14 4 1 10306147 15 14 0 1 9988108 142 13 112 17 9893974 97 12 74 11 10352700 28 12 7 9 10349102 69 12 43 14 10341111 49 11 21 17 9960332 281 10 262 9 9955763 27 10 4 13 10236611 36 10 12 14 9892877 131 9 89 33 10348281 26 9 1 16 10309217 36 9 10 17 9840513 55 8 33 14 10343421 40 8 30 2 10353862 21 7 4 10 10353802 9 7 1 1 10349590 74 7 42 25 10348159 24 7 10 7 10344590 17 7 3 7 10332410 32 7 19 6 10305755 40 7 8 25 10342366 25 6 11 8 10341116 32 6 9 17 10336862 56 6 41 9 9931381 35 5 13 17 9902316 48 5 29 14 10350872 42 5 18 19 10350787 44 5 23 16 10344573 8 5 2 1 10343210 34 5 17 12 10341205 25 5 11 9 10308608 12 5 3 4 10299309 129 5 107 17 
9991405 24 4 19 1 9963745 16 4 8 4 9953042 348 4 311 33 9884793 24 4 18 2 10348032 6 4 1 1 10347163 6 4 1 1 10345781 57 4 39 14 10337368 24 4 12 8 10331329 53 4 34 15 10317397 9 4 0 5 10311474 20 4 7 9 10301852 25 4 12 9 10190874 178 4 141 33 10139869 31 4 10 17 9992985 17 3 10 4 9991848 24 3 9 12 9958881 16 3 12 1 9941645 61 3 35 23 10351401 36 3 28 5 10343984 19 3 4 12 10343459 416 3 348 65 10330237 19 3 12 4 10327944 9 3 5 1 10312108 58 3 52 3 10236607 28 3 16 9 10169901 53 3 41 9 9971261 34 2 15 17 9966659 25 2 6 17 9933653 40 2 32 6 9919005 260 2 245 13 9881897 228 2 217 9 10353661 53 2 35 16 10352846 7 2 4 1 10351174 423 2 383 38 10348361 32 2 1 29 10345918 26 2 23 1 10342502 116 2 106 8 10341421 47 2 41 4 10336298 808 2 757 49 10335589 5 2 2 1 10333179 31 2 4 25 10327857 6 2 3 1 10324555 44 2 40 2 10322210 346 2 330 14 10321892 6 2 1 3 10314902 301 2 266 33 10313334 21 2 18 1 10309195 173 2 154 17 10298281 39 2 28 9 10258019 39 2 19 18 10200198 34 2 26 6 10143577 36 2 31 3 9999056 166 1 140 25 9997923 34 1 27 6 9996566 116 1 100 15 9996322 166 1 148 17 9995685 85 1 82 2 9993117 35 1 18 16 9992739 34 1 32 1 9991766 47 1 43 3 9991322 233 1 199 33 9990804 240 1 210 29 9990687 5 1 3 1 9989622 25 1 17 7 9988234 56 1 22 33 9988040 32 1 27 4 9974819 42 1 24 17 9973718 5 1 2 2 9972836 44 1 26 17 9972175 37 1 19 17 9971538 54 1 46 7 9971256 34 1 16 17 9971242 46 1 28 17 9971132 16 1 11 4 9969228 3 1 1 1 9966658 5 1 1 3 9966525 26 1 12 13 9966301 37 1 33 3 9965532 134 1 79 54 9964314 72 1 54 17 9963920 25 1 18 6 9963512 25 1 19 5 9962374 34 1 30 3 9960785 26 1 24 1 9957650 13 1 11 1 9955291 95 1 80 14 9954883 36 1 32 3 9954612 55 1 49 5 9954039 60 1 50 9 9953507 3 1 1 1 9952901 7 1 5 1 9951177 74 1 42 31 9948162 38 1 33 4 9946511 16 1 11 4 9944798 137 1 123 13 9942760 134 1 68 65 9941855 26 1 13 12 9938031 317 1 283 33 9937649 96 1 78 17 9936629 10 1 7 2 9935624 5 1 3 1 9934742 3 1 0 2 9932526 5 1 2 2 9931058 172 1 151 20 9930241 36 1 31 4 9925514 6 1 2 3 9925468 3 1 
1 1 9923236 24 1 16 7 9922920 5 1 3 1 9921334 5 1 3 1 9919857 138 1 120 17 9918197 5 1 1 3 9917887 239 1 218 20 9916011 46 1 43 2 9914048 81 1 67 13 9907027 116 1 107 8 9905027 30 1 19 10 9895131 5 1 2 2 9890014 22 1 18 3 9887801 26 1 23 2 9882122 32 1 25 6 9873056 4 1 1 2 9872679 90 1 83 6 9870066 19 1 12 6 9853997 319 1 291 27 9846970 133 1 103 29 9778571 115 1 83 31 9775346 37 1 33 3 9771200 34 1 27 6 9766407 25 1 15 9 9704241 120 1 111 8 9704204 6 1 4 1 9685969 26 1 22 3 9666602 15 1 5 9 9652331 66 1 59 6 9640654 37 1 35 1 9636932 85 1 67 17 9628514 80 1 68 11 8307698 28 1 23 4 4481947 41 1 35 5 21673 5 1 0 4 10354454 3 1 1 1 10354419 129 1 102 26 10354216 67 1 52 14 10354210 106 1 88 17 10353993 38 1 30 7 10353991 5 1 1 3 10353942 33 1 17 15 10353906 7 1 4 2 10353824 38 1 31 6 10353601 85 1 67 17 10353435 7 1 2 4 10353262 5 1 3 1 10352905 84 1 64 19 10352744 31 1 3 27 10352738 28 1 22 5 10352626 41 1 39 1 10352615 51 1 22 28 10352534 16 1 14 1 10352270 103 1 69 33 10352203 152 1 134 17 10352151 17 1 15 1 10352136 5 1 1 3 10351816 371 1 363 7 10351308 16 1 14 1 10351275 146 1 116 29 10351255 1392 1 1290 101 10351254 840 1 710 129 10351253 1378 1 1272 105 10351251 1166 1 1074 91 10351246 1355 1 1246 108 10351209 194 1 160 33 10351061 35 1 17 17 10350959 32 1 23 8 10350934 2 1 0 1 10350877 5 1 3 1 10350643 237 1 204 32 10350636 3 1 1 1 10350601 45 1 36 8 10350588 5 1 1 3 10350435 139 1 135 3 10350016 33 1 26 6 10349639 50 1 36 13 10349506 322 1 288 33 10349334 74 1 56 17 10349278 75 1 68 6 10349165 26 1 22 3 10349148 4 1 2 1 10348882 36 1 26 9 10348526 137 1 119 17 10348518 16 1 6 9 10348381 8 1 6 1 10348346 305 1 275 29 10348267 5 1 2 2 10348155 3 1 1 1 10348107 85 1 79 5 10347987 111 1 77 33 10347905 30 1 27 2 10347686 16 1 6 9 10347099 224 1 190 33 10346841 60 1 40 19 10346670 34 1 0 33 10346648 6 1 3 2 10346420 45 1 27 17 10346167 30 1 13 16 10346139 56 1 45 10 10346131 321 1 287 33 10346001 14 1 6 7 10345984 24 1 16 7 10345916 16 1 8 7 10345721 5 1 3 1 
10345683 372 1 263 108 10345639 55 1 51 3 10345459 66 1 48 17 10345449 141 1 123 17 10345425 139 1 121 17 10345408 66 1 48 17 10345271 33 1 19 13 10345184 40 1 36 3 10345122 32 1 21 10 10344886 36 1 26 9 10344813 38 1 33 4 10344811 71 1 41 29 10344805 171 1 169 1 10344425 30 1 20 9 10344406 5 1 1 3 10344378 65 1 55 9 10344330 27 1 16 10 10344183 6 1 4 1 10344075 96 1 78 17 10344017 5 1 1 3 10343498 106 1 102 3 10343448 24 1 13 10 10343239 16 1 12 3 10343136 32 1 26 5 10343135 6 1 3 2 10342951 10 1 7 2 10342873 57 1 40 16 10342632 55 1 37 17 10342507 245 1 235 9 10342384 55 1 41 13 10342304 39 1 34 4 10342197 36 1 10 25 10341892 521 1 472 48 10341890 586 1 488 97 10341868 832 1 745 86 10341858 776 1 678 97 10341382 16 1 13 2 10340899 253 1 229 23 10340552 32 1 14 17 10340522 74 1 70 3 10340311 31 1 27 3 10340276 84 1 74 9 10339911 210 1 192 17 10339873 169 1 151 17 10339698 5 1 2 2 10339155 164 1 141 22 10338745 235 1 219 15 10338664 34 1 30 3 10338102 103 1 85 17 10337429 16 1 13 2 10336929 25 1 19 5 10336792 161 1 148 12 10336747 34 1 31 2 10336397 739 1 679 59 10336392 760 1 662 97 10336391 634 1 536 97 10336384 1116 1 1018 97 10335980 334 1 297 36 10335978 6 1 4 1 10335869 38 1 32 5 10335778 364 1 330 33 10335752 44 1 41 2 10335522 24 1 14 9 10335222 3 1 1 1 10335093 671 1 617 53 10334839 367 1 350 16 10334758 55 1 48 6 10334611 116 1 111 4 10334209 1127 1 997 129 10334196 990 1 877 112 10333493 9 1 6 2 10333225 329 1 295 33 10332370 18 1 14 3 10331578 183 1 117 65 10331432 16 1 12 3 10331262 37 1 34 2 10331100 245 1 220 24 10330911 5 1 2 2 10330867 310 1 260 49 10330558 166 1 155 10 10330401 62 1 58 3 10329635 20 1 18 1 10329451 139 1 109 29 10329057 16 1 10 5 10328885 142 1 132 9 10328871 36 1 18 17 10327610 39 1 29 9 10327237 44 1 36 7 10327154 38 1 12 25 10326961 50 1 32 17 10326604 106 1 75 30 10326422 48 1 30 17 10326361 58 1 32 25 10325688 5 1 1 3 10325647 158 1 132 25 10325580 76 1 42 33 10324997 138 1 88 49 10324946 5 1 3 1 10324651 5 1 1 3 10324235 16 
1 7 8 10323958 41 1 38 2 10323685 3 1 1 1 10323095 39 1 21 17 10322405 35 1 18 16 10322380 38 1 20 17 10322239 5 1 3 1 10321861 32 1 27 4 10321730 34 1 16 17 10320885 2 1 0 1 10320732 23 1 10 12 10320350 36 1 32 3 10320295 139 1 133 5 10320269 19 1 13 5 10320222 17 1 15 1 10320150 288 1 261 26 10320038 38 1 26 11 10319367 5 1 3 1 10319287 25 1 16 8 10319184 146 1 141 4 10319077 69 1 56 12 10318981 37 1 31 5 10318971 362 1 329 32 10318518 16 1 12 3 10317991 37 1 19 17 10316574 58 1 51 6 10315831 49 1 17 31 10315727 139 1 127 11 10315099 25 1 22 2 10314074 6 1 4 1 10314008 233 1 199 33 10313467 6 1 2 3 10311640 16 1 14 1 10311209 33 1 29 3 10310115 641 1 590 50 10310009 752 1 705 46 10309867 35 1 13 21 10309665 42 1 33 8 10309196 194 1 160 33 10309122 36 1 34 1 10308732 234 1 212 21 10307865 16 1 7 8 10307516 65 1 48 16 10307419 30 1 12 17 10306843 38 1 4 33 10306655 83 1 61 21 10306550 394 1 351 42 10306366 35 1 25 9 10306110 34 1 30 3 10306097 36 1 26 9 10306071 39 1 29 9 10305666 18 1 16 1 10305325 34 1 0 33 10304346 40 1 22 17 10304110 113 1 103 9 10303972 264 1 254 9 10303902 75 1 57 17 10303879 14 1 6 7 10303534 34 1 16 17 10302798 154 1 136 17 10301869 86 1 81 4 10301821 195 1 188 6 10301687 16 1 13 2 10301325 34 1 21 12 10300314 85 1 67 17 10299952 68 1 62 5 10299110 69 1 57 11 10298502 91 1 73 17 10298501 34 1 24 9 10298280 45 1 36 8 10297885 35 1 26 8 10297845 41 1 31 9 10295251 186 1 161 24 10294482 3 1 1 1 10294353 50 1 44 5 10294211 16 1 7 8 10292713 149 1 115 33 10292684 14 1 8 5 10291556 45 1 27 17 10291342 5 1 2 2 10290431 37 1 19 17 10290025 156 1 142 13 10288569 39 1 33 5 10288255 6 1 4 1 10287984 32 1 28 3 10287866 176 1 171 4 10287031 16 1 9 6 10286373 65 1 56 8 10285998 30 1 22 7 10285945 95 1 70 24 10285665 4 1 2 1 10284480 30 1 12 17 10282264 224 1 186 37 10281825 10 1 6 3 10281788 35 1 17 17 10281657 39 1 23 15 10281444 37 1 25 11 10280746 116 1 94 21 10261912 241 1 237 3 10261618 112 1 92 19 10261232 78 1 64 13 10261158 21 1 13 7 10261108 340 
1 302 37 10258084 37 1 30 6 10257711 137 1 114 22 10236016 64 1 55 8 10235574 30 1 16 13 10235324 128 1 102 25 10235139 37 1 33 3 10234982 8 1 4 3 102105 30 1 16 13 10190689 146 1 112 33 10158514 38 1 28 9 10158176 14 1 11 2 10157836 59 1 49 9 10156328 35 1 22 12 10156028 48 1 44 3 10154879 18 1 10 7 10154412 6 1 3 2 10153637 15 1 12 2 10153032 32 1 26 5 10143256 32 1 26 5 10142647 35 1 18 16 10142641 39 1 21 17 10142376 34 1 28 5 10142108 32 1 21 10 10141941 116 1 106 9 10139952 38 1 33 4 10139306 34 1 16 17 10139061 36 1 18 17 10138488 64 1 59 4 10011827 34 1 29 4 10011343 25 1 21 3 10010629 34 1 16 17 10010063 116 1 82 33 10001769 111 1 93 17 10000380 32 1 14 17 |
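The listing above is a flat stream of five integers per host (HostID, Total, Invalid, Valid, Don't know). A minimal Python sketch of how it could be parsed and ranked by invalid fraction follows. The names `HostStats`, `parse_rows`, and `suspects` and both thresholds are illustrative assumptions, not Eric's procedure; the consistency check assumes Total = Invalid + Valid + Don't know, which holds for the Totals line and for the four rows sampled here.

```python
from typing import List, NamedTuple


class HostStats(NamedTuple):
    host_id: int
    total: int
    invalid: int
    valid: int
    unknown: int  # the "Don't know" column

    @property
    def invalid_fraction(self) -> float:
        return self.invalid / self.total if self.total else 0.0


def parse_rows(flat: str) -> List[HostStats]:
    """Split a whitespace-separated stream of integers into 5-column rows."""
    numbers = [int(tok) for tok in flat.split()]
    if len(numbers) % 5:
        raise ValueError("expected a multiple of 5 integers")
    rows = [HostStats(*numbers[i:i + 5]) for i in range(0, len(numbers), 5)]
    for row in rows:
        # Assumed invariant; a split or truncated row would fail here.
        if row.total != row.invalid + row.valid + row.unknown:
            raise ValueError(f"row for host {row.host_id} does not add up")
    return rows


def suspects(rows: List[HostStats], min_total: int = 20,
             min_fraction: float = 0.5) -> List[HostStats]:
    """Hosts with enough results and a high invalid fraction, worst first."""
    picked = [r for r in rows
              if r.total >= min_total and r.invalid_fraction >= min_fraction]
    return sorted(picked, key=lambda r: r.invalid_fraction, reverse=True)


# Four rows copied from the listing above.
SAMPLE = ("9996388 45057 870 44185 2 10334649 291 210 48 33 "
          "10353795 75 56 5 14 10348627 507 30 412 65")

for host in suspects(parse_rows(SAMPLE)):
    print(host.host_id, f"{host.invalid_fraction:.0%}")
```

Ranking by fraction instead of by raw count matters here: host 9996388 has the most invalid results (870) but they are under 2% of its 45057 tasks, while host 10334649 has 210 invalid out of only 291.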
Joined: 2 May 07 Posts: 2101 Credit: 159,819,191 RAC: 123,837 |
I only have openSUSE (x64) 13.2 and 42.2 running as guest OSes under Windows 10 Pro (x64). They run only SixTrack and WCG, whenever they get work from one of these projects. They have done so for a long time, without any hardware problems so far. I will check whether these computers are in your list. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
....and anecdotally, I experienced failures with SixTrack on a CERN Computer Centre batch node a few years ago. Nothing could be found on testing, but the machine "died" a couple of weeks later. Eric. |
Joined: 18 Sep 04 Posts: 30 Credit: 5,100,929 RAC: 0 |
:) Michael. |
Joined: 23 Jan 17 Posts: 29 Credit: 375,570 RAC: 0 |
Hi Kevin, if you have some experience with Linux, you could try to build the executable and test harness locally and run through the tests. To do that, clone the sources from GitHub ( https://github.com/sixtrack/sixtrack ), cd into the "SixTrack" subfolder, and run "./cmake_six BUILD_TESTING". This will make a new folder with the executable. You can then run the tests with "ctest -j 4", where I assumed that you want to run on 4 cores. You can also run only the fast, medium, or slow tests by adding the flag "-L fast", where you can replace fast with medium or slow. If that all passes, the next step would be to build the real BOINC version: first build the supporting libraries by running the "buildLibraries.sh" script, then build SixTrack with "./cmake_six BUILD_TESTING BOINC API LIBARCHIVE CR BIGNBLZ". FYI, you may have to install some libraries, especially some static libraries, to do these builds. It should be possible to "cheat" and do a dynamic build by adding "-STATIC" to the cmake_six command. |
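For anyone who wants to script the first part of this recipe, here is a rough Python sketch that only assembles the commands quoted in the post. Several things are assumptions: that git clones into a folder named "sixtrack", that ctest is meant to be run from the folder cmake_six creates, and the `build_dir` argument itself, which you must fill in with whatever folder name cmake_six reports (the post does not give it). The function names are made up for this example.

```python
import subprocess
from typing import List, Optional, Tuple

REPO_URL = "https://github.com/sixtrack/sixtrack"

Step = Tuple[str, List[str]]  # (working directory, command line)


def build_and_test_steps(build_dir: str, jobs: int = 4,
                         label: Optional[str] = None) -> List[Step]:
    """List the commands from the post; build_dir is supplied by the user."""
    ctest = ["ctest", "-j", str(jobs)]
    if label is not None:
        ctest += ["-L", label]  # fast, medium or slow
    return [
        (".", ["git", "clone", REPO_URL]),
        ("sixtrack/SixTrack", ["./cmake_six", "BUILD_TESTING"]),
        (build_dir, ctest),  # assumed: run the tests from the new build folder
    ]


def run_steps(steps: List[Step]) -> None:
    """Run each command in its directory, stopping at the first failure."""
    for cwd, argv in steps:
        subprocess.run(argv, cwd=cwd, check=True)


# Only print the plan here; call run_steps(...) yourself to execute it.
for cwd, argv in build_and_test_steps("PATH/TO/BUILD/FOLDER", label="fast"):
    print(cwd, ":", " ".join(argv))
```

The BOINC build (buildLibraries.sh and the longer cmake_six command) is left out on purpose, since it depends on which libraries are installed on the machine.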
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Just to explain a bit; hope to have a fix very very soon. Eric. Because a null/empty fort.10 is treated as Valid, we have a major problem. For some reason, somewhere in the SixDesk/BOINC servers at CERN and the BOINC clients, we are now getting many more of these than in the past. I do not know how bad it is or how many there are, as we still do not know where to find the archived assimilator and validator logs. This means that two null results can be validated against each other and a possibly valid result invalidated. A real mess. Perhaps we could temporarily increase the number of copies of each WU to, say, 5, but that is a horrible workaround and a waste of volunteer resources. It would be much better to invalidate null/empty fort.10 files to get some meaningful numbers. |
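To make the failure mode concrete, here is a small Python sketch of a quorum check that rejects null/empty fort.10 content before comparing anything. It is not the project's validator: the function names, the verdict strings, the quorum of 2, and the byte-for-byte comparison are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def is_null_result(fort10: bytes) -> bool:
    """Treat an empty or whitespace-only fort.10 as a null result."""
    return len(fort10.strip()) == 0


def validate(results: Dict[str, bytes],
             min_quorum: int = 2) -> Tuple[Optional[bytes], Dict[str, str]]:
    """Return (canonical content or None, verdict per result)."""
    verdicts: Dict[str, str] = {}
    usable: Dict[str, bytes] = {}
    for name, content in results.items():
        if is_null_result(content):
            verdicts[name] = "invalid"  # never lets two nulls agree
        else:
            usable[name] = content

    canonical = None
    counts = Counter(usable.values())
    if counts:
        content, n = counts.most_common(1)[0]
        if n >= min_quorum:
            canonical = content

    for name, content in usable.items():
        if canonical is None:
            verdicts[name] = "inconclusive"  # wait for another copy
        else:
            verdicts[name] = "valid" if content == canonical else "invalid"
    return canonical, verdicts


# Two null results and one real one: the real one is no longer outvoted.
print(validate({"r1": b"", "r2": b"", "r3": b"1.0 2.0\n"})[1])
```

In this sketch the two empty results are marked invalid and the real one stays inconclusive until another non-empty copy arrives, which is the behaviour Eric asks for without raising the number of copies per WU.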
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Further updates on this topic will go to the SixTrack Application Message Board. Eric. |
©2024 CERN