Message boards : Sixtrack Application : Inconclusive, valid/invalid results
Joined: 21 Aug 07 Posts: 46 Credit: 1,503,835 RAC: 0
I thought an update on my inconclusives is warranted here. After the old validator was reimplemented, my inconclusives immediately dropped back to 5 from around 18. Since then, they have slowly climbed back to the current level of 19. And, as I stated below (relative to the new validator), the new group is different. That is, I am not only getting inconclusives when paired with x86_64-pc-linux-gnu hosts, but now also with some Windows hosts: see Validation inconclusive tasks for Stick.

My earlier post, for reference: "Don't know if this is good or bad news, but immediately after the validator change, my inconclusive count jumped from 6 to 11. And the new group is very different. Prior to the change, all 6 of my inconclusives were paired against tasks done by x86_64-pc-linux-gnu machines. Now, 4 out of the 5 new ones were pairings between my SixTrack v451.07 (sse2) windows_x86_64 tasks and a variety of machines running SixTrack v451.07 (pni) windows_x86_64."
Joined: 13 Jul 05 Posts: 133 Credit: 162,641 RAC: 0
Inconclusives rising again? Situation here is as follows:
Host 10414945
Total WU 181
In progress 56
Pending 50
Inconclusive 38
Valid 37
Invalid 0
Error 0
Joined: 27 Sep 08 Posts: 847 Credit: 691,381,453 RAC: 102,769
For all my hosts the numbers are:
Pending 66.03%
Inconclusive 15.31%
Valid 18.66%
It appears there is still a big backlog on the validator. I will make a detailed analysis this weekend.
Joined: 27 Sep 08 Posts: 847 Credit: 691,381,453 RAC: 102,769
I took my last 100 inconclusive results. Some trends:
- My hosts are all Windows; 87% of the inconclusives were paired with a Linux wingman.
- The most common wingman CPUs, each appearing 10 times: AMD FX-8300 (3.19.0-32-generic), E5-2630 v3 (2.6.32-642.15.1.el6.x86_64, Windows 8.1 & Windows 10) and E5-2699C v4 (4.1.12-94.3.5.el7uek.x86_64).
- 43% have a CPU time of less than 50 sec; 25% are above 1000 sec on the wingman.
- For the short ones there is a high probability that the Linux host fails in a much shorter time than the Windows host; for the long ones the times are much closer together.
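For anyone who wants to reproduce this kind of tally, here is a minimal sketch. It assumes the inconclusive task pairs have already been copied from the results pages into a hypothetical CSV with columns `wingman_os`, `wingman_cpu` and `wingman_cpu_time_s`; the column names, the file name and the 50 s / 1000 s buckets come from the post above and from my own assumptions, not from any project tool.

```python
# Sketch: summarise inconclusive pairings from a hand-made CSV export.
# Assumed columns: wingman_os, wingman_cpu, wingman_cpu_time_s
import csv
from collections import Counter

def summarise(path: str) -> None:
    os_counts = Counter()
    cpu_counts = Counter()
    short = long_ = total = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            os_counts[row["wingman_os"]] += 1
            cpu_counts[row["wingman_cpu"]] += 1
            t = float(row["wingman_cpu_time_s"])
            if t < 50:
                short += 1
            elif t > 1000:
                long_ += 1
    if not total:
        print("no rows found")
        return
    print(f"{total} inconclusive results")
    for os_name, n in os_counts.most_common():
        print(f"  {n:3d} ({100 * n / total:.0f}%) wingman OS: {os_name}")
    print("most common wingman CPUs:", cpu_counts.most_common(3))
    print(f"wingman CPU time < 50 s: {100 * short / total:.0f}%, "
          f"> 1000 s: {100 * long_ / total:.0f}%")

# summarise("inconclusives.csv")  # hypothetical export, not a project feature
```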
Joined: 24 Oct 04 Posts: 1173 Credit: 54,785,561 RAC: 15,033
Right now I am at:
23 Valid
22 Validation inconclusive
179 Validation pending
Joined: 11 Feb 13 Posts: 22 Credit: 20,728,480 RAC: 31
Yes, I've noticed my inconclusives rising. Currently sitting at:
Validation Pending - 236
Validation Inconclusive - 168
Valid - down from 100 yesterday to currently 83
https://lhcathome.cern.ch/lhcathome/results.php?userid=250933
Joined: 13 Jul 05 Posts: 133 Credit: 162,641 RAC: 0
Any light at the end of the tunnel yet? Situation here is as follows:
Host 10414945
Total WU 513
In progress 77
Pending 80
Inconclusive 190
Valid 166
Invalid 0
Error 0
Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0
The number to look at is the number of Invalid results, rather than Inconclusives. If you have zero Invalids then there is nothing wrong with your host. You will need to wait a little longer for the resent job to be returned, but this is similar to having a job Pending from a slower wingman, or from someone with a larger cache of work than their machine can return in reasonable time.

Eric has been working hard at isolating rogue hosts and optimising the validation process. Better to get an inconclusive result and a short delay than to have a bad result falsely validated as good.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
I am now waiting for management/support to take action. Eric.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Thanks Ray, "short" delay is too long for me, but I am hoping for some management/support action tomorrow latest. Eric. The number to look at is the number of Invalid results rather than Inconclusives. If you have zero Invalids then there is nothing wrong with your host. You will need to wait a little longer for the resent job to be returned but this is just similar to having a job Pending from a slower wingman or someone with a larger cache of work than their machine can return in reasonable time. |
Joined: 27 Sep 08 Posts: 847 Credit: 691,381,453 RAC: 102,769
Things are looking much better:
Error 0.02%
Invalid 0.05%
Valid 43.12%
Inconclusive 25.58%
Pending 31.23%
Looks like the re-runs are coming through and a 2nd wingman validates.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Thanks Toby, and I can also report that overnight "No consensus" results have dropped 20,516 -> 16,660 -> 12,281 (last 24 hours) and 445,068 -> 444,151 -> 443,080 (last 7 days). Still some way to go, but patience is called for. I should also finally, after 3 years!, be able to make a proper analysis of the real invalid results and of other problems.
Joined: 6 Mar 12 Posts: 7 Credit: 3,130,996 RAC: 0
What's wrong with my statistics?
#https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=70891665
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10483219&offset=80&show_names=0&state=3&appid=
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10415523&offset=0&show_names=0&state=3&appid=
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10482091&offset=160&show_names=0&state=3&appid=
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10450260&offset=20&show_names=0&state=3&appid=
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Will look at this soonest. Very interesting and useful. Eric.
Joined: 27 Sep 08 Posts: 847 Credit: 691,381,453 RAC: 102,769
Most are waiting for a 3rd host to re-run the WU and validate (or not). For 70891665, no one could agree, so it's marked as can't validate since the maximum number of tries is 5. In this case it's probably a bad WU, as the probability of 4 different computers all being bad is very low. Eric could comment on how a WU could be bad from a fundamental perspective.
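As a back-of-the-envelope illustration of why a bad WU is the more plausible explanation when several replicas fail, one can compare the two hypotheses under an assumed per-host failure rate. The 1% figure below is purely illustrative, not a measured project statistic.

```python
# Illustrative only: probability that 4 independent hosts all return a bad
# result, assuming each host fails independently with probability p.
# p = 0.01 is an arbitrary assumption, not a project measurement.
p = 0.01
hosts = 4
p_all_hosts_bad = p ** hosts
print(f"P(all {hosts} hosts bad) = {p_all_hosts_bad:.0e}")  # 1e-08
# So if 4 replicas of the same WU all fail or disagree, the work unit
# (or its input) is by far the more likely culprit.
```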
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Well, this is the BIG problem. We had a three-hour discussion today, but not everyone, if anyone, agrees with me. We are seeing too many bad/null results and mixed-up results, but still getting a lot done. I think this is a good thing in the long term, BUT we must identify the source of these bad/null results. I myself suspect corrupted input and/or output, with good reason.

Sadly, I guess that after max tries the WU will be abandoned, even though the work unit itself is fine. So no credits for these, and the CERN user will have to re-submit. Nonetheless, I now see a few cases with 3 runs, one rejected and two good results. I also see 553,615 valid results in the last 7 days. Great, but not good enough.

I plead my case: I am not responsible for these errors, nor for all the transient errors we have seen. (I did stop/start the validator on 24th June, and problems started around 03:00 on 25th.) The sixtrack_validator has been much improved since then by my colleagues. A side effect is the problem you, and I, and others have seen. If we are lucky and you don't get a 3rd bad result, all will be well. The number of inconclusive "No consensus" results is decreasing, but too slowly. I think we must identify the source of these bad/null/invalid results. (Just my opinion.) The much improved validator at least allows us to clearly identify the bad tasks. (Atlas has the same or a very similar problem; not sure they know that!)

I have NEVER found a fundamental problem with a Work Unit. They may exist, but the CERN users are responsible people who check their initial conditions before submitting to BOINC/LHC@home. At least the number of inconclusives is down to 5,939 for the last 24 hours and, while still at 406,573 for the last 7 days, is also decreasing. This is in fact good news, I think, as we are down by almost 40,000. Patience, which I am sadly lacking, is required all round. We live in hope and are working hard to sort this out. Eric.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Please see my reply: http://lhcathomeclassic.cern.ch/sixtrack/forum_thread.php?id=4306&postid=31311 Is that a hyperlink???? I have the same problem myself with some wzero/jtbb2cm1 cases. Eric.
Joined: 6 Mar 12 Posts: 7 Credit: 3,130,996 RAC: 0
Eric, thank you so much for the huge amount of work you do for all of us. I did not expect that this simple question would cause such a heated debate.

db4m2 (wherein: db - [D]istributed [B]oinc, 4 - [F]or, m - [M]e, 2 - [T]o) errors: Too many total results
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=72453472

Today is Friday. Need to change db - [D]istributed [B]oinc to db - [D]ouble [B]eer.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
I have finally understood what you and my colleagues have been trying to tell me! We are in the unfortunate situation where we are unable to distinguish between an infrastructure failure and a case where all particles are lost before completing (typically) a thousand turns. I am a bit slow sometimes, and I am obsessed by the unexplained failures, of which there are currently too many. To try and be clear, if I can:

SixTrack has three phases: pre-processing, tracking and post-processing. Under the present setup, and at the request of a colleague more than ten years ago, a SixTrack run on LHC@home returns only one file, fort.10, as a result from the post-processing phase. This was done in order to reduce network traffic and the load on the server. fort.10, the result, has one very long line for each particle, 60 double-precision floating-point numbers per line, normally 60 lines, around 40 kilobytes.

IF we have a bad set of initial conditions (very unlikely, but possible) leading to a pre-processing failure,
OR, more likely, all particles are lost in tracking before completing typically a thousand turns,
OR we never perform post-processing for some other reason,
THEN SixTrack stops and returns an empty fort.10 result.
ENDIF

BUT infrastructure failures or run-time errors may also produce a null/empty result file. The sixtrack_validator now rejects such null results but clearly identifies that it has done so. We cannot distinguish between a genuine "all particles lost" and some other failure, like a run-time crash, segmentation violation, etc. These cases will now never be validated, and after 5 attempts the Work Unit will be scrapped. That is the bad news. Random infrastructure failures should normally be rejected and invalidated, as we expect 2 other valid results. The relatively good news is that since all particles are lost, little CPU time, real time, or credit is wasted.

PHEW.... My colleagues are currently trying to test a new SixTrack version which will allow us to clearly distinguish the reason for a null/empty result file; then our problem will be solved, and we should in particular be able to identify the source of empty results which are not due to all particles being lost.

Summary: no action is required by you. A new SixTrack will be released as soon as possible. Eric.
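To make the failure mode concrete, here is a rough sketch of the kind of file check Eric's description implies: it only distinguishes an empty fort.10 from one matching the layout described above (one line of 60 double-precision values per particle). It is not the actual sixtrack_validator logic, just an illustration, under those stated assumptions, of why an empty file is ambiguous on its own.

```python
# Sketch of a fort.10 sanity check based on the format described above:
# one line per particle, 60 floating-point values per line, ~60 lines.
# It cannot tell a genuine "all particles lost" run from a crash that also
# left an empty file -- which is exactly the ambiguity discussed here.
import os

def classify_fort10(path: str) -> str:
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        # Ambiguous case: physics outcome or infrastructure failure.
        return "empty: all particles lost OR infrastructure/runtime failure"
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            values = line.split()
            if len(values) != 60:
                return f"malformed: line {lineno} has {len(values)} values, expected 60"
            for v in values:
                try:
                    # Tolerate Fortran-style 'D' exponents, if present.
                    float(v.replace("D", "E").replace("d", "e"))
                except ValueError:
                    return f"malformed: non-numeric value {v!r} on line {lineno}"
    return "looks like a well-formed result file"

# print(classify_fort10("fort.10"))
```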
Joined: 15 Jun 08 Posts: 2532 Credit: 253,722,201 RAC: 34,439
Is it possible to do the following workaround? IF