Message boards : Sixtrack Application : Inconclusive, valid/invalid results
Joined: 21 Aug 07 Posts: 46 Credit: 1,503,835 RAC: 0
I thought an update on my inconclusives is warranted here. After the old validator was reimplemented, my inconclusives immediately dropped back to 5 from around 18. Since then, they have slowly climbed back to the current level of 19. And, as I stated below (relative to the new validator), the new group is different. That is, I am not only getting inconclusives when paired with x86_64-pc-linux-gnu hosts, but now also with some Windows hosts: see Validation inconclusive tasks for Stick.

My earlier post, for reference: "Don't know if this is good or bad news, but immediately after the validator change, my inconclusive count jumped from 6 to 11. And the new group is very different. Prior to the change, all 6 of my inconclusives were paired against tasks done by x86_64-pc-linux-gnu machines. Now, 4 out of the 5 new ones were pairings between my SixTrack v451.07 (sse2) windows_x86_64 tasks and a variety of machines running SixTrack v451.07 (pni) windows_x86_64."
Joined: 13 Jul 05 Posts: 133 Credit: 162,641 RAC: 0
Inconclusives rising again? Situation here is as follows:
Host 10414945
Total WU 181
In progress 56
Pending 50
Inconclusive 38
Valid 37
Invalid 0
Error 0
Joined: 27 Sep 08 Posts: 847 Credit: 691,381,453 RAC: 102,769
For all my hosts the numbers are:
Pending 66.03%
Inconclusive 15.31%
Valid 18.66%
It appears there is still a big backlog on the validator. I will make a detailed analysis this weekend.
Joined: 27 Sep 08 Posts: 847 Credit: 691,381,453 RAC: 102,769
I took my last 100 inconclusive results. Some trends:
- My hosts are all Windows; 87% of the inconclusives were paired with a Linux wingman.
- The most common wingman CPUs, each appearing 10 times: AMD FX-8300 (3.19.0-32-generic), E5-2630 v3 (2.6.32-642.15.1.el6.x86_64, Windows 8.1 & Windows 10) and E5-2699C v4 (4.1.12-94.3.5.el7uek.x86_64).
- 43% have a CPU time of less than 50 sec; 25% are above 1000 sec on the wingman.
- For the short ones there is a high probability that the Linux host fails in a much shorter time than the Windows host; for the long ones the times are much closer together.
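For anyone who wants to reproduce this kind of tally, here is a minimal sketch. It assumes the inconclusive task pairs have already been copied from the results pages into a hypothetical CSV with columns `wingman_os`, `wingman_cpu` and `wingman_cpu_time_s`; the column names, the file name and the 50 s / 1000 s buckets come from the post above and from my own assumptions, not from any project tool.

```python
# Sketch: summarise inconclusive pairings from a hand-made CSV export.
# Assumed columns: wingman_os, wingman_cpu, wingman_cpu_time_s
import csv
from collections import Counter

def summarise(path: str) -> None:
    os_counts = Counter()
    cpu_counts = Counter()
    short = long_ = total = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            os_counts[row["wingman_os"]] += 1
            cpu_counts[row["wingman_cpu"]] += 1
            t = float(row["wingman_cpu_time_s"])
            if t < 50:
                short += 1
            elif t > 1000:
                long_ += 1
    if not total:
        print("no rows found")
        return
    print(f"{total} inconclusive results")
    for os_name, n in os_counts.most_common():
        print(f"  {n:3d} ({100 * n / total:.0f}%) wingman OS: {os_name}")
    print("most common wingman CPUs:", cpu_counts.most_common(3))
    print(f"wingman CPU time < 50 s: {100 * short / total:.0f}%, "
          f"> 1000 s: {100 * long_ / total:.0f}%")

# summarise("inconclusives.csv")  # hypothetical export, not a project feature
```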
Joined: 24 Oct 04 Posts: 1173 Credit: 54,785,561 RAC: 15,033
Right now I am at:
23 Valid
22 Validation inconclusive
179 Validation pending
Joined: 11 Feb 13 Posts: 22 Credit: 20,728,480 RAC: 31
Yes, I've noticed my inconclusives rising. Currently sitting at:
Validation Pending - 236
Validation Inconclusive - 168
Valid - down from 100 yesterday to currently 83
https://lhcathome.cern.ch/lhcathome/results.php?userid=250933
Joined: 13 Jul 05 Posts: 133 Credit: 162,641 RAC: 0
Any light at the end of the tunnel yet? Situation here is as follows:
Host 10414945
Total WU 513
In progress 77
Pending 80
Inconclusive 190
Valid 166
Invalid 0
Error 0
Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0
The number to look at is the number of Invalid results, rather than Inconclusives. If you have zero Invalids then there is nothing wrong with your host. You will need to wait a little longer for the resent job to be returned, but this is similar to having a job Pending from a slower wingman, or from someone with a larger cache of work than their machine can return in reasonable time.

Eric has been working hard at isolating rogue hosts and optimising the validation process. Better to get an inconclusive result and a short delay than to have a bad result falsely validated as good.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
I am now waiting for management/support to take action. Eric.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Thanks Ray, "short" delay is too long for me, but I am hoping for some management/support action tomorrow latest. Eric. The number to look at is the number of Invalid results rather than Inconclusives. If you have zero Invalids then there is nothing wrong with your host. You will need to wait a little longer for the resent job to be returned but this is just similar to having a job Pending from a slower wingman or someone with a larger cache of work than their machine can return in reasonable time. |
Joined: 27 Sep 08 Posts: 847 Credit: 691,381,453 RAC: 102,769
Things are looking much better:
Error 0.02%
Invalid 0.05%
Valid 43.12%
Inconclusive 25.58%
Pending 31.23%
Looks like the re-runs are coming through and a 2nd wingman validates.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Thanks Toby, and I can also report that overnight "No consensus" results have dropped 20,516 -> 16,660 -> 12,281 (last 24 hours) and 445,068 -> 444,151 -> 443,080 (last 7 days). Still some way to go, but patience is called for. I should also finally, after 3 years!, be able to make a proper analysis of the real invalid results and of other problems.
Joined: 6 Mar 12 Posts: 7 Credit: 3,130,996 RAC: 0
What's wrong with my statistics?
#https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=70891665
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10483219&offset=80&show_names=0&state=3&appid=
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10415523&offset=0&show_names=0&state=3&appid=
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10482091&offset=160&show_names=0&state=3&appid=
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10450260&offset=20&show_names=0&state=3&appid=
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Will look at this soonest. Very interesting and useful. Eric.
Joined: 27 Sep 08 Posts: 847 Credit: 691,381,453 RAC: 102,769
Most are waiting for a 3rd host to re-run the WU and validate (or not). For 70891665, no one could agree, so it's marked as can't validate since the maximum number of tries is 5. In this case it's probably a bad WU, as the probability of 4 different computers all being bad is very low. Eric could comment on how a WU could be bad from a fundamental perspective.
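As a back-of-the-envelope illustration of why a bad WU is the more plausible explanation when several replicas fail, one can compare the two hypotheses under an assumed per-host failure rate. The 1% figure below is purely illustrative, not a measured project statistic.

```python
# Illustrative only: probability that 4 independent hosts all return a bad
# result, assuming each host fails independently with probability p.
# p = 0.01 is an arbitrary assumption, not a project measurement.
p = 0.01
hosts = 4
p_all_hosts_bad = p ** hosts
print(f"P(all {hosts} hosts bad) = {p_all_hosts_bad:.0e}")  # 1e-08
# So if 4 replicas of the same WU all fail or disagree, the work unit
# (or its input) is by far the more likely culprit.
```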
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Well, this is the BIG problem. We had a three-hour discussion today, but not everyone, if anyone, agrees with me. We are seeing too many bad/null results and mixed-up results, but still getting a lot done. I think this is a good thing in the long term, BUT we must identify the source of these bad/null results. I myself suspect corrupted input and/or output, with good reason.

Sadly, I guess that after max tries the WU will be abandoned, even though the work unit itself is fine. So no credits for these, and the CERN user will have to re-submit. Nonetheless, I now see a few cases with 3 runs, one rejected and two good results. I also see 553,615 valid results in the last 7 days. Great, but not good enough.

I plead my case: I am not responsible for these errors, nor for all the transient errors we have seen. (I did stop/start the validator on 24th June, and problems started around 03:00 on 25th.) The sixtrack_validator has been much improved since then by my colleagues. A side effect is the problem you, and I, and others have seen. If we are lucky and you don't get a 3rd bad result, all will be well. The number of inconclusive "No consensus" results is decreasing, but too slowly. I think we must identify the source of these bad/null/invalid results. (Just my opinion.) The much improved validator at least allows us to clearly identify the bad tasks. (Atlas has the same or a very similar problem; not sure they know that!)

I have NEVER found a fundamental problem with a Work Unit. They may exist, but the CERN users are responsible people who check their initial conditions before submitting to BOINC/LHC@home. At least the number of inconclusives is down to 5,939 for the last 24 hours and, while still at 406,573 for the last 7 days, is also decreasing. This is in fact good news, I think, as we are down by almost 40,000. Patience, which I am sadly lacking, is required all round. We live in hope and are working hard to sort this out. Eric.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Please see my reply: http://lhcathomeclassic.cern.ch/sixtrack/forum_thread.php?id=4306&postid=31311 Is that a hyperlink???? I have the same problem myself with some wzero/jtbb2cm1 cases. Eric.
Joined: 6 Mar 12 Posts: 7 Credit: 3,130,996 RAC: 0
Eric, thank you so much for the huge amount of work you do for all of us. I did not expect that this simple question would cause such a heated debate.

db4m2 (wherein: db - [D]istributed [B]oinc, 4 - [F]or, m - [M]e, 2 - [T]o) errors: Too many total results
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=72453472

Today is Friday. Need to change db - [D]istributed [B]oinc to db - [D]ouble [B]eer.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
I have finally understood what you and my colleagues have been trying to tell me! We are in the unfortunate situation where we are unable to distinguish between an infrastructure failure and a case where all particles are lost before completing (typically) a thousand turns. I am a bit slow sometimes, and I am obsessed by the unexplained failures, of which there are currently too many. To try and be clear, if I can:

SixTrack has three phases: pre-processing, tracking and post-processing. Under the present setup, and at the request of a colleague more than ten years ago, a SixTrack run on LHC@home returns only one file, fort.10, as a result from the post-processing phase. This was done in order to reduce network traffic and the load on the server. fort.10, the result, has one very long line for each particle, 60 double-precision floating-point numbers per line, normally 60 lines, around 40 kilobytes.

IF we have a bad set of initial conditions (very unlikely, but possible) leading to a pre-processing failure,
OR, more likely, all particles are lost in tracking before completing typically a thousand turns,
OR we never perform post-processing for some other reason,
THEN SixTrack stops and returns an empty fort.10 result.
ENDIF

BUT infrastructure failures or run-time errors may also produce a null/empty result file. The sixtrack_validator now rejects such null results but clearly identifies that it has done so. We cannot distinguish between a genuine "all particles lost" and some other failure, like a run-time crash, segmentation violation, etc. These cases will now never be validated, and after 5 attempts the Work Unit will be scrapped. That is the bad news. Random infrastructure failures should normally be rejected and invalidated, as we expect 2 other valid results. The relatively good news is that since all particles are lost, little CPU time, real time, or credit is wasted.

PHEW.... My colleagues are currently trying to test a new SixTrack version which will allow us to clearly distinguish the reason for a null/empty result file; then our problem will be solved, and we should in particular be able to identify the source of empty results which are not due to all particles being lost.

Summary: no action is required by you. A new SixTrack will be released as soon as possible. Eric.
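To make the failure mode concrete, here is a rough sketch of the kind of file check Eric's description implies: it only distinguishes an empty fort.10 from one matching the layout described above (one line of 60 double-precision values per particle). It is not the actual sixtrack_validator logic, just an illustration, under those stated assumptions, of why an empty file is ambiguous on its own.

```python
# Sketch of a fort.10 sanity check based on the format described above:
# one line per particle, 60 floating-point values per line, ~60 lines.
# It cannot tell a genuine "all particles lost" run from a crash that also
# left an empty file -- which is exactly the ambiguity discussed here.
import os

def classify_fort10(path: str) -> str:
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        # Ambiguous case: physics outcome or infrastructure failure.
        return "empty: all particles lost OR infrastructure/runtime failure"
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            values = line.split()
            if len(values) != 60:
                return f"malformed: line {lineno} has {len(values)} values, expected 60"
            for v in values:
                try:
                    # Tolerate Fortran-style 'D' exponents, if present.
                    float(v.replace("D", "E").replace("d", "e"))
                except ValueError:
                    return f"malformed: non-numeric value {v!r} on line {lineno}"
    return "looks like a well-formed result file"

# print(classify_fort10("fort.10"))
```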
Joined: 15 Jun 08 Posts: 2532 Credit: 253,722,201 RAC: 34,439
Is it possible to do the following workaround? IF