Message boards : Sixtrack Application : Inconclusive, valid/invalid results



Stick

Joined: 21 Aug 07
Posts: 46
Credit: 1,503,835
RAC: 0
Message 31188 - Posted: 29 Jun 2017, 20:42:20 UTC - in response to Message 31057.  

I thought an update of my inconclusives was warranted here. After the old validator was reimplemented, my inconclusives immediately dropped back to 5 from around 18. Since then, they have slowly climbed back to the current level of 19. And, as I stated below (relative to the new validator), the new group is different. That is, I am not only getting inconclusives when paired with x86_64-pc-linux-gnu hosts, but now also with some Windows hosts: see Validation inconclusive tasks for Stick

Don't know if this is good or bad news, but immediately after the validator change, my inconclusive count jumped from 6 to 11. And the new group is very different. Prior to the change, all 6 of my inconclusives were paired against tasks done by x86_64-pc-linux-gnu machines. Now, 4 out of the 5 new ones were pairings between my SixTrack v451.07 (sse2) windows_x86_64 tasks and a variety of machines running SixTrack v451.07 (pni) windows_x86_64.
ID: 31188

John Hunt
Joined: 13 Jul 05
Posts: 133
Credit: 162,641
RAC: 0
Message 31194 - Posted: 30 Jun 2017, 11:25:16 UTC
Last modified: 30 Jun 2017, 11:26:34 UTC

Inconclusives rising again?

Situation here is as follows:-
Host 10414945
Total WU 181
In progress 56
Pending 50
Inconclusive 38
Valid 37
Invalid 0
Error 0
ID: 31194

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 847
Credit: 691,381,453
RAC: 102,769
Message 31201 - Posted: 30 Jun 2017, 15:55:14 UTC

For all my hosts the numbers are:


Pending 66.03%
Inconclusive 15.31%
Valid 18.66%

It appears there is still a big backlog on the validator. I will make a detailed analysis this weekend.
ID: 31201

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 847
Credit: 691,381,453
RAC: 102,769
Message 31208 - Posted: 30 Jun 2017, 20:01:39 UTC

I took my last 100 inconclusive results. Some trends:

All my hosts run Windows; 87% of the inconclusives were paired with a Linux wingman.

The most common CPUs, each appearing 10 times: AMD FX-8300 (3.19.0-32-generic), E5-2630 v3 (2.6.32-642.15.1.el6.x86_64, Windows 8.1 & Windows 10) and E5-2699C v4 (4.1.12-94.3.5.el7uek.x86_64).

43% have a CPU time of less than 50 sec; 25% are above 1000 sec on the wingman.

For the short ones, there is a high probability that the Linux host fails in a much shorter time than the Windows host. For the long ones, the times are much closer together.
ID: 31208

Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1173
Credit: 54,785,584
RAC: 15,020
Message 31209 - Posted: 30 Jun 2017, 20:06:59 UTC

Right now I am at......

23 Valid
22 Validation inconclusive
179 Validation pending
ID: 31209

Chooka
Joined: 11 Feb 13
Posts: 22
Credit: 20,728,480
RAC: 31
Message 31213 - Posted: 1 Jul 2017, 0:58:37 UTC

Yes, I've noticed my inconclusives rising. Currently sitting at:

Validation Pending - 236
Validation Inconclusive - 168
Valid - down from 100 yesterday to currently 83.

https://lhcathome.cern.ch/lhcathome/results.php?userid=250933
ID: 31213

John Hunt
Joined: 13 Jul 05
Posts: 133
Credit: 162,641
RAC: 0
Message 31237 - Posted: 2 Jul 2017, 17:34:30 UTC

Any light at the end of the tunnel yet?

Situation here is as follows:-
Host 10414945
Total WU 513
In progress 77
Pending 80
Inconclusive 190
Valid 166
Invalid 0
Error 0
ID: 31237

Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 31238 - Posted: 2 Jul 2017, 17:55:58 UTC

The number to look at is the number of Invalid results rather than Inconclusives. If you have zero Invalids, then there is nothing wrong with your host. You will need to wait a little longer for the resent job to be returned, but this is similar to having a job Pending from a slower wingman, or from someone with a larger cache of work than their machine can return in a reasonable time.
Eric has been working hard at isolating rogue hosts and optimising the validation process. Better to get an inconclusive result and a short delay than to have a bad result falsely validated as good.
ID: 31238

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31240 - Posted: 2 Jul 2017, 20:00:47 UTC - in response to Message 31237.  

I am now waiting for management/support to take action. Eric.
ID: 31240

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31241 - Posted: 2 Jul 2017, 20:02:22 UTC - in response to Message 31238.  

Thanks Ray, a "short" delay is too long for me, but I am hoping for some management/support action by tomorrow at the latest. Eric.

ID: 31241

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 847
Credit: 691,381,453
RAC: 102,769
Message 31302 - Posted: 5 Jul 2017, 21:30:54 UTC
Last modified: 5 Jul 2017, 21:32:57 UTC

Things are looking much better:

Error 0.02%
Invalid 0.05%
Valid 43.12%
Inconclusive 25.58%
Pending 31.23%

Looks like the re-runs are coming through and a 2nd wingman validates.
ID: 31302

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31304 - Posted: 6 Jul 2017, 7:08:48 UTC - in response to Message 31302.  

Thanks Toby, and I can also report that overnight "No consensus"
have dropped 20,516 -> 16,660 -> 12,281 (last 24 hours)
and 445,068 -> 444,151 -> 443,080 (last 7 days). Still some way to
go, but patience is called for.

I should also finally, after 3 years!, be able to make a proper analysis
of the real invalid results and of other problems.

ID: 31304

Demis
Joined: 6 Mar 12
Posts: 7
Credit: 3,130,996
RAC: 0
Message 31305 - Posted: 6 Jul 2017, 9:13:31 UTC

What's wrong with my statistics?

#https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=70891665
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10483219&offset=80&show_names=0&state=3&appid=
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10415523&offset=0&show_names=0&state=3&appid=
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10482091&offset=160&show_names=0&state=3&appid=
#https://lhcathome.cern.ch/lhcathome/results.php?hostid=10450260&offset=20&show_names=0&state=3&appid=
ID: 31305

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31306 - Posted: 6 Jul 2017, 10:30:48 UTC - in response to Message 31305.  

Will look at this soonest. Very interesting and useful. Eric.

ID: 31306

Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 847
Credit: 691,381,453
RAC: 102,769
Message 31308 - Posted: 6 Jul 2017, 17:46:40 UTC - in response to Message 31305.  

Most are waiting for a 3rd host to re-run the WU and validate (or not).

For 70891665, no one could agree, so it's marked as "can't validate" since the maximum number of tries is 5. In this case it's probably a bad WU, as the probability of 4 different computers being bad is very low.

Eric could comment on how a WU could be bad from a fundamental perspective.
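The probability argument above can be made concrete. A minimal sketch, assuming (hypothetically) that each host produces a bad result independently with the same probability p, so that the chance of n hosts all being bad is p^n:

```python
def prob_all_hosts_bad(p: float, n: int) -> float:
    """Probability that n hosts ALL return a bad result, assuming each
    host fails independently with per-host error probability p.
    (Independence and the value of p are illustrative assumptions,
    not measured project figures.)"""
    return p ** n

# Even with a pessimistic 5% per-host error rate, four independently
# bad results are vanishingly unlikely (about 6.25e-06, ~1 in 160,000):
print(prob_all_hosts_bad(0.05, 4))
```

This is why repeated disagreement across several hosts points at the work unit (or its input) rather than at the volunteers' machines.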
ID: 31308

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31311 - Posted: 6 Jul 2017, 21:47:31 UTC - in response to Message 31308.  

Well, this is the BIG problem. We had a three-hour discussion today, but not
everyone, if anyone, agrees with me. We are seeing too many bad/null results and mixed-up
results, but still getting a lot done. I think this is a good thing in the long term, BUT we must identify the source of these bad/null results. I myself suspect corrupted input and/or
output, with good reason. Sadly, I guess after max tries the WU will be abandoned, even though the
work unit itself is fine. So no credits for these, and the CERN user will have to re-submit.

Nonetheless I now see a few cases with 3 runs, one rejected and two good results.
I also see 553,615 valid results in the last 7 days. Great, but not good enough.

I plead my case: I am not responsible for these errors, nor for all the transient errors we have seen. (I did stop/start the validator on 24th June and problems started around 03:00
on 25th.) The sixtrack_validator has been much improved since then by my colleagues.
A side effect is the problem you, and I, and others have seen. If we are lucky and you don't get a 3rd bad result, all will be well.

The number of inconclusive, "No consensus", results is decreasing, but too slowly. I think we must
identify the source of these bad/null/invalid results. (Just my opinion.) The much improved
validator at least allows us to clearly identify the bad tasks.
(ATLAS has the same or a very similar problem; I am not sure they know that!)

I have NEVER found a fundamental problem with a Work Unit. They may exist, but
the CERN users are responsible people who check their initial conditions before
submitting to BOINC/LHC@home.

At least the number of inconclusives is down to 5,939 for the last 24 hours; for the last 7 days it is still at 406,573, but also decreasing. This is in fact good news, I think, as we are down by almost 40,000. Patience, which I am sadly lacking, is required all round. We live in hope and are working hard to sort this out. Eric.

ID: 31311

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31312 - Posted: 6 Jul 2017, 21:49:34 UTC - in response to Message 31306.  

Please see my reply http://lhcathomeclassic.cern.ch/sixtrack/forum_thread.php?id=4306&postid=31311

Is that a hyperlink????

I have the same problem myself with some wzero/jtbb2cm1 cases.

ID: 31312

Demis
Joined: 6 Mar 12
Posts: 7
Credit: 3,130,996
RAC: 0
Message 31313 - Posted: 7 Jul 2017, 7:03:04 UTC - in response to Message 31312.  
Last modified: 7 Jul 2017, 7:40:21 UTC

Eric, thank you so much for your huge work you do for all of us.
Did not expect that this simple question would cause such heated debate.

db4m2

( wherein :
db - [D]istributed [B]oinc
4 - [F]or
m - [M]e
2 - [T]o )

It errored with "Too many total results":
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=72453472

Today is Friday.
Need to change
db - [D]istributed [B]oinc
to
db - [D]ouble [B]eer
ID: 31313

Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31315 - Posted: 7 Jul 2017, 11:01:11 UTC - in response to Message 31311.  

I have finally understood what you and my colleagues have been trying to tell me! We are in the unfortunate situation where we are unable to distinguish between an infrastructure failure and a case where all particles are lost before completing (typically) a thousand turns. I am a bit slow sometimes, and I am obsessed by the unexplained failures, of which there are currently too many.

To try and be clear, if I can:

SixTrack has three phases: pre-processing, tracking and post-processing.
Under the present setup, and at the request of a colleague more than ten
years ago, a SixTrack run on LHC@home returns only one file, fort.10, as
the result of the post-processing phase.
This was done to reduce network traffic and the load on the server.

fort.10, the result, has one very long line for each particle: 60 double-precision
floating-point numbers per line, normally 60 lines, around 40 kilobytes in total.

IF

we have a bad set of initial conditions (very unlikely, but possible) leading to a pre-processing failure

OR

more likely all particles are lost in tracking before completing typically a thousand turns

OR

we never perform post-processing for some other reason

THEN

SixTrack stops and returns an empty fort.10 result.

ENDIF

BUT

Infrastructure failures or run-time errors may also produce a null (empty) result file.

The sixtrack_validator now rejects such null results and clearly identifies
that it has done so.

We cannot distinguish between a genuine "all particles lost" and some other
failure, like a run-time crash, segmentation violation, etc.

These cases will now never be Validated, and after 5 attempts the Work Unit
will be scrapped. That is the bad news. Random infrastructure failures should normally be rejected and invalidated, as we expect 2 other Valid results.

The relatively good news is that since all particles are lost, little CPU
time, real time, or credit is wasted.

PHEW.... my colleagues are currently trying to test a new SixTrack version
which will allow us to clearly distinguish the reason for a null/empty
result file. That will solve our problem; in particular, we should be
able to identify the source of empty results which are not due to
all particles being lost.

Summary: no action is required by you. A new SixTrack will be released as
soon as possible. Eric.
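The ambiguity described above can be illustrated with a small sketch. This is hypothetical Python, not the actual sixtrack_validator; it assumes only the fort.10 layout given in the post (one line per particle, 60 double-precision numbers per line):

```python
def classify_fort10(text: str) -> str:
    """Classify a fort.10 result file by the format described above:
    one line per particle, 60 double-precision numbers per line.
    (Illustrative only -- not the project's validator code.)"""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        # Empty result: could mean "all particles lost" OR an
        # infrastructure/run-time failure -- indistinguishable here,
        # which is exactly the problem discussed in this thread.
        return "null"
    for ln in lines:
        fields = ln.split()
        if len(fields) != 60:
            return "malformed"
        try:
            [float(f) for f in fields]
        except ValueError:
            return "malformed"
    return "valid"
```

A check along these lines can reject a null result, but it cannot say why the file is empty; distinguishing "all particles lost" from an infrastructure failure needs extra information from SixTrack itself, which is what the new version is meant to provide.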
ID: 31315

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2532
Credit: 253,722,201
RAC: 34,439
Message 31317 - Posted: 7 Jul 2017, 11:49:54 UTC - in response to Message 31315.  

Is it possible to do the following workaround?
IF

> we have a bad set of initial conditions (very unlikely, but possible) leading to a pre-processing failure

Insert an error code in the fort.10.

OR

> more likely all particles are lost in tracking before completing typically a thousand turns

Insert an error code in the fort.10.

OR

> we never perform post-processing for some other reason

Insert an error code in the fort.10.

THEN

> SixTrack stops and returns an empty fort.10 result.

Now SixTrack stops and returns fort.10 with error code(s).

ENDIF

BUT

> Infrastructure failures or run time errors may also produce a null empty result file.

Now they can be identified.

> The sixtrack_validator now rejects such null results but clearly identifies it has done so.

The validator (or a separate script) strips the error-code lines if there are valid result lines.

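A minimal sketch of this proposed workaround, again hypothetical Python rather than project code, assuming error codes are written as lines beginning with a made-up sentinel `#ERR`:

```python
# Sketch of the proposal: SixTrack writes an error-code line into
# fort.10 instead of leaving it empty, and the validator strips those
# lines whenever real result lines are present.
ERROR_PREFIX = "#ERR"  # assumed sentinel, not an actual SixTrack convention

def split_fort10(text: str):
    """Separate result lines from error-code lines."""
    results, errors = [], []
    for ln in text.splitlines():
        if not ln.strip():
            continue
        (errors if ln.startswith(ERROR_PREFIX) else results).append(ln)
    return results, errors

def validator_view(text: str) -> str:
    """If there are valid result lines, strip the error codes;
    otherwise report the error codes so failures become identifiable."""
    results, errors = split_fort10(text)
    if results:
        return "\n".join(results)                 # error lines stripped
    if errors:
        return "rejected: " + "; ".join(errors)   # cause now visible
    return "rejected: empty file (cause unknown)"
```

With such tagging, an empty file and an "all particles lost" file are no longer identical, so infrastructure failures can be told apart from genuine physics outcomes.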
ID: 31317