Message boards : Sixtrack Application : Inconclusive, valid/invalid results
Yeti
Volunteer moderator

Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 31318 - Posted: 7 Jul 2017, 12:00:01 UTC - in response to Message 31317.  
Last modified: 7 Jul 2017, 12:00:16 UTC

These are exactly my thoughts about this.

Is it possible to do the following workaround?
IF

we have a bad set of initial conditions (very unlikely, but possible) leading to a pre-processing failure
Insert an error code in the fort.10

OR

....



Supporting BOINC, a great concept !
ID: 31318 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31319 - Posted: 7 Jul 2017, 17:15:28 UTC - in response to Message 31317.  

Yes, it is being done in the new SixTrack under test.
This will be an enormous help in identifying problems and
in validating correctly.

HOWEVER, as I have said many times to deaf ears, this is NOT the
cause of all the "transient" errors which created our current problems.

Still, No Consensus is down towards 330,000 now and we have over 700,000
validated. Eric


Is it possible to do the following workaround?
IF

we have a bad set of initial conditions (very unlikely, but possible) leading to a pre-processing failure
Insert an error code in the fort.10

OR

more likely all particles are lost in tracking before completing typically a thousand turns
Insert an error code in the fort.10

OR

we never perform post-processing for some other reason
Insert an error code in the fort.10

THEN

SixTrack stops and returns an empty fort.10 result.
Now SixTrack stops and returns fort.10 with error code(s).

ENDIF

BUT

Infrastructure failures or run time errors may also produce a null empty result file.
Now they can be identified.

The sixtrack_validator now rejects such null results but clearly identifies
it has done so.

The validator (or a separate script) strips the error code lines if there are valid result lines.
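
The workaround quoted above can be sketched roughly as follows. This is only an illustration, not the actual sixtrack_validator: the thread does not say what an error-code line in fort.10 looks like, so the `ERROR_PREFIX` marker and the function name `classify_fort10` are assumptions made up for this example.

```python
# Hypothetical marker for an error-code line; the real fort.10 format
# for error codes is not described in this thread.
ERROR_PREFIX = "ERROR"


def classify_fort10(text):
    """Classify the contents of a fort.10 result file.

    Returns a (status, result_lines) tuple where status is one of:
      "null"  - no lines at all (infrastructure failure or run time error)
      "error" - only error-code lines (SixTrack stopped deliberately)
      "valid" - at least one result line; error-code lines are stripped
    """
    # Ignore blank lines so a file of pure whitespace counts as empty.
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return "null", []
    # Keep only the lines that are not error-code lines.
    results = [
        line for line in lines
        if not line.lstrip().startswith(ERROR_PREFIX)
    ]
    if not results:
        return "error", []
    return "valid", results
```

With this split, a "null" file can be rejected and reported as such, while an "error" file tells the validator that SixTrack itself decided to stop.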


ID: 31319 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31321 - Posted: 8 Jul 2017, 4:27:22 UTC

There are three fundamental problems that
we are addressing as a priority.

1. Why were tasks not being distributed to the volunteers
during the Pentathlon, and why were too many tasks subsequently
distributed to a volunteer?

2. Where did the 399,219 transient failures come from between
2017-06-25 02:35:53.4879 and 1st July? In addition there are
399,233 matching "couldn't open" messages for the result files.
The transients have disappeared since 4th July and we have only 150
(new message) try_open failures during this period.

3. Why the inadequate configuration of upload/download directories
for ALL the LHC@home subprojects? This is a possible cause of the transient
failures and makes debugging impossibly slow.

Several fixes and improvements have been made to the sixtrack_validator.
The new SixTrack, which resolves issues around empty result files
and includes support for AVX and MacOS amongst many other things,
is under test.

Anyway, Inconclusive/No Consensus are down to
20,734 last 24 hours and to 322,565 in the last seven days.

Yet more patience is required.
Eric.
ID: 31321 · Report as offensive     Reply Quote
Michael H.W. Weber

Joined: 18 Sep 04
Posts: 30
Credit: 5,100,929
RAC: 0
Message 31322 - Posted: 8 Jul 2017, 5:04:03 UTC

Is this topic related to the fact that the server has cancelled 8 tasks on 3 of my well-performing machines yesterday?

Michael.
ID: 31322 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,093,506
RAC: 103,308
Message 31323 - Posted: 8 Jul 2017, 7:52:50 UTC

I also had eight tasks cancelled by the server yesterday.
I saw that the quorum was two, so the third task was obsolete.
ID: 31323 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 31324 - Posted: 8 Jul 2017, 8:16:56 UTC - in response to Message 31323.  

I also had eight tasks cancelled by the server yesterday.
I saw that the quorum was two, so the third task was obsolete.

I also had one cancelled, but the strange thing is that the original two were returned on the 20th and 25th of June, while
the resend to me was sent on the 7th of July and cancelled a few hours later.

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=71160885

Was the SixTrack-validator so far behind?
ID: 31324 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31325 - Posted: 8 Jul 2017, 8:46:48 UTC - in response to Message 31324.  

With respect to cancelled tasks, I don't feel it is a problem.
We already had a validated result, so I suspect they were not needed.
There are plenty more to run. Also, I believe too many tasks were distributed
in response to the "WUs not being distributed" problem.
I really think it will all sort itself out now, but I suspect there may be some tasks
which will be sent 5 times and never validated.
In that case they will all be very short, though.
This will be fixed in the next SixTrack release.
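
The behaviour described here (a quorum of two, redundant copies cancelled once a result is validated, and at most five attempts) can be sketched as below. This is a toy model, not BOINC server code: the names `QUORUM`, `MAX_ATTEMPTS` and `next_action` are invented for the illustration, and only the values 2 and 5 are taken from this thread.

```python
QUORUM = 2        # matching results needed to validate (value from this thread)
MAX_ATTEMPTS = 5  # maximum number of tasks sent per work unit (from this thread)


def next_action(returned, in_progress):
    """Decide what a toy server does with one work unit.

    returned    - list of result values already reported
                  (None stands for a failed, aborted or empty result)
    in_progress - number of copies still running on volunteer hosts
    """
    good = [r for r in returned if r is not None]
    # Validated as soon as QUORUM returned results agree with each other.
    for candidate in set(good):
        if good.count(candidate) >= QUORUM:
            # Copies still running are no longer needed.
            return "validated, cancel %d outstanding" % in_progress
    if in_progress > 0:
        return "wait"
    if len(returned) >= MAX_ATTEMPTS:
        return "give up: too many attempts"
    return "send another task"
```

In this model a work unit with two matching results and one copy still out lands in the "cancel" branch, while five results that never agree end in "give up".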
Patience, patience, and I just pray that our transient errors are gone for good.
Indeed the sixtrack_validator is way behind due to the 300,000 or so
"transient" errors a couple of weeks ago.


I also had eight tasks cancelled by the server yesterday.
I saw that the quorum was two, so the third task was obsolete.

I also had one cancelled, but the strange thing is that the original two were returned on the 20th and 25th of June, while
the resend to me was sent on the 7th of July and cancelled a few hours later.

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=71160885

Was the SixTrack-validator so far behind?
ID: 31325 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31326 - Posted: 8 Jul 2017, 8:47:52 UTC - in response to Message 31322.  

Yes, but not a problem I hope. Eric

Is this topic related to the fact that the server has cancelled 8 tasks on 3 of my well-performing machines yesterday?

Michael.

ID: 31326 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,093,506
RAC: 103,308
Message 31328 - Posted: 8 Jul 2017, 8:57:09 UTC

Really good things need time to grow ;-).
ID: 31328 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31329 - Posted: 8 Jul 2017, 13:00:24 UTC - in response to Message 31328.  

Well, in this case I think the incubation period is a bit long! Eric
Really good things need time to grow ;-).

ID: 31329 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,924,227
RAC: 137,713
Message 31331 - Posted: 8 Jul 2017, 13:34:36 UTC - in response to Message 31329.  

Well, in this case I think the incubation period is a bit long! Eric
Really good things need time to grow ;-).

This can only result in one conclusion:
This "thing" will become really good.
;-)
ID: 31331 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,093,506
RAC: 103,308
Message 31332 - Posted: 8 Jul 2017, 18:42:36 UTC

Eric,

Is this possible: more than 40k SixTrack tasks with such a small number of successful tasks?

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10388131
ID: 31332 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31333 - Posted: 8 Jul 2017, 19:47:32 UTC - in response to Message 31332.  

Sadly I believe this is because of our over 300,000 infamous transient errors.
Looking at a couple of stderr I do not see any xml and we get nothing but an
empty result file which is rejected. However for at least one Work Unit I see

Task       Computer  Sent                       Time reported or deadline  Status                              Run time (s)  CPU time (s)  Credit   Application
148257294  10486162  24 Jun 2017, 11:52:54 UTC  24 Jun 2017, 11:58:00 UTC  Validate error                      3.16          2.21          ---      SixTrack v451.07 (sse2) x86_64-pc-linux-gnu
148257295  10484663  24 Jun 2017, 11:55:00 UTC  24 Jun 2017, 11:57:42 UTC  Completed, validation inconclusive  19.24         17.50         pending  SixTrack v451.07 (pni) x86_64-pc-linux-gnu
149752326  10138935  4 Jul 2017, 9:06:50 UTC    8 Jul 2017, 6:22:39 UTC    Completed, validation inconclusive  76,873.79     62,879.10     pending  SixTrack v451.07 (sse2) windows_x86_64
150542647  10388131  8 Jul 2017, 6:25:16 UTC    8 Jul 2017, 6:34:07 UTC    Completed, validation inconclusive  111.25        108.87        pending  SixTrack v451.07 (sse2) i686-pc-linux-gnu
150542796  10476113  8 Jul 2017, 6:35:06 UTC    15 Jul 2017, 22:07:20 UTC  In progress                         ---           ---           ---      SixTrack v451.07 (pni) windows_x86_64

The result with some 60,000 seconds should eventually be validated, if we are lucky,
but we may exhaust the maximum of 5 attempts!

More likely we are running into another Task Management problem giving us a null result,
but from a science standpoint that is much much better than validating two duds!

Also if a volunteer loses patience and aborts a task I think it counts as 1 of our 5......

This is all pretty horrible, but it is in the lap of the gods, or at least in the hands of
my colleagues. Even worse the WWW response times are so bad that I can't easily
investigate further. I don't believe any particular host or hosts are responsible, although
we surely have some "bad" hosts.......

I might be able in the future to analyse the past couple of weeks and even compensate long running results which were wrongly invalidated.

Best I can do for now.

Down to 5,462 Inconclusive for the last 24 hours, but we still have 312,734 for the last seven days to be cleared out. Eric.

Eric,

Is this possible: more than 40k SixTrack tasks with such a small number of successful tasks?

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10388131

ID: 31333 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31334 - Posted: 8 Jul 2017, 19:57:40 UTC - in response to Message 31332.  

P.S This host appears to have 40,353 results!!!
I'll sleep on it. I shall look at the host again tomorrow as there seem
to be a huge number of tasks with very very short run times.........


Eric,

Is this possible: more than 40k SixTrack tasks with such a small number of successful tasks?

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10388131

ID: 31334 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31337 - Posted: 9 Jul 2017, 15:26:36 UTC - in response to Message 31334.  

Well, in spite of many other things today, I have somewhat progressed
but I am not at the end by any means.

I have indeed found 38,477 tasks/results for this host. They "all" appear
to be incredibly short (impossibly so, to my mind) yet also appear to have returned a
"success" result which cannot be validated!!!

I continue checking and I am trying to find these results in the
BOINC server upload directory, or at least some of them.
I don't see how this can happen really...but I'll find out.
The tasks/results do not appear to be part of a particular workspace/study.
The host appears not too special (Linux). We shall see.
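
One way to look for hosts like this is to count, per host, the results that report success but ran for only a few seconds. The sketch below assumes each result is available as a `(host_id, run_time_seconds, outcome)` tuple; that layout, the function name `suspicious_hosts` and both thresholds are hypothetical and are not taken from the BOINC database schema.

```python
from collections import Counter


def suspicious_hosts(results, max_run_time=5.0, min_count=1000):
    """Return {host_id: count} for hosts with many very short 'success' results.

    results      - iterable of (host_id, run_time_seconds, outcome) tuples
    max_run_time - run times at or below this many seconds count as "very short"
    min_count    - report a host only if it has at least this many such results
    """
    # Count short, nominally successful results per host.
    short = Counter(
        host
        for host, run_time, outcome in results
        if outcome == "success" and run_time <= max_run_time
    )
    return {host: n for host, n in short.items() if n >= min_count}
```

A host that shows up here is not necessarily "bad"; the count only says where to look first.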

In the meantime Inconclusive/No Consensus is down to 3,517 for the last 24 hours
(down 10% in a few hours) and 219,481 for the last 7 days.
Slow but progressing.
If I can find this it will be a breakthrough.

Very useful feedback.

The 98,116 "Aborted by user" results don't help, but aborting is a user privilege.
It may also avoid wasting your host's CPU time. Eric.

P.S This host appears to have 40,353 results!!!
I'll sleep on it. I shall look at the host again tomorrow as there seem
to be a huge number of tasks with very very short run times.........


Eric,

is this possible - more than 40k sixtrack-tasks with such a small number of successful tasks:

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10388131

ID: 31337 · Report as offensive     Reply Quote
AlphaC

Joined: 6 Sep 13
Posts: 5
Credit: 1,286,288
RAC: 0
Message 31338 - Posted: 10 Jul 2017, 0:55:59 UTC - in response to Message 31308.  

I ran a few WUs and they ran fine on Ubuntu-based Linux Mint. However, it was in a VirtualBox guest OS and not on bare metal.

I did notice some WUs had 5 people crunching them...
ID: 31338 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31339 - Posted: 10 Jul 2017, 3:23:23 UTC - in response to Message 31338.  

I ran a few WUs and they ran fine on Ubuntu-based Linux Mint. However, it was in a VirtualBox guest OS and not on bare metal.

Interesting... another hint. Note that Linux Kernel 4.8.0 is a necessary but not
sufficient condition to produce a SixTrack run time crash.

I did notice some WUs had 5 people crunching them...

Yes; this is a clear indication of a problem. Any chance of naming a few?
I am missing something here... I must be able to find them myself from
some field in the WU table in the database.
Thanks. Eric.
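
Finding such work units amounts to grouping results by work unit and keeping those that have reached five. A minimal sketch, assuming the results can be exported as `(result_id, workunit_id)` pairs (the pair layout and the name `workunits_with_many_results` are assumptions for this example, not the real table definition):

```python
from collections import Counter


def workunits_with_many_results(results, threshold=5):
    """Return sorted work unit IDs that have at least `threshold` results.

    results - iterable of (result_id, workunit_id) pairs
    """
    # Count how many results belong to each work unit.
    per_wu = Counter(wu_id for _result_id, wu_id in results)
    return sorted(wu_id for wu_id, n in per_wu.items() if n >= threshold)
```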
ID: 31339 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,093,506
RAC: 103,308
Message 31340 - Posted: 10 Jul 2017, 6:44:20 UTC - in response to Message 31338.  

AlphaC,

please can you open your Computer-List.
ID: 31340 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31341 - Posted: 10 Jul 2017, 9:20:23 UTC - in response to Message 31340.  

AlphaC,

please can you open your Computer-List.


This "opening" would possibly be a great help.
(No need to be shy about being a member of Overclock.net :-)
Eric.
ID: 31341 · Report as offensive     Reply Quote
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 31342 - Posted: 10 Jul 2017, 11:07:01 UTC - in response to Message 31341.  

AlphaC,

please can you open your Computer-List.


This "opening" would possibly be a great help.
(No need to be shy about being a member of Overclock.net :-)
Eric.


I don't think it matters; I found the info in our database.

I DID find a big source of Inconclusive/Invalid results which is NOT Linux Kernel 4.8.0
nor AlphaC.
Very encouraging. Will post again soonest. Eric.
ID: 31342 · Report as offensive     Reply Quote
©2024 CERN