Message boards : Number crunching : Stuck validation inconclusive

Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 801
Credit: 649,526,018
RAC: 244,918
Message 26534 - Posted: 23 May 2014, 23:59:25 UTC

I've got a few WUs stuck at "validation inconclusive":

http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17490395
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17445763
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17444083
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17441053
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17440832
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17419979
http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17283010

Not sure what's happening. The server should send out additional tasks, but they seem to just flip to unsent and then get stuck?
ID: 26534
Ananas

Joined: 17 Jul 05
Posts: 102
Credit: 542,016
RAC: 0
Message 26535 - Posted: 24 May 2014, 1:11:28 UTC - in response to Message 26534.  
Last modified: 24 May 2014, 1:15:05 UTC

Not a problem on your host; your wingman has already been reported in this posting.

At some point your workunits will be delivered a third time and when those come back, your results should validate.

The server-side scheduler seems to place redeliveries at the end of the queue, and with the currently quite long-running workunits it might take some time until they have moved to the top of the queue.
ID: 26535
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 801
Credit: 649,526,018
RAC: 244,918
Message 26536 - Posted: 24 May 2014, 3:08:28 UTC

I didn't notice it was the same wingman in all of them!

It seems like the resends should be at the front?
ID: 26536
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26537 - Posted: 24 May 2014, 5:57:36 UTC - in response to Message 26535.  

Thanks Ananas; that doesn't seem too good, and it probably also explains why we tend to have a tail of incomplete cases on our side. I'll see if we can't improve that. Anyway, the queue was empty this morning, so hopefully this should be sorted out. More work is coming. Eric.
ID: 26537
Ananas

Joined: 17 Jul 05
Posts: 102
Credit: 542,016
RAC: 0
Message 26539 - Posted: 24 May 2014, 8:20:19 UTC - in response to Message 26537.  
Last modified: 24 May 2014, 8:36:40 UTC

I received a few redelivered ones a few hours ago. They can easily be recognized because most have a shorter deadline (download errors seem not to reduce the deadline) and none of them end with _0 or _1.
ID: 26539
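As a rough illustration of the naming rule described above, here is a minimal Python sketch. It assumes task names end in an underscore followed by a replica number, and that the first issue consists of two copies (_0 and _1). The function name and the sample task names are made up for illustration.

```python
import re


def is_resend(task_name: str, initial_replication: int = 2) -> bool:
    """Return True if the task name looks like a redelivered copy.

    Assumption: the name ends in "_<n>", where n counts copies from 0,
    and the first `initial_replication` copies are the original issue.
    """
    match = re.search(r"_(\d+)$", task_name)
    if match is None:
        raise ValueError(f"no replica suffix found in {task_name!r}")
    return int(match.group(1)) >= initial_replication


# Hypothetical task names, for illustration only.
for name in ("example_wu_0", "example_wu_1", "example_wu_2"):
    print(name, "resend" if is_resend(name) else "first issue")
```

The shorter deadline mentioned in the post is not checked here, since it does not apply to every redelivery.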
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 801
Credit: 649,526,018
RAC: 244,918
Message 26542 - Posted: 24 May 2014, 12:59:36 UTC

I think there is a server setting to put retries first in the queue. Richard might know.
ID: 26542
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 26567 - Posted: 26 May 2014, 22:45:25 UTC - in response to Message 26542.  

I think there is a server setting to put re-tries 1st in the queue. Richard might know.

I don't think there's a queue-order setting: my guess is that queue-ordering is primarily governed by primary-key indexing on ResultID - there's not a lot you can do about that.
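If that guess is right, the effect can be shown with a toy model. The sketch below is not BOINC server code; it only assumes that every new result row gets the next higher ID and that unsent results are handed out in ascending ID order. All class and function names are invented for the example.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Result:
    result_id: int
    workunit: str
    is_resend: bool = False


class ToyQueue:
    """Toy model: unsent results are dispatched in ascending ID order."""

    def __init__(self) -> None:
        self._ids = count(1)  # stands in for an auto-increment primary key
        self._unsent: list[Result] = []

    def create(self, workunit: str, is_resend: bool = False) -> Result:
        result = Result(next(self._ids), workunit, is_resend)
        self._unsent.append(result)
        return result

    def dispatch_order(self) -> list[Result]:
        return sorted(self._unsent, key=lambda r: r.result_id)


def demo() -> list[tuple[str, bool]]:
    queue = ToyQueue()
    # Two initial copies of each workunit are created first.
    for wu in ("wu_a", "wu_b", "wu_c"):
        queue.create(wu)
        queue.create(wu)
    # A resend for wu_a is created later, so it gets the highest ID.
    queue.create("wu_a", is_resend=True)
    return [(r.workunit, r.is_resend) for r in queue.dispatch_order()]


print(demo()[-1])  # the resend is last in line
```

Under those assumptions a resend always waits behind everything created before it, which would match the behaviour seen in this thread.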

There is a selection of controls which might help to ensure that, once the resends at the end of the queue finally reach the front and start getting issued, they spend as little extra time 'in the wild' as possible.

Project Options - Accelerating retries

That includes sending the resend tasks to hosts with (historically) a fast turnround time - the meaning of 'fast' being configurable - and also reducing the deadline for return.
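The two ideas in that last sentence can be sketched in a few lines of Python. This is only an illustration of the logic, not the BOINC implementation: the parameter names `max_fast_turnaround_s` and `deadline_factor` are hypothetical and are not the real option names, which are listed on the project options page referred to above.

```python
from dataclasses import dataclass


@dataclass
class Host:
    host_id: int
    avg_turnaround_s: float  # historical average time to return a task


def plan_resend(
    hosts: list[Host],
    normal_deadline_s: float,
    max_fast_turnaround_s: float = 86_400.0,  # hypothetical: "fast" = within a day
    deadline_factor: float = 0.5,             # hypothetical: halve the deadline
) -> tuple[list[Host], float]:
    """Pick the hosts eligible for a resend and the reduced deadline."""
    fast_hosts = [h for h in hosts if h.avg_turnaround_s <= max_fast_turnaround_s]
    fast_hosts.sort(key=lambda h: h.avg_turnaround_s)  # quickest first
    return fast_hosts, normal_deadline_s * deadline_factor


# Made-up hosts and deadline, for illustration only.
hosts = [Host(1, 200_000.0), Host(2, 40_000.0), Host(3, 80_000.0)]
eligible, deadline = plan_resend(hosts, normal_deadline_s=600_000.0)
print([h.host_id for h in eligible], deadline)
```

Note that this only shortens the time a resend spends out on a host; it does nothing about how long the resend waits in the queue before being issued.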
ID: 26567
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 801
Credit: 649,526,018
RAC: 244,918
Message 26570 - Posted: 27 May 2014, 12:40:59 UTC

Got another couple of results that seem different; not the xxx504 host this time.

http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17592766

Not sure how CPU time can be > Run time??

This one is strange too:

http://lhcathomeclassic.cern.ch/sixtrack/workunit.php?wuid=17625043

The CPU times align but they don't validate?
ID: 26570
grumpy

Joined: 1 Sep 04
Posts: 57
Credit: 2,832,517
RAC: 95
Message 26571 - Posted: 27 May 2014, 12:42:25 UTC

How about 5952 and 257 inconclusive from these:

http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=10137504

http://lhcathomeclassic.cern.ch/sixtrack/results.php?hostid=10172067

ID: 26571
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 801
Credit: 649,526,018
RAC: 244,918
Message 26577 - Posted: 28 May 2014, 3:03:57 UTC - in response to Message 26570.  

They both went through today; a new wingman fixed things.
ID: 26577
Ray Murray
Volunteer moderator

Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 26614 - Posted: 3 Jul 2014, 18:43:30 UTC

Still problems on hostid=10137504 noted previously.
10 valid results but 2973 inconclusive and 2352 pending. Definitely something amiss with only "<core_client_version>7.2.28</core_client_version>" in the stderr.
Aqvario's other machines look OK; it's just this one, where almost all WUs last only a few seconds and don't produce a full stderr.
They've been here a long time but maybe don't pay attention to each machine.
ID: 26614
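For anyone wanting to spot this pattern across many results, here is a minimal Python sketch. It assumes you already have the stderr text of a result as a string (how you fetch it is up to you) and that a "bad" result contains nothing apart from the core_client_version tag quoted above. The function name is made up.

```python
import re

CLIENT_TAG = re.compile(r"<core_client_version>[^<]*</core_client_version>")


def has_only_client_version(stderr_text: str) -> bool:
    """True if the stderr holds the client version tag and nothing else."""
    if CLIENT_TAG.search(stderr_text) is None:
        return False
    remainder = CLIENT_TAG.sub("", stderr_text)
    return remainder.strip() == ""


sample = "<core_client_version>7.2.28</core_client_version>"
print(has_only_client_version(sample))
```

A stderr that also contains application output makes the function return False, so only the suspicious results are flagged.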



©2024 CERN