Message boards : Number crunching : Validation Pending since 02.JUN.2019
jokerdm

Joined: 14 Jul 05
Posts: 3
Credit: 4,932,597
RAC: 0
Message 39343 - Posted: 12 Jul 2019, 15:23:36 UTC

Hello.
I have had about 380 jobs in "Validation pending" status since 02-JUN-2019.
The validator server reports that it has no jobs to process, so is this normal?
Greetings
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2413
Credit: 226,463,569
RAC: 132,354
Message 39344 - Posted: 12 Jul 2019, 15:41:09 UTC - in response to Message 39343.  

Since your computers are hidden, nobody can look into the logs to give you specific help.
You can make them visible to other volunteers on this page:
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39355 - Posted: 14 Jul 2019, 14:08:44 UTC - in response to Message 39343.  

Hello, jokerdm,
without access to the list of tasks crunched by your hosts, I cannot tell you much.
Please keep in mind, though, that with such a long backlog, if two tasks from the same WU cannot be validated, it will take quite some time before a third one is sent out and crunched. This might be the reason your tasks have not been validated yet.
Hope it helps,
Cheers,
A.
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,494,952
RAC: 2,243
Message 39356 - Posted: 14 Jul 2019, 19:03:44 UTC - in response to Message 39355.  

> Hope it helps,
> Cheers,
> A.
What would surely help is this: when a 'resend' (3rd, 4th wingman) is needed,
the specially created task should be placed at the front of the queue and not at the end.
This is normal BOINC practice, but not at LHC.

Example workunit https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=116328856

Task | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application
232058584 | 10409137 | 12 Jun 2019, 9:39:27 UTC | 14 Jun 2019, 2:06:26 UTC | Completed, waiting for validation | 7,291.11 | 7,283.58 | pending | SixTrack v502.05 (sse2) x86_64-pc-linux-gnu
232058585 | 10589358 | 12 Jun 2019, 9:31:41 UTC | 23 Jun 2019, 1:41:11 UTC | Not started by deadline - canceled | 0.00 | 0.00 | --- | SixTrack v502.05 (sse2) windows_x86_64
233682174 | 10452031 | 26 Jun 2019, 11:09:54 UTC | 3 Jul 2019, 6:49:49 UTC | Not started by deadline - canceled | 0.00 | 0.00 | --- | SixTrack v502.05 (avx) windows_intelx86
236591721 | --- | --- | --- | Unsent | --- | --- | --- | ---

That last Unsent task was created right after the second 'No reply' on 3 Jul 2019, 6:49:54 UTC, but it is not sent out as soon as a client requests new work.
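The difference between the two queueing policies can be illustrated with a small sketch. This is not BOINC server code: the queue, the `priority` value and the function names are hypothetical and chosen only for illustration. It shows that a resend given a higher priority is dispatched before the backlog, while a resend appended at the end waits behind every older unsent task.

```python
import heapq
from itertools import count

# Hypothetical model of a work queue; this is NOT the BOINC scheduler.
# Entries are (priority, arrival order, task id); lower tuples are sent first.
_seq = count()

def push(queue, task_id, is_resend, prioritise_resends):
    """Add a task; a resend jumps ahead only if the policy allows it."""
    priority = 0 if (is_resend and prioritise_resends) else 1
    heapq.heappush(queue, (priority, next(_seq), task_id))

def dispatch_order(prioritise_resends):
    """Return the order in which three backlog tasks and one resend go out."""
    queue = []
    for task_id in ("backlog-1", "backlog-2", "backlog-3"):
        push(queue, task_id, is_resend=False,
             prioritise_resends=prioritise_resends)
    # The resend is created last, like the Unsent task in the example above.
    push(queue, "resend-236591721", is_resend=True,
         prioritise_resends=prioritise_resends)
    return [heapq.heappop(queue)[2] for _ in range(len(queue))]
```

With `prioritise_resends=False` the resend comes out last; with `True` it comes out first, which is the behaviour asked for here.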
jokerdm

Joined: 14 Jul 05
Posts: 3
Credit: 4,932,597
RAC: 0
Message 39359 - Posted: 15 Jul 2019, 15:51:30 UTC - in response to Message 39344.  

Hi,
Thanks for the feedback. I changed it; they are now shown.
jokerdm

Joined: 14 Jul 05
Posts: 3
Credit: 4,932,597
RAC: 0
Message 39360 - Posted: 15 Jul 2019, 15:58:58 UTC - in response to Message 39356.  

Hello,

Mystery solved: that's what is happening to my "Validation pending" WUs.
The oldest one shows:
Task | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application
230325739 | 10529993 | 1 Jun 2019, 7:45:14 UTC | 2 Jun 2019, 8:24:22 UTC | Completed, waiting for validation | 53,767.04 | 53,673.03 | pending | SixTrack v502.05 (sse2) x86_64-pc-linux-gnu
230325740 | 10589829 | 1 Jun 2019, 7:55:56 UTC | 7 Jun 2019, 23:29:05 UTC | Not started by deadline - canceled | 0.00 | 0.00 | --- | SixTrack v502.05 (avx) windows_intelx86
231788770 | 10586430 | 10 Jun 2019, 16:47:58 UTC | 18 Jun 2019, 8:20:12 UTC | Timed out - no response | 0.00 | 0.00 | --- | SixTrack v502.05 (sse2) windows_intelx86
233525413 | 10555237 | 25 Jun 2019, 8:00:52 UTC | 2 Jul 2019, 23:33:06 UTC | Timed out - no response | 0.00 | 0.00 | --- | SixTrack v502.05 (avx) windows_intelx86
236581119 | --- | --- | --- | Unsent | --- | --- | --- | ---

So I just had bad luck with the other users who process the same WU.

Thank you very much!

Greetings!
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2413
Credit: 226,463,569
RAC: 132,354
Message 39361 - Posted: 15 Jul 2019, 16:15:10 UTC - in response to Message 39359.  

> Hi,
> Thanks for the feedback. I changed it; they are now shown.

OK. Now your computers are visible.
This gives the following picture:


Regarding SixTrack:
There's nothing to complain about.
You attached lots of cores and got lots of tasks.
Your error rate is close to 0.
Since each task needs a 2nd valid result to confirm yours, just sit back and wait until another computer reports that 2nd result.
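The waiting follows from the quorum rule: a result is only marked valid once a second, matching result for the same workunit has been reported. A minimal sketch, assuming a quorum of 2 as described in this thread; the function name, the status strings and the plain equality comparison are illustrative assumptions, not the actual SixTrack validator.

```python
def workunit_status(results, quorum=2):
    """Return a status for each reported result of one workunit.

    `results` maps a task id to its reported output, or to None if the
    task errored, timed out or has not been returned yet.
    """
    reported = {tid: out for tid, out in results.items() if out is not None}

    # Group task ids by identical output (a stand-in for the real comparison).
    groups = {}
    for tid, out in reported.items():
        groups.setdefault(out, []).append(tid)

    # The largest group of agreeing results decides the outcome.
    agreeing = max(groups.values(), key=len, default=[])
    if len(agreeing) >= quorum:
        return {tid: ("valid" if tid in agreeing else "invalid")
                for tid in reported}

    # Not enough agreeing results yet: everything reported stays pending.
    return {tid: "pending" for tid in reported}
```

A single returned result whose wingmen timed out stays "pending" under this model, however correct it is, until one more matching result arrives.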


Regarding ATLAS/Theory native:
Both require CVMFS (ATLAS also Singularity) to be locally installed on your computers.
Otherwise you will get nothing but errors.
See these threads for help:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4840
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4971
Agus

Joined: 27 Jan 16
Posts: 2
Credit: 1,007,223
RAC: 0
Message 39464 - Posted: 30 Jul 2019, 6:32:56 UTC - in response to Message 39343.  

> Hello.
> I have had about 380 jobs in "Validation pending" status since 02-JUN-2019.
> The validator server reports that it has no jobs to process, so is this normal?
> Greetings


Hi.

I have a similar issue. I have had about 40 jobs in "Validation pending" status since 22-JUL-2019, and the validator server reports the same: no jobs to validate.
I will stop crunching for LHC@Home until these 40 jobs are validated and I can check whether they are OK or in error.
Regards
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2413
Credit: 226,463,569
RAC: 132,354
Message 39467 - Posted: 30 Jul 2019, 7:53:23 UTC - in response to Message 39464.  

Your results are valid.
What they need is a 2nd valid result from another computer.
Once that computer successfully reports the same task, your result will change its status to "valid".
Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39469 - Posted: 30 Jul 2019, 14:37:31 UTC - in response to Message 39356.  

Hi,
thanks to computezrmle for the correct replies.

Concerning the comment by Crystal Pellet:

> What would surely help is this: when a 'resend' (3rd, 4th wingman) is needed,
> the specially created task should be placed at the front of the queue and not at the end.
> This is normal BOINC practice, but not at LHC.

This would make life a lot simpler for SixTrack users. Let's see what the IT experts say.
Cheers,
A.
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 39470 - Posted: 30 Jul 2019, 16:01:50 UTC
Last modified: 30 Jul 2019, 16:41:19 UTC

The "prioritising of resends" question has been with us forever. It can really only be addressed with a change to the BOINC server code, but since BOINC itself is administered by volunteers, there seems to be little enthusiasm to fiddle with it. The only option currently available is to resend to "reliable" hosts, under the Accelerating Resends section.
BOINC in its current configuration will always put resends at the back of the "current" queue, so another option would be to release work in smaller batches and allow the queue to "almost" run dry before releasing the next batch. Resends would then go to the end of that first batch and should have higher priority than subsequent batches. Obviously that would mean more manual intervention by the staff, who are probably busy doing other stuff.

Thinking out loud:
Rather than releasing 500,000 WUs all together, could a script be set up to release, say, 100,000, monitor the queue until it gets down to, say, 100, then release a further 100,000, and so on?
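A rough sketch of such a release loop. The figures 500,000, 100,000 and 100 are the ones floated in this post; `unsent_count` and `release_batch` are hypothetical hooks supplied by the caller, since this thread does not say how work is actually submitted or monitored at LHC@home.

```python
import time

def release_in_batches(total_wus, batch_size, low_water_mark,
                       unsent_count, release_batch, poll_seconds=0):
    """Release `total_wus` workunits in chunks of `batch_size`.

    A new chunk is released only when the number of unsent tasks,
    as reported by the caller-supplied `unsent_count()`, has dropped
    to `low_water_mark` or below. `release_batch(n)` is a
    caller-supplied hook that submits n workunits.
    Returns the number of batches released.
    """
    released = 0
    batches = 0
    while released < total_wus:
        if unsent_count() <= low_water_mark:
            n = min(batch_size, total_wus - released)
            release_batch(n)
            released += n
            batches += 1
        else:
            time.sleep(poll_seconds)  # wait before checking the queue again
    return batches
```

Called as `release_in_batches(500_000, 100_000, 100, unsent_count, release_batch, poll_seconds=600)`, this would release five batches, each one only after the queue has nearly run dry, so resends created during one batch are queued ahead of all later batches.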
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 463
Message 39482 - Posted: 2 Aug 2019, 9:22:31 UTC - in response to Message 39470.  

> The "prioritising of resends" question has been with us forever.
It's been less of an issue here, though, as until recently SixTrack work came in discrete batches lasting barely a fortnight, giving the system a chance to catch up.

I'm running less SixTrack ATM, so I'm not sure what the overall proportion of inconclusives is, but I still have a bunch of WUs over a week old waiting for re-sends after others' tasks got lost.
This is more important now, as chaining million-turn jobs together to make a 10^7 (or more) turn calculation will get badly bogged down if you start getting six-week pauses between steps!
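To put rough numbers on this, here is a minimal sketch. It assumes the chain is strictly sequential and that every hand-over between steps incurs the same delay; the six-week figure is the worst case mentioned in this post, not a measured average.

```python
def chain_idle_weeks(total_turns, turns_per_job, pause_weeks):
    """Return (number of steps, total idle weeks) for a sequential job chain."""
    steps = -(-total_turns // turns_per_job)  # ceiling division
    pauses = max(steps - 1, 0)                # one pause between consecutive steps
    return steps, pauses * pause_weeks

# A 10^7-turn study built from million-turn jobs with six-week pauses:
print(chain_idle_weeks(10**7, 10**6, 6))  # (10, 54)
```

Under these assumptions, ten chained steps mean nine hand-overs, so about 54 weeks would be spent just waiting for resends.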
