Message boards : Theory Application : New Version v300.05
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,216,801
RAC: 58,006
Message 41569 - Posted: 14 Feb 2020, 17:05:37 UTC - in response to Message 41547.

I wrote:
Again my question to the experts here: is there a way, by a script or something else, to abort a Sherpa task right after it starts? I would definitely need something like this. I am not able to check back every few hours, at different times, on several machines, whether a Sherpa is running.

Crystal Pellet wrote:
I'm not an expert, but for Windows you could use this, if you have BOINC in the default installation path.

:START

For /F "Delims=" %%g In ('FindStr/ILMC:" sherpa " *.run 2^>Nul') Do (
"C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_0 abort
"C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_1 abort
"C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_2 abort
)

timeout /t 60 /NOBREAK >NUL
goto START

exit

Put the AbortSherpa.bat in the lhc project folder and start it from there. It will loop forever.
It tries to abort versions _0, _1 and _2 of a Sherpa task, because I didn't want to complicate the script by working out whether the task was a resend.
For the versions not present BOINC will report an error like: GUI RPC error: no such result
I tried this on one of my several machines so far, and indeed I got the message "GUI RPC error: no such result - Operation failed: Error -1" twice, with the same text both times. I didn't quite understand what you meant by saying "For the versions not present BOINC will report an error like: GUI RPC error: no such result" - please enlighten me :-)
Further: I guess while starting this .bat, BOINC must be closed, correct? At least that's what I did.
ID: 41569 ·
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1121
Credit: 6,900,903
RAC: 1,202
Message 41572 - Posted: 14 Feb 2020, 17:45:09 UTC - in response to Message 41569.

I tried this on one of my several machines so far, and indeed I got the message "GUI RPC error: no such result - Operation failed: Error -1" twice, with the same text both times. I didn't quite understand what you meant by saying "For the versions not present BOINC will report an error like: GUI RPC error: no such result" - please enlighten me :-)
Mostly the BOINC-task name ends with _0. With a resend task it ends with _1 or even _2. I want to catch at least 3 possibilities, although only 1 can be true, hence the 2 error messages.
You can suppress the error messages by using my revised script:
@echo off
rem Run this from the LHC project folder. It loops forever, checking every 120 seconds.

:START
rem findstr lists every .run file that contains " sherpa " (case-insensitive).
rem A task name ends in _0, or in _1 or _2 for a resend, so all three are tried.
rem Only one can exist; the errors for the other two are sent to NUL.
For /F "Delims=" %%g in ('findStr/ilmc:" sherpa " *.run 2^>nul') do (
"C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_0 abort 2>NUL
"C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_1 abort 2>NUL
"C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_2 abort 2>NUL
echo "%date% %time%: Theory_%%~ng (sherpa job) aborted"
timeout /t 1 /NOBREAK >NUL
"C:\Program Files\BOINC\boinccmd" --project https://lhcathome.cern.ch/lhcathome update
)

timeout /t 120 /NOBREAK >NUL
goto START
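
For readers who would rather not use a batch file, the same selection logic can be sketched in Python. This is only an illustration of what the script above does, not a tested replacement: the function name sherpa_abort_commands is made up, and the boinccmd path is the default Windows installation path assumed in the batch file. The function only builds the command lines; it does not run them.

```python
from pathlib import Path

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome"
# Default Windows installation path, as assumed in the batch file.
BOINCCMD = r"C:\Program Files\BOINC\boinccmd"


def sherpa_abort_commands(project_dir, suffixes=(0, 1, 2)):
    """Return one boinccmd argument list per candidate task name.

    A .run file counts as a Sherpa job when it contains " sherpa "
    (case-insensitive), mirroring the findstr call in the batch file.
    """
    commands = []
    for run_file in sorted(Path(project_dir).glob("*.run")):
        text = run_file.read_text(errors="ignore").lower()
        if " sherpa " not in text:
            continue
        # The task may be a resend, so try _0, _1 and _2 like the batch file.
        for n in suffixes:
            task = f"Theory_{run_file.stem}_{n}"
            commands.append([BOINCCMD, "--task", PROJECT_URL, task, "abort"])
    return commands
```

Each returned list could be handed to subprocess.run to actually abort the task. As with the batch file, the suffixes that do not exist would make boinccmd report "no such result".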


Replace Program Files with Programme.

Further: I guess while starting this .bat, BOINC must be closed, correct? At least that's what I did.
Not needed at all. But if you start this batch job before BOINC has received any Theory tasks, a newly arrived Sherpa task could still start running before the script aborts it, so first have some other LHC tasks, or tasks from other projects, occupying the cores.
ID: 41572 ·
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,216,801
RAC: 58,006
Message 41576 - Posted: 14 Feb 2020, 18:19:02 UTC - in response to Message 41572.

Replace Program Files with Programme.
that's what I did anyway
ID: 41576 ·
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1121
Credit: 6,900,903
RAC: 1,202
Message 41612 - Posted: 17 Feb 2020, 8:57:09 UTC - in response to Message 41555.

Ray Murray wrote:
I like to give most Sherpas the benefit of the doubt, . . . . .
Of all 67604 run-parameter combinations, 2141 use the sherpa generator. 613 of those have never had a successful result (no events processed).
Those 613 are part of the 4929 out of the 67604 possible combinations that have never had a result with processed events.
3599 of them are pythia8 version 8.301, which I mentioned in the thread "Only errors running Pythia8 with version 8.301".
Those run only briefly and end with error code 1. The faulty sherpas, however, seem to run endlessly. I put them in a list together with pythia8 8.301 (as long as it is not fixed) and 716 other unsuccessful combinations.
When a new BOINC task arrives, I scan the runspec of that task against that list. If it matches, I abort the task before it ever starts, so no time is wasted.
All sherpas not on that list have a chance to finish successfully.
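
A minimal sketch of that kind of check, in Python. This is not Crystal Pellet's actual tool. The function names are invented, and the runspec strings used in the example are made up for illustration; the real runspec format may differ.

```python
def normalize(runspec):
    """Collapse whitespace and case so that equal runspecs compare equal."""
    return " ".join(runspec.lower().split())


def load_deny_list(lines):
    """Build a set of normalized runspecs, skipping blank lines and # comments."""
    return {
        normalize(line)
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


def is_known_bad(runspec, deny_list):
    """True when the runspec of a new task is on the list of known failures."""
    return normalize(runspec) in deny_list


# Hypothetical entries; a real list would hold the unsuccessful combinations.
deny = load_deny_list([
    "# combinations that never produced events",
    "pp jets 7000 25 - sherpa 1.4.3 default",
])
print(is_known_bad("pp  jets 7000 25 - Sherpa 1.4.3 default", deny))
```

A set gives a constant-time lookup, so even a list of several thousand combinations costs next to nothing to check when a task arrives.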
ID: 41612 ·
Henry Nebrensky

Joined: 13 Jul 05
Posts: 160
Credit: 14,665,538
RAC: 0
Message 41619 - Posted: 18 Feb 2020, 10:00:28 UTC - in response to Message 41557.

what I am having now on console_2 with a Sherpa that's been running for 1 day 10 hours:

Y out of bounds
ISR_Handler::makeISR(..): s' out of bounds

so I guess this task is faulty, too, isn't it?

No idea - us volunteers just get to stare at the log files and guess... I think I've seen these sort of warning messages (ISR out of bounds, inaccurate rotation, etc.) from "successful" tasks, just as I occasionally see similar messages from Pythia, Herwig and so on. It's hard to be sure as successful tasks delete their log files so the sampling is very biased! I'm still deciding based on whether the time prediction is consistent: those lines should still be appearing in the logs.

For example, task 263982585 is presently spewing a stream of
METS_Scale_Setter::SetScales(): Failed to determine \mu.
but I'm letting it run as the time left looks to be reasonable:
Event 10000 ( 4h 31m 53s elapsed / 20h 23m 29s left ) -> ETA: Wed Feb 19 02:49

Note that the tasks have a series of internal phases, each with a separate time prediction, so I'm not sure what should happen at 3am tomorrow, whether the task finishes or just starts another phase, and I won't be staying up late to find out!
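
For anyone who wants to read these progress lines with a script instead of by eye, here is a small Python sketch. It is based only on the one line quoted above. The "d" unit for days is an assumption (the quoted line only shows h, m and s), and the function names are made up.

```python
import re

# Seconds per unit; "d" is an assumption, the quoted line shows only h, m, s.
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

PROGRESS = re.compile(
    r"Event\s+(\d+)\s*\(\s*(.*?)\s+elapsed\s*/\s*(.*?)\s+left\s*\)"
)


def to_seconds(duration):
    """Convert a duration such as '4h 31m 53s' to seconds."""
    parts = re.findall(r"(\d+)\s*([dhms])", duration)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


def parse_progress(line):
    """Return (event, elapsed_seconds, left_seconds), or None if no match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return int(match.group(1)), to_seconds(match.group(2)), to_seconds(match.group(3))


line = "Event 10000 ( 4h 31m 53s elapsed / 20h 23m 29s left ) -> ETA: Wed Feb 19 02:49"
print(parse_progress(line))  # (10000, 16313, 73409)
```

Since each internal phase has its own prediction, a single reading only says something about the current phase, not about the task as a whole.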

(edit: Thinking about it, another danger is probably the volume of the warning messages making the log file grow too large and thus the task getting killed for using too much disk space, rather than for a failure within the code itself).
ID: 41619 ·
Harri Liljeroos

Joined: 28 Sep 04
Posts: 589
Credit: 33,835,104
RAC: 19,998
Message 41621 - Posted: 18 Feb 2020, 11:16:13 UTC - in response to Message 41619.

I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems.
ID: 41621 ·
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1121
Credit: 6,900,903
RAC: 1,202
Message 41622 - Posted: 18 Feb 2020, 11:59:26 UTC - in response to Message 41621.

I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems.
That makes me think of the 'Big Bang', with no one there to abort the failed experiment ;)

btw: When you have Theory running in the default setup, a task will be aborted after 100 hours elapsed time.
ID: 41622 ·
Erich56

Joined: 18 Dec 15
Posts: 1516
Credit: 46,216,801
RAC: 58,006
Message 41623 - Posted: 18 Feb 2020, 11:59:41 UTC - in response to Message 41621.

I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems.
that's what I strongly doubt. I've never seen this so far :-(
ID: 41623 ·
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1121
Credit: 6,900,903
RAC: 1,202
Message 41624 - Posted: 18 Feb 2020, 12:02:10 UTC - in response to Message 41619.

(edit: Thinking about it, another danger is probably the volume of the warning messages making the log file grow too large and thus the task getting killed for using too much disk space, rather than for a failure within the code itself).
In your results you can see that the disk usage is normally up to ~2 GB per task.
A task will be aborted when its disk usage exceeds 7.5 GB.
ID: 41624 ·
Henry Nebrensky

Joined: 13 Jul 05
Posts: 160
Credit: 14,665,538
RAC: 0
Message 41625 - Posted: 18 Feb 2020, 12:06:16 UTC - in response to Message 41621.

I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems.
I think that the problem with the loooong-runners is that they would - in principle - eventually finish correctly, but not within a timescale compatible with volunteer computing: the task deadline is only 10 days, and eventually the tasks get purged from the DB after which any returned results are useless. Tasks that run for 100+ days aren't viable.

Of course, if Sherpa's internal predictions are reliable, then maybe it should self-destruct when the predicted runtime is over some threshold. I've seen predicted runtimes of more than 6000 days = 16 years - that's well past the lifetime of the OS and pushing the lifetime of the hardware for most of us!
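
Such a threshold rule is easy to write down. The Python sketch below is not anything Sherpa or the Theory app actually does; it only illustrates the idea, using the 100-hour elapsed-time limit mentioned earlier in this thread as the default. The function name is made up, and it assumes the elapsed and remaining times are already available in seconds.

```python
HOUR = 3600
DAY = 24 * HOUR


def exceeds_limit(elapsed_s, left_s, limit_s=100 * HOUR):
    """True when the predicted total runtime (elapsed + left) is over the limit."""
    return elapsed_s + left_s > limit_s


# 4h 31m 53s elapsed and 20h 23m 29s left: well inside 100 hours.
print(exceeds_limit(16313, 73409))   # False

# A prediction of 6000 days left is hopeless from the first reading.
print(exceeds_limit(0, 6000 * DAY))  # True
```

The rule is only as good as the prediction behind it. A task that moves through several phases, each with its own estimate, could pass this check early on and still end up running far too long.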
ID: 41625 ·
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 2017
Credit: 147,819,172
RAC: 116,007
Message 41626 - Posted: 18 Feb 2020, 12:28:15 UTC - in response to Message 41625.

I've seen predicted runtimes of more than 6000 days = 16 years...

Maybe that's just a hint that we should buy more powerful hardware.
;-D

BTW:
Had a sherpa task with a predicted runtime of nearly 20000 days (increasing) last week.
(Don't want to start a race. I guess it depends on when you look into the logfiles.)
ID: 41626 ·
S@NL - John van Gorsel

Joined: 8 Aug 11
Posts: 3
Credit: 2,433,581
RAC: 222
Message 41627 - Posted: 18 Feb 2020, 16:27:55 UTC

I have one task that is currently at 93,94% after running for a little over 90 hours. It now shows in my task list as "Timed out -no response". With 6 more hours to completion, does this still add value to science or is this just wasted time and should I abandon this task?
ID: 41627 ·
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1121
Credit: 6,900,903
RAC: 1,202
Message 41628 - Posted: 18 Feb 2020, 18:32:58 UTC - in response to Message 41627.

@John: Welcome to the forum!

Let the task run. It will be killed after 100 hours of elapsed time if it has not finished by then.
ID: 41628 ·
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 2017
Credit: 147,819,172
RAC: 116,007
Message 41629 - Posted: 19 Feb 2020, 7:18:52 UTC - in response to Message 41627.

A result from a task that is marked as "Timed out -no response" in the server DB will not be used any more.
Hence it makes no sense to let it run.

currently at 93,94% ... 6 more hours to completion

These are shown by the BOINC client, I guess.
They are not reliable for long-runners, since BOINC calculates them from averages of previous results.
ID: 41629 ·
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1121
Credit: 6,900,903
RAC: 1,202
Message 41631 - Posted: 19 Feb 2020, 8:32:32 UTC - in response to Message 41629.

A result from a task that is marked as "Timed out -no response" in the server DB will not be used any more.
Hence it makes no sense to let it run.
Normal BOINC behaviour is:

When you return a valid result after the deadline and before a wingman has returned a valid result of the resend, your result will be used and credit granted.
When a wingman returns his resend after you and before his deadline, he will get credit too.
ID: 41631 ·
S@NL - John van Gorsel

Joined: 8 Aug 11
Posts: 3
Credit: 2,433,581
RAC: 222
Message 41642 - Posted: 19 Feb 2020, 16:14:43 UTC - in response to Message 41631.

Normal BOINC behaviour is:

When you return a valid result after the deadline and before a wingman has returned a valid result of the resend, your result will be used and credit granted.
When a wingman returns his resend after you and before his deadline, he will get credit too.

I was the second to receive my workunit https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=131787969 but neither my predecessor nor I received credit. The workunit now went to a third person who will likely chew on it for 100 hours. This looks like an awful waste of resources :-(
ID: 41642 ·
S@NL - John van Gorsel

Joined: 8 Aug 11
Posts: 3
Credit: 2,433,581
RAC: 222
Message 41836 - Posted: 6 Mar 2020, 17:29:27 UTC - in response to Message 41642.

The workunit now went to a third person who will likely chew on it for 100 hours. This looks like an awful waste of resources :-(

And so it happened. Task now ended as "Too many total results" so 3 people wasted 100 hours each on this task :-(
ID: 41836 ·