New Version v300.05
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1835 Credit: 120,089,505 RAC: 61,291 |
I wrote: I tried this on one of my several machines so far, and indeed I got the message "GUI RPC error: no such result - Operation failed: Error -1" - twice the same text. I didn't quite understand what you meant by saying "For the versions not present BOINC will report an error like: GUI RPC error: no such result" - please enlighten me :-) Again my question to the experts here: is there a way, by a script or something else, by which a Sherpa task can be aborted right after start? I would definitely need something like this. I am not able to check back every few hours, at different times, on several machines, whether a Sherpa is running. Further: I guess while starting this .bat, BOINC must be closed, correct? At least that's what I did. |
Send message Joined: 14 Jan 10 Posts: 1431 Credit: 9,594,942 RAC: 7,433 |
I tried this on one of my several machines so far, and indeed I got the message "GUI RPC error: no such result - Operation failed: Error -1" - twice the same text. I didn't quite understand what you meant by saying "For the versions not present BOINC will report an error like: GUI RPC error: no such result" - please enlighten me :-)
Mostly the BOINC task name ends with _0. For a resent task it ends with _1 or even _2. I want to catch at least 3 possibilities, although only 1 can be true, hence the 2 error messages. You can suppress the error messages by using my revised script:

:START
@echo off
For /F "Delims=" %%g in ('findStr/ilmc:" sherpa " *.run 2^>nul') do (
  "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_0 abort 2>NUL
  "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_1 abort 2>NUL
  "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_2 abort 2>NUL
  echo "%date% %time%: Theory_%%~ng (sherpa job) aborted"
  timeout /t 1 /NOBREAK >NUL
  "C:\Program Files\BOINC\boinccmd" --project https://lhcathome.cern.ch/lhcathome update
)
timeout /t 120 /NOBREAK >NUL
goto START
exit

Replace "Program Files" with "Programme".
Further: I guess while starting this .bat, BOINC must be closed, correct? At least that's what I did.
Not needed at all. But if you start this batch job before LHC has sent any Theory tasks, a Sherpa task could still get started, so first have some other LHC tasks, or tasks from other projects, running that occupy the cores. |
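The suffix logic described above (one `.run` file, three possible task names, only one of which exists) can be sketched in Python. This is only an illustration of the naming scheme, not a replacement for the batch script: it builds the `boinccmd` argument lists but does not execute them, and the `.run` file name used in the demo is made up.

```python
from pathlib import PureWindowsPath

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome"
BOINCCMD = r"C:\Program Files\BOINC\boinccmd"  # install path as used in the script above


def candidate_task_names(run_file, max_suffix=2):
    """Return Theory_<stem>_0 .. Theory_<stem>_<max_suffix> for a .run file name."""
    stem = PureWindowsPath(run_file).stem  # plays the role of %%~ng in the batch script
    return [f"Theory_{stem}_{n}" for n in range(max_suffix + 1)]


def abort_commands(run_file):
    """Build one boinccmd argument list per candidate task name (nothing is executed)."""
    return [
        [BOINCCMD, "--task", PROJECT_URL, name, "abort"]
        for name in candidate_task_names(run_file)
    ]


if __name__ == "__main__":
    # "2279-754146-32.run" is an invented file name, for illustration only.
    for cmd in abort_commands("2279-754146-32.run"):
        print(" ".join(cmd))
```

Since only one of the three names can belong to a real task, the other two calls are the ones that produce the "no such result" errors mentioned in the post.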
Send message Joined: 18 Dec 15 Posts: 1835 Credit: 120,089,505 RAC: 61,291 |
Replace "Program Files" with "Programme". That's what I did anyway. |
Send message Joined: 14 Jan 10 Posts: 1431 Credit: 9,594,942 RAC: 7,433 |
Ray Murray wrote: I like to give most Sherpas the benefit of the doubt, . . . . . Of all 67604 run-parameter combinations, 2141 use the sherpa generator. 613 of them never had a successful result (no events processed). Those 613 are part of the 4929 out of the 67604 possible combinations that never had a result with processed events. 3599 are pythia8 combinations with version 8.301, which I mentioned in the thread "Only errors running Pythia8 with version 8.301". They run only briefly and end with error code 1. The faulty sherpas, however, seem to run endlessly. I put them in a list together with pythia8 8.301 (as long as it is not fixed) and 716 other unsuccessful combinations. When a new BOINC task arrives, I scan the runspec of that task against that list. On a match, I abort the task before it ever starts, so no time is wasted. All sherpas not on that list have a chance to finish successfully. |
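The "scan the runspec against a list" step could look roughly like the Python sketch below. The post does not show the actual list or the runspec format, so the entries here are invented placeholders and the matching rule (exact match after normalising whitespace and case) is an assumption.

```python
def normalise(runspec):
    """Collapse whitespace and lowercase so formatting differences do not matter."""
    return " ".join(runspec.split()).lower()


def load_blocklist(lines):
    """Build a set of normalised entries, skipping blank lines and '#' comments."""
    entries = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            entries.add(normalise(line))
    return entries


def should_abort(runspec, blocklist):
    """True when the task's runspec is on the list of known-bad combinations."""
    return normalise(runspec) in blocklist


if __name__ == "__main__":
    # Both entries are invented placeholders, not real LHC@home runspecs.
    blocklist = load_blocklist([
        "# known-bad combinations",
        "pp jets 7000 sherpa 1.4.5 default",
        "pp zinclusive 13000 pythia8 8.301 tune-a",
    ])
    print(should_abort("pp  jets 7000 Sherpa 1.4.5 default", blocklist))
    print(should_abort("pp jets 7000 herwig++ 2.7.1 default", blocklist))
```

A set lookup keeps the check cheap even with several thousand entries, such as the 4929 failing combinations mentioned in the post.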
Send message Joined: 13 Jul 05 Posts: 169 Credit: 15,000,737 RAC: 0 |
what I am having now on console_2 with a Sherpa that's been running for 1 day 10 hours: No idea - us volunteers just get to stare at the log files and guess... I think I've seen this sort of warning message (ISR out of bounds, inaccurate rotation, etc.) from "successful" tasks, just as I occasionally see similar messages from Pythia, Herwig and so on. It's hard to be sure, as successful tasks delete their log files, so the sampling is very biased! I'm still deciding based on whether the time prediction is consistent: those lines should still be appearing in the logs. For example, task 263982585 is presently spewing a stream of "METS_Scale_Setter::SetScales(): Failed to determine \mu." but I'm letting it run as the time left looks reasonable: "Event 10000 ( 4h 31m 53s elapsed / 20h 23m 29s left ) -> ETA: Wed Feb 19 02:49". Note that the tasks have a series of internal phases, each with a separate time prediction, so I'm not sure what will happen at 3am tomorrow, whether the task finishes or just starts another phase, and I won't be staying up late to find out! (edit: Thinking about it, another danger is probably that the volume of warning messages makes the log file grow too large, so the task gets killed for using too much disk space rather than for a failure within the code itself). |
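For anyone who wants to read that time prediction from a log automatically, here is a small Python sketch that parses a progress line of the form quoted above. The regular expression is written against that single example line; the `d` (days) unit is an assumption, since the quoted line only shows hours, minutes and seconds.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}  # "d" is assumed, not seen in the quote
PROGRESS_RE = re.compile(
    r"Event\s+(\d+)\s+\(\s*(.+?)\s+elapsed\s*/\s*(.+?)\s+left\s*\)"
)


def duration_to_seconds(text):
    """Convert a string such as '4h 31m 53s' to a number of seconds."""
    return sum(
        int(number) * UNIT_SECONDS[unit]
        for number, unit in re.findall(r"(\d+)\s*([dhms])", text)
    )


def parse_progress(line):
    """Return (event, elapsed_seconds, left_seconds), or None if the line does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    event, elapsed, left = match.groups()
    return int(event), duration_to_seconds(elapsed), duration_to_seconds(left)


if __name__ == "__main__":
    line = "Event 10000 ( 4h 31m 53s elapsed / 20h 23m 29s left ) -> ETA: Wed Feb 19 02:49"
    event, elapsed, left = parse_progress(line)
    print(f"event {event}: predicted total {(elapsed + left) / 3600:.1f} h")
```

For the quoted line this gives 16313 s elapsed and 73409 s left, i.e. a predicted total of about 24.9 hours for that phase.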
Send message Joined: 28 Sep 04 Posts: 737 Credit: 50,051,110 RAC: 33,076 |
I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems. |
Send message Joined: 14 Jan 10 Posts: 1431 Credit: 9,594,942 RAC: 7,433 |
I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems. That makes me think of the 'Big Bang', with no one there to abort the failed experiment ;) BTW: When you have Theory running in the default setup, a task will be aborted after 100 hours of elapsed time. |
Send message Joined: 18 Dec 15 Posts: 1835 Credit: 120,089,505 RAC: 61,291 |
I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems. That's what I strongly doubt. I've never seen this so far :-( |
Send message Joined: 14 Jan 10 Posts: 1431 Credit: 9,594,942 RAC: 7,433 |
(edit: Thinking about it, another danger is probably the volume of the warning messages making the log file grow too large and thus the task getting killed for using too much disk space, rather than for a failure within the code itself). In your results you can see that the disk usage is normally up to ~2 GB per task. A task will be aborted when its disk usage exceeds 7.5 GB. |
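To see how close a running task is to that limit, one could sum the file sizes in its working directory, as in the Python sketch below. Two things here are assumptions: that the 7.5 GB limit is counted in decimal gigabytes, and which directory to measure (the path in the demo is a hypothetical BOINC slot directory, not one named in this thread).

```python
import os

DISK_LIMIT_BYTES = int(7.5 * 1000**3)  # 7.5 GB from the post; decimal units are assumed


def directory_size(path):
    """Sum the sizes of all regular files below path, in bytes (0 if path is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue  # do not count link targets
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def usage_fraction(used_bytes, limit=DISK_LIMIT_BYTES):
    """Fraction of the limit in use; 1.0 means the limit has been reached."""
    return used_bytes / limit


if __name__ == "__main__":
    slot_dir = r"C:\ProgramData\BOINC\slots\0"  # hypothetical path, adjust to your setup
    used = directory_size(slot_dir)
    print(f"{used / 1000**3:.2f} GB used, {usage_fraction(used):.0%} of the limit")
```

With the typical ~2 GB per task mentioned above, the fraction comes out at roughly 27% of the limit.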
Send message Joined: 13 Jul 05 Posts: 169 Credit: 15,000,737 RAC: 0 |
I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems. I think that the problem with the loooong-runners is that they would - in principle - eventually finish correctly, but not within a timescale compatible with volunteer computing: the task deadline is only 10 days, and eventually the tasks get purged from the DB, after which any returned results are useless. Tasks that run for 100+ days aren't viable. Of course, if Sherpa's internal predictions are reliable, then maybe it should self-destruct when the predicted runtime exceeds some threshold. I've seen predicted runtimes of more than 6000 days = 16 years - that's well past the lifetime of the OS and pushing the lifetime of the hardware for most of us! |
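The "self-destruct over some threshold" idea can be written down as a simple rule. The Python sketch below uses the two limits quoted in this thread (100 hours of elapsed time in the default setup, 10-day deadline); the rule itself and the function names are only an illustration, not something Sherpa or BOINC implements.

```python
RUNTIME_LIMIT_H = 100      # default Theory runtime limit quoted in this thread
DEADLINE_H = 10 * 24       # 10-day task deadline quoted in this thread


def days_to_years(days):
    """Convert days to years, using 365.25 days per year."""
    return days / 365.25


def worth_keeping(elapsed_h, left_h):
    """True if the predicted total runtime fits within both limits."""
    limit_h = min(RUNTIME_LIMIT_H, DEADLINE_H)
    return elapsed_h + left_h <= limit_h


if __name__ == "__main__":
    # 4.5 h elapsed with 20.4 h left: fits easily.
    print(worth_keeping(4.5, 20.4))
    # 34 h elapsed with a prediction of 6000 days left: hopeless.
    print(worth_keeping(34, 6000 * 24))
    print(f"6000 days is about {days_to_years(6000):.1f} years")
```

Because the tasks have several internal phases, each with its own prediction, a rule like this would only tell you about the current phase.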
Send message Joined: 15 Jun 08 Posts: 2556 Credit: 256,065,002 RAC: 85,771 |
I've seen predicted runtimes of more than 6000 days = 16 years... Maybe that's just a hint that we should buy more powerful hardware. ;-D BTW: I had a Sherpa task with a predicted runtime of nearly 20000 days (and increasing) last week. (I don't want to start a race. I guess it depends on when you look into the log files.) |
Send message Joined: 8 Aug 11 Posts: 5 Credit: 2,612,858 RAC: 0 |
I have one task that is currently at 93,94% after running for a little over 90 hours. It now shows in my task list as "Timed out -no response". With 6 more hours to completion, does this still add value to science or is this just wasted time and should I abandon this task? |
Send message Joined: 14 Jan 10 Posts: 1431 Credit: 9,594,942 RAC: 7,433 |
@John: Welcome to the forum! Let the task run. It will be killed after 100 hours of elapsed time if it has not finished by then. |
Send message Joined: 15 Jun 08 Posts: 2556 Credit: 256,065,002 RAC: 85,771 |
A result from a task that is marked as "Timed out -no response" in the server DB will not be used any more. Hence it makes no sense to let it run. "currently at 93,94% ... 6 more hours to completion": These values are shown by the BOINC client, I guess. They are not reliable for long-runners, since BOINC calculates them from averages of previous results. |
Send message Joined: 14 Jan 10 Posts: 1431 Credit: 9,594,942 RAC: 7,433 |
A result from a task that is marked as "Timed out -no response" in the server DB will not be used any more. Normal BOINC behaviour is: When you return a valid result after the deadline and before a wingman has returned a valid result of the resend, your result will be used and credit granted. When a wingman returns his resend after you and before his deadline, he will get credit too. |
Send message Joined: 8 Aug 11 Posts: 5 Credit: 2,612,858 RAC: 0 |
Normal BOINC behaviour is: ... I was the second to receive my workunit https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=131787969 but neither my predecessor nor I received credit. The workunit has now gone to a third person, who will likely chew on it for 100 hours. This looks like an awful waste of resources :-( |
Send message Joined: 8 Aug 11 Posts: 5 Credit: 2,612,858 RAC: 0 |
And so it happened. Task now ended as "Too many total results" so 3 people wasted 100 hours each on this task :-( |
©2025 CERN