Thread 'New Version v300.05'

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,097,100 RAC: 108,526	Message 41569 - Posted: 14 Feb 2020, 17:05:37 UTC - in response to Message 41547.
I wrote:
    Again my question to the experts here: is there a way, by a script or something else, by which a Sherpa task can be aborted right after start? I would definitely need something like this. I am not able to check back every few hours, at different times, on several machines, whether a Sherpa is running.
Crystal Pellet wrote:
    I'm not an expert, but for Windows you could use this, if you have BOINC in the default installation path. It will abort Sherpa-tasks even before the start when you have a few Theory-tasks 'Ready to Start' in your queue.
        :START
        For /F "Delims=" %%g In ('FindStr/ILMC:" sherpa " *.run 2^>Nul') Do (
        "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_0 abort
        "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_1 abort
        "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_2 abort
        )
        timeout /t 60 /NOBREAK >NUL
        goto START
        exit
    Put the AbortSherpa.bat in the lhc project folder and start it from there. It will loop forever. It tries to abort version 0, 1 and 2 of a Sherpa-task, because I don't want to make it too complicated by figuring out whether it was a resend. For the versions not present BOINC will report an error like: GUI RPC error: no such result
I tried this on one of my several machines so far, and indeed I got the message "GUI RPC error: no such result - Operation failed: Error -1" - twice the same text. I didn't quite understand what you meant by saying "For the versions not present BOINC will report an error like: GUI RPC error: no such result" - please enlighten me :-)
Further: I guess BOINC must be closed while starting this .bat, correct? At least that's what I did.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1491 Credit: 9,985,934 RAC: 993	Message 41572 - Posted: 14 Feb 2020, 17:45:09 UTC - in response to Message 41569.
Quote:
    I tried this on one of my several machines so far, and indeed I got the message "GUI RPC error: no such result - Operation failed: Error -1" - twice the same text. I didn't quite understand what you meant by saying "For the versions not present BOINC will report an error like: GUI RPC error: no such result" - please enlighten me :-)
Mostly the BOINC-task name ends with _0. With a resend task it ends with _1 or even _2. I want to catch at least 3 possibilities, although only 1 can be true, hence the 2 error messages. You can suppress the error messages by using my renewed script:
    :START
    @echo off
    For /F "Delims=" %%g in ('findStr/ilmc:" sherpa " *.run 2^>nul') do (
    "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_0 abort 2>NUL
    "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_1 abort 2>NUL
    "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_2 abort 2>NUL
    echo "%date% %time%: Theory_%%~ng (sherpa job) aborted"
    timeout /t 1 /NOBREAK >NUL
    "C:\Program Files\BOINC\boinccmd" --project https://lhcathome.cern.ch/lhcathome update
    )
    timeout /t 120 /NOBREAK >NUL
    goto START
    exit
Replace Program Files with Programme.
Quote:
    Further: I guess while starting this .bat, BOINC must be closed, correct? At least that's what I did.
Not needed at all, but when you are running this batch job before LHC gets Theory-tasks, it could end up with a starting sherpa task, so first have some other LHC-tasks running or tasks from other projects that occupy the cores.
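
Editor's sketch: for readers not on Windows, the selection logic of the batch script above could look roughly like this in Python. It is only an illustration under assumptions, not Crystal Pellet's method: the helper names find_sherpa_runs and abort_commands are made up here, and the boinccmd path and project URL are simply copied from the script. It finds the *.run files that contain " sherpa " and builds the three abort commands (suffix _0, _1, _2) for each.

```python
from pathlib import Path

# Assumptions copied from the batch script in the thread: default Windows
# install path of boinccmd and the LHC@home project URL.
BOINCCMD = r"C:\Program Files\BOINC\boinccmd"
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome"


def find_sherpa_runs(project_dir):
    """Return the stems of *.run files whose text contains ' sherpa ' (any case)."""
    stems = []
    for run_file in sorted(Path(project_dir).glob("*.run")):
        text = run_file.read_text(errors="ignore")
        if " sherpa " in text.lower():
            stems.append(run_file.stem)
    return stems


def abort_commands(stem, resends=(0, 1, 2)):
    """Build one boinccmd abort command per possible task-name suffix."""
    return [
        [BOINCCMD, "--task", PROJECT_URL, f"Theory_{stem}_{n}", "abort"]
        for n in resends
    ]
```

Each command list could then be passed to subprocess.run(cmd, stderr=subprocess.DEVNULL), which would hide the "no such result" messages for the suffixes that do not exist, much like the 2>NUL redirection in the script.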

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,097,100 RAC: 108,526	Message 41576 - Posted: 14 Feb 2020, 18:19:02 UTC - in response to Message 41572.
Quote:
    Replace Program Files with Programme.
That's what I did anyway.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1491 Credit: 9,985,934 RAC: 993	Message 41612 - Posted: 17 Feb 2020, 8:57:09 UTC - in response to Message 41555.
Ray Murray wrote:
    I like to give most Sherpas the benefit of the doubt, . . . . .
Of all 67604 run-parameter combinations, 2141 use the sherpa generator. 613 of them never had a successful result (no events processed). Those 613 are part of the 4929 out of the 67604 possible combinations that never had a result with processed events. 3599 are of the pythia8 kind with version 8.301, which I mentioned in the thread "Only errors running Pythia8 with version 8.301". They run for a short time and end with error code 1. The faulty sherpas, however, seem to run endlessly. I put them in a list together with pythia8 8.301 (as long as it is not fixed) and 716 other unsuccessful combinations. When a new BOINC-task arrives, I scan the runspec of that task against that list. When it matches, I abort the task before it ever starts, so no time is wasted. All sherpas not on that list have a chance to finish successfully.
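
Editor's sketch: the post does not show the list or the scanning code, so the following only illustrates the idea of checking a runspec against a list of known-bad combinations. Everything in it is an assumption: the runspec is treated as a line of whitespace-separated tokens, a list entry matches when its tokens appear next to each other in the runspec, and the names tokens, contains and should_abort are hypothetical.

```python
def tokens(runspec):
    """Lowercase a runspec line and split it into whitespace-separated tokens."""
    return tuple(runspec.lower().split())


def contains(seq, sub):
    """True if `sub` occurs as a contiguous run of tokens inside `seq`."""
    n = len(sub)
    return n > 0 and any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))


def should_abort(runspec, blocklist):
    """True if any blocklist entry occurs inside the runspec."""
    spec = tokens(runspec)
    return any(contains(spec, tokens(entry)) for entry in blocklist)
```

With this matching rule a short entry such as "pythia8 8.301" would cover every combination that uses that generator version, while a longer entry would single out one specific combination.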

Henry Nebrensky Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 0	Message 41619 - Posted: 18 Feb 2020, 10:00:28 UTC - in response to Message 41557. Last modified: 18 Feb 2020, 10:05:54 UTC
Quote:
    what I am having now on console_2 with a Sherpa that's been running for 1 day 10 hours:
        Y out of bounds ISR_Handler::makeISR(..): s' out of bounds
    so I guess this task is faulty, too, isn't it?
No idea - us volunteers just get to stare at the log files and guess... I think I've seen these sorts of warning messages (ISR out of bounds, inaccurate rotation, etc.) from "successful" tasks, just as I occasionally see similar messages from Pythia, Herwig and so on. It's hard to be sure as successful tasks delete their log files so the sampling is very biased! I'm still deciding based on whether the time prediction is consistent: those lines should still be appearing in the logs. For example, task 263982585 is presently spewing a stream of
    METS_Scale_Setter::SetScales(): Failed to determine \mu.
but I'm letting it run as the time left looks to be reasonable:
    Event 10000 ( 4h 31m 53s elapsed / 20h 23m 29s left ) -> ETA: Wed Feb 19 02:49
Note that the tasks have a series of internal phases, each with a separate time prediction, so I'm not sure what should happen at 3am tomorrow, whether the task finishes or just starts another phase, and I won't be staying up late to find out!
(edit: Thinking about it, another danger is probably the volume of the warning messages making the log file grow too large and thus the task getting killed for using too much disk space, rather than for a failure within the code itself).
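
Editor's sketch: checking the time prediction by hand means reading lines like the "Event 10000 ( ... elapsed / ... left )" example above. A minimal Python parser for that one line format might look like this. The regular expression is written against the single sample line quoted in the post, so other Sherpa versions may print it differently, and the "d" unit for days is an assumption; parse_progress and to_seconds are hypothetical names.

```python
import re

# Pattern written against the one sample line quoted in the thread.
PROGRESS = re.compile(
    r"Event\s+(?P<event>\d+)\s*\(\s*(?P<elapsed>[^/]+?)\s+elapsed"
    r"\s*/\s*(?P<left>.+?)\s+left\s*\)"
)

# "h", "m" and "s" appear in the sample line; "d" for days is an assumption.
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(duration):
    """Convert a string like '4h 31m 53s' to a number of seconds."""
    parts = re.findall(r"(\d+)\s*([dhms])", duration)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


def parse_progress(line):
    """Return (event, elapsed_seconds, left_seconds), or None if no match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return (
        int(match["event"]),
        to_seconds(match["elapsed"]),
        to_seconds(match["left"]),
    )
```

Because each internal phase has its own prediction, as noted in the post, a single parsed value says little; comparing several readings over time would be closer to the consistency check described there.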

Harri Liljeroos Joined: 28 Sep 04 Posts: 793 Credit: 63,458,000 RAC: 25,218	Message 41621 - Posted: 18 Feb 2020, 11:16:13 UTC - in response to Message 41619.
I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1491 Credit: 9,985,934 RAC: 993	Message 41622 - Posted: 18 Feb 2020, 11:59:26 UTC - in response to Message 41621.
Quote:
    I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems.
That makes me think of the 'Big Bang' with no one there to abort the failed experiment ;)
BTW: When you have Theory running in the default setup, a task will be aborted after 100 hours of elapsed time.

Erich56 Joined: 18 Dec 15 Posts: 1941 Credit: 156,097,100 RAC: 108,526	Message 41623 - Posted: 18 Feb 2020, 11:59:41 UTC - in response to Message 41621.
Quote:
    I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems.
That's what I strongly doubt. I've never seen this so far :-(

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1491 Credit: 9,985,934 RAC: 993	Message 41624 - Posted: 18 Feb 2020, 12:02:10 UTC - in response to Message 41619.
Quote:
    (edit: Thinking about it, another danger is probably the volume of the warning messages making the log file grow too large and thus the task getting killed for using too much disk space, rather than for a failure within the code itself).
In your results you may see that the disk usage is normally up to ~2 GB/task. A task will be aborted when the disk usage is more than 7.5 GB.

Henry Nebrensky Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 0	Message 41625 - Posted: 18 Feb 2020, 12:06:16 UTC - in response to Message 41621. Last modified: 18 Feb 2020, 12:07:43 UTC
Quote:
    I hope that the tasks would be smart enough to abort themselves if the output lines mean severe problems.
I think that the problem with the loooong-runners is that they would - in principle - eventually finish correctly, but not within a timescale compatible with volunteer computing: the task deadline is only 10 days, and eventually the tasks get purged from the DB after which any returned results are useless. Tasks that run for 100+ days aren't viable.
Of course, if Sherpa's internal predictions are reliable, then maybe it should self-destruct if the predicted runtime is over some threshold. I've seen predicted runtimes of more than 6000 days = 16 years - that's well past the lifetime of the OS and pushing the lifetime of the hardware for most of us!
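
Editor's sketch: the threshold idea in the post above can be written down in a few lines of Python. This is not anything Sherpa or BOINC actually does; should_give_up and days_to_years are hypothetical helpers, and the default limit of 100 hours is the Theory cut-off mentioned elsewhere in this thread. The check simply asks whether elapsed time plus predicted remaining time exceeds the limit.

```python
HOUR = 3600
DAY = 24 * HOUR


def should_give_up(elapsed_s, left_s, limit_s=100 * HOUR):
    """True if the predicted total runtime (elapsed + left) exceeds the limit."""
    return elapsed_s + left_s > limit_s


def days_to_years(days):
    """Convert days to years, using 365.25 days per year."""
    return days / 365.25
```

Since the predictions are reported in this thread to change between readings, acting on a single value would be risky; a threshold like this is probably only useful when the prediction stays above the limit over several readings.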

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 71,016	Message 41626 - Posted: 18 Feb 2020, 12:28:15 UTC - in response to Message 41625.
Quote:
    I've seen predicted runtimes of more than 6000 days = 16 years...
Maybe just a hint that we should buy more powerful hardware. ;-D
BTW: Had a sherpa task with a predicted runtime of nearly 20000 days (increasing) last week. (Don't want to start a race. I guess it depends on when you look into the logfiles.)

S@NL - John van Gorsel Joined: 8 Aug 11 Posts: 7 Credit: 2,715,739 RAC: 338	Message 41627 - Posted: 18 Feb 2020, 16:27:55 UTC
I have one task that is currently at 93,94% after running for a little over 90 hours. It now shows in my task list as "Timed out -no response". With 6 more hours to completion, does this still add value to science or is this just wasted time and should I abandon this task?

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1491 Credit: 9,985,934 RAC: 993	Message 41628 - Posted: 18 Feb 2020, 18:32:58 UTC - in response to Message 41627.
@John: Welcome to the forum! Let the task run. It will be killed after 100 hours of elapsed time if it has not finished before then.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 71,016	Message 41629 - Posted: 19 Feb 2020, 7:18:52 UTC - in response to Message 41627.
A result from a task that is marked as "Timed out -no response" in the server DB will not be used any more. Hence it makes no sense to let it run.
Quote:
    currently at 93,94% ... 6 more hours to completion
These are shown by the BOINC client, I guess. They are not reliable in the case of longrunners, since BOINC calculates them based on averages from previous results.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1491 Credit: 9,985,934 RAC: 993	Message 41631 - Posted: 19 Feb 2020, 8:32:32 UTC - in response to Message 41629. Last modified: 19 Feb 2020, 8:34:26 UTC
Quote:
    A result from a task that is marked as "Timed out -no response" in the server DB will not be used any more. Hence it makes no sense to let it run.
Normal BOINC behaviour is: When you return a valid result after the deadline and before a wingman has returned a valid result of the resend, your result will be used and credit granted. When a wingman returns his resend after you and before his deadline, he will get credit too.

S@NL - John van Gorsel Joined: 8 Aug 11 Posts: 7 Credit: 2,715,739 RAC: 338	Message 41642 - Posted: 19 Feb 2020, 16:14:43 UTC - in response to Message 41631.
Quote:
    Normal BOINC behaviour is: When you return a valid result after the deadline and before a wingman has returned a valid result of the resend, your result will be used and credit granted. When a wingman returns his resend after you and before his deadline, he will get credit too.
I was the second to receive my workunit https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=131787969 but neither my predecessor nor I received credit. The workunit has now gone to a third person who will likely chew on it for 100 hours. This looks like an awful waste of resources :-(

S@NL - John van Gorsel Joined: 8 Aug 11 Posts: 7 Credit: 2,715,739 RAC: 338	Message 41836 - Posted: 6 Mar 2020, 17:29:27 UTC - in response to Message 41642.
Quote:
    The workunit now went to a third person who will likely chew on it for 100 hours. This looks like an awful waste of resources :-(
And so it happened. The task has now ended as "Too many total results", so 3 people wasted 100 hours each on this task :-(