New Version v300.05
Author | Message |
---|---|
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 169 |
In the past it was running on an IBM Z-series or a Cray ;-)) |
Joined: 18 Dec 15 Posts: 1824 Credit: 119,090,926 RAC: 18,984 |
Quote: The decision is up to you.
What caught my eye just now is that I am having the same problem on two other Sherpas, too. Console F2 contains a line that says "...697 d ... left", and on another task even "2253 d ... left". I don't know whether this should be taken at face value or not.
In general I can say that I keep having a lot of trouble with Sherpa tasks. I guess I will check every new task to see whether it's a Sherpa and, if so, abort it immediately, thus avoiding hours and days of unnecessary CPU time. In my experience, a high proportion of Sherpas do not work well. Their code should be rewritten or abandoned. |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 169 |
Quote: In general I can say that I keep having a lot of trouble with Sherpa tasks. I guess I will check every new task to see whether it's a Sherpa and, if so, abort it immediately, thus avoiding hours and days of unnecessary CPU time. In my experience, a high proportion of Sherpas do not work well. Their code should be rewritten or abandoned.
We need more information about the problems with Sherpa, so every running task is useful. Wasting some of our computers' time is sometimes part of that. It must be possible to contact the Sherpa programmers from CERN. |
Joined: 13 Jul 05 Posts: 169 Credit: 15,000,737 RAC: 1 |
Quote: Console F2 contains a line that says "...697 d ... left", and on another task even "2253 d ... left". I don't know whether this should be taken at face value or not.
My experience is that successful Sherpas - even long-runners - do actually finish as predicted (to within a minute!). So I've been aborting any tasks where the time left exceeds 100 days, a state usually reached in a couple of hours. The reported time elapsed should be increasing in line with the clock on the wall, and the time left decreasing in step.
The other failure mode I've seen recently is that the Sherpa (native, in my case) keeps running at 100% of a CPU but stops writing to the runRivet.log - I've been aborting those after ~24hrs of non-feedback. (I do save the runRivet.log just before aborting, but there doesn't seem to be anywhere sensible to upload them to. I doubt that filling these fora with 300kB files helps anything.)
Quote: In general I can say that I keep having a lot of trouble with Sherpa tasks.
I did suggest that they be given their own sub-project precisely because they need so much extra baby-sitting, but there doesn't seem to have been any follow-up to the project's request for comments. |
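The two abort rules described in this post (time left above 100 days, or no log output for about 24 hours) can be written down as a small check. This is only a sketch: the function names are made up for illustration, the "time left" figure is assumed to be read off the console by hand, and the file's modification time is assumed to be a fair stand-in for "still writing to runRivet.log".

```python
import os
import time

MAX_DAYS_LEFT = 100      # threshold quoted in the post
MAX_SILENT_HOURS = 24    # "~24hrs of non-feedback" from the post


def log_silent_hours(log_path, now=None):
    """Hours since the log file was last modified (hypothetical helper)."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(log_path)) / 3600.0


def should_abort(days_left, silent_hours):
    """True if either rule from the post is triggered."""
    return days_left > MAX_DAYS_LEFT or silent_hours > MAX_SILENT_HOURS
```

For example, `should_abort(697, 0.5)` is true because of the runaway estimate, while a task showing 2 days left whose log was written an hour ago would be left alone.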
Joined: 18 Dec 15 Posts: 1824 Credit: 119,090,926 RAC: 18,984 |
Quote: In general I can say that I keep having a lot of trouble with Sherpa tasks.
Quote: I did suggest that they be given their own sub-project precisely because they need so much extra baby-sitting, but there doesn't seem to have been any follow-up to the project's request for comments.
I fully agree that Sherpas deserve their own subproject. There are so many problems with Sherpas that we crunchers should be able to choose not to download them.
A few minutes ago, I aborted a task which had run for 10 days 3 hours 50 minutes - so it was almost 4 hours beyond the deadline anyway. I guess it got stuck in some kind of loop; from the stderr, however, I am not able to see what exactly the problem was (I am not expert enough): https://lhcathome.cern.ch/lhcathome/result.php?resultid=260819418
Then I aborted another one which had been running for 3 days 3 hours and in F2 was showing the "inaccurate rotation" message which I had mentioned in an earlier posting here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=262030907 - however, stderr does not show much :-(
And then there was one running for 7 days: https://lhcathome.cern.ch/lhcathome/result.php?resultid=261292850 - again, stderr doesn't show much either.
At any rate, for the time being, I'll abort all Sherpas. They are too faulty :-( Crunching them is a waste of CPU. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 470 |
Quote: At any rate, for the time being, I'll abort all Sherpas. They are too faulty :-( Crunching them is a waste of CPU.
Sherpa attempts: 11294
6025 success
2549 failure
2720 lost
Of the 2141 different Sherpa parameter types, 645 were unsuccessful. |
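To put those counts in proportion, here is a short calculation of the shares. It uses only the numbers from the post; the variable names are just for illustration.

```python
# Counts as given in the post above.
attempts = {"success": 6025, "failure": 2549, "lost": 2720}
total = sum(attempts.values())  # should match the 11294 attempts quoted

# Share of each outcome, in percent, rounded to one decimal place.
rates = {name: round(100.0 * count / total, 1) for name, count in attempts.items()}

# Share of parameter types that were unsuccessful.
param_types = 2141
param_types_unsuccessful = 645
param_fail_rate = round(100.0 * param_types_unsuccessful / param_types, 1)

print(total, rates, param_fail_rate)
```

So roughly 53% of attempts succeeded, about 23% failed, about 24% were lost, and about 30% of the parameter types were unsuccessful.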
Joined: 18 Dec 15 Posts: 1824 Credit: 119,090,926 RAC: 18,984 |
Rather high failure rate, isn't it? |
Joined: 18 Dec 15 Posts: 1824 Credit: 119,090,926 RAC: 18,984 |
Henry Nebrensky wrote: ... My experience is that successful Sherpas - even long-runners - do actually finish as predicted (to within a minute!). So I've been aborting any tasks where the time left exceeds 100 days, a state usually reached in a couple of hours. The reported time elapsed should be increasing in line with the clock on the wall, and the time left decreasing in step.
On one of my systems a Sherpa started about 1 1/2 hours ago. On console 2 there is a line (among others) starting with "integration time" - this time is in line with the clock on the wall; however, the time left is INcreasing (NOT DEcreasing) - right now it shows around 300 days :-( So my guess is that this Sherpa is faulty too and I should kill it, right? |
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 12,857 |
I don't know the method Sherpa uses to calculate the remaining time but it is most likely a statistical forecast. Common to most such forecasts is a high uncertainty right at the beginning of the calculation, when only a few values are available. The forecasts become much more reliable when >10% of the output is available or >10% of the estimated time is over. Since Theory's watchdog is configured to stop the task after 100h it makes sense to look into it after 10h. 1.5h appears to be too early. |
Joined: 18 Dec 15 Posts: 1824 Credit: 119,090,926 RAC: 18,984 |
Thanks for your comments! With these Sherpas, it's indeed not easy :-) As I mentioned earlier, my intention now, anyway, is to kill any Sherpa right away. But in this case, for some reason I wanted to give it a chance. However, I am not too confident anyway. Also, it's kind of hard to keep track of newly started Sherpas on several machines running Theory. That's why I was thinking of some kind of script which would kill a Sherpa immediately after it starts. However, I am not good enough at writing scripts, and such a special one is definitely not that easy to write; at least that's my guess. |
Joined: 13 Jul 05 Posts: 169 Credit: 15,000,737 RAC: 1 |
Hi,
Quote: Since Theory's watchdog is configured to stop the task after 100h it makes sense to look into it after 10h.
But the watchdog is there precisely to catch faulty tasks. On my machines the bigger successful Sherpas take a bit over a day, albeit with a tail of long-runners into the 4-7 day range (but those still reported consistent time predictions). I agree you initially need to wait a bit for the estimate to converge, but by ~2hrs into an expected ~20hr job a prediction of ~200 days would be out by two orders of magnitude, rather than by a few percent.
Quote: 1.5h appears to be too early.
I do try to give them longer than that (at least 2-5hrs) in case they do recover, but it's a personal decision. Also, I tend to be more likely to give questionable tasks the benefit of the doubt if I expect to be able to check on them again in the near future, whereas if it's the last chance to do anything before a long weekend, say, then I'll be more brutal. |
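The figures traded in these two posts can be checked with a few lines of arithmetic. This is a sketch using only the numbers quoted (100h watchdog, the 10% rule of thumb, a ~20hr job predicted at ~200 days); the function names are hypothetical.

```python
import math


def first_check_hours(watchdog_hours, fraction=0.10):
    """When the 10% rule of thumb suggests first looking at the estimate."""
    return watchdog_hours * fraction


def orders_of_magnitude_off(predicted_days, expected_hours):
    """log10 of the ratio between the predicted and the expected runtime."""
    predicted_hours = predicted_days * 24.0
    return math.log10(predicted_hours / expected_hours)


print(first_check_hours(100))            # watchdog limit of 100h
print(orders_of_magnitude_off(200, 20))  # ~200 days predicted for a ~20hr job
```

200 days is 4800 hours, i.e. 240 times the expected 20 hours, which is a little over two orders of magnitude. That is the point being made: an estimate that far out is not ordinary early-forecast noise.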
Joined: 18 Dec 15 Posts: 1824 Credit: 119,090,926 RAC: 18,984 |
Henry Nebrensky wrote: ... My experience is that successful Sherpas - even long-runners - do actually finish as predicted (to within a minute!). So I've been aborting any tasks where the time left exceeds 100 days, a state usually reached in a couple of hours. The reported time elapsed should be increasing in line with the clock on the wall, and the time left decreasing in step.
This morning console 2 showed more than 1000 days under "time left" - so I killed it immediately. Sherpa tasks are really junk - sorry to say this :-(
Again my question to the experts here: is there a way, by a script or something else, by which a Sherpa task can be aborted right after start? I would definitely need something like this. I am not able to check back every few hours, at different times, on several machines, whether a Sherpa is running. |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 470 |
Quote: Again my question to the experts here: is there a way, by a script or something else, by which a Sherpa task can be aborted right after start? I would definitely need something like this. I am not able to check back every few hours, at different times, on several machines, whether a Sherpa is running.
I'm not an expert, but for Windows you could use this, if you have BOINC in the default installation path. It will abort Sherpa tasks even before the start when you have a few Theory tasks 'Ready to Start' in your queue.
    :START
    For /F "Delims=" %%g In ('FindStr/ILMC:" sherpa " *.run 2^>Nul') Do (
    "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_0 abort
    "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_1 abort
    "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_2 abort
    )
    timeout /t 60 /NOBREAK >NUL
    goto START
    exit
Put the AbortSherpa.bat in the LHC project folder and start it from there. It will loop forever. It tries to abort versions 0, 1 and 2 of a Sherpa task, because I don't want to make it too complicated by figuring out whether it was a resend. For the versions not present BOINC will report an error like:
GUI RPC error: no such result |
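For those not on Windows, the same idea as the batch file can be sketched in Python. This is an untested-in-production sketch, not a drop-in replacement: it assumes the `.run` files sit in the folder you point it at, that `boinccmd` is on the PATH, and that task names follow the `Theory_<name>_<n>` pattern used in the batch file. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome"


def find_sherpa_run_files(folder):
    """Return *.run files containing ' sherpa ' (case-insensitive), like FindStr /I /L /M."""
    hits = []
    for path in sorted(Path(folder).glob("*.run")):
        text = path.read_text(errors="ignore")
        if " sherpa " in text.lower():
            hits.append(path)
    return hits


def abort_commands(run_file, boinccmd="boinccmd", resends=(0, 1, 2)):
    """Build one boinccmd abort command per possible resend suffix (_0, _1, _2)."""
    stem = Path(run_file).stem  # file name without .run, like %%~ng in the batch file
    return [
        [boinccmd, "--task", PROJECT_URL, f"Theory_{stem}_{n}", "abort"]
        for n in resends
    ]


def abort_sherpas(folder, boinccmd="boinccmd"):
    """Run the abort commands; errors for non-existent resends are ignored."""
    for run_file in find_sherpa_run_files(folder):
        for cmd in abort_commands(run_file, boinccmd):
            subprocess.run(cmd, check=False)
```

Calling `abort_sherpas(".")` from the project folder once a minute (for example from cron) would mimic the batch file's loop. As with the batch file, expect "no such result" errors for the suffixes that don't exist.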
Joined: 18 Dec 15 Posts: 1824 Credit: 119,090,926 RAC: 18,984 |
Many thanks, C.P., for the script. Just one question, to be sure I do the right thing: you say "C:\Program Files\BOINC\boinccmd" - in my German version of Windows this folder is called "Programme" - so should I replace "Program Files" with "Programme"? (From what I think I've read somewhere, though, any language version of Windows internally uses "Program Files" - in which case I should leave "Program Files" unchanged - is this correct?) |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 470 |
I think you have to change Program Files into Programme to be able to find boinccmd.exe. |
Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0 |
I like to give most Sherpas the benefit of the doubt, letting them run long enough for the estimated time remaining to have settled down (the estimate in F2, that is; you can safely ignore the BOINC estimate). If it is Decreasing then mostly things are going well. If it is Increasing, or has already gotten to hundreds of days, or the task shows Inaccurate Rotation, then there is clearly a problem and I bin it. (I still haven't figured out how to do that gracefully so I just Abort them.)
Obviously such babysitting isn't practical for those with multiple machines and many cores, who can't view every job that comes in, but I only have 3 hosts with a total of 10 cores between them.
I extended the allowed time to 10 days to let this one run, as it looked OK with a realistic estimate, and it finished in 9 days 10 1/2 hrs. |
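The rule of thumb in this post (decreasing estimate is good, increasing or already in the hundreds of days is bad) can be sketched as a tiny classifier over successive "time left" readings noted down from F2. The function name, the labels, and the 100-day cut-off for "hundreds of days" are assumptions for illustration.

```python
def classify_time_left(readings_days, runaway_days=100.0):
    """Classify a task from successive 'time left' readings (in days, oldest first)."""
    if not readings_days:
        raise ValueError("need at least one reading")
    latest = readings_days[-1]
    if latest >= runaway_days:
        return "suspect"        # estimate already in the hundreds of days
    if len(readings_days) < 2:
        return "undecided"      # a single reading says nothing about the trend
    previous = readings_days[-2]
    if latest < previous:
        return "ok"             # estimate is decreasing
    if latest > previous:
        return "suspect"        # estimate is increasing
    return "undecided"
```

For instance, readings of 5.0, 4.5 and 4.0 days would come out as "ok", while 1.0 followed by 30.0 would be flagged "suspect". It does not look at messages such as Inaccurate Rotation; those still need a human eye.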
Joined: 18 Dec 15 Posts: 1824 Credit: 119,090,926 RAC: 18,984 |
What I am seeing now on console_2 with a Sherpa that's been running for 1 day 10 hours:
Y out of bounds
ISR_Handler::makeISR(..): s' out of bounds
So I guess this task is faulty, too, isn't it? |
Joined: 18 Dec 15 Posts: 1824 Credit: 119,090,926 RAC: 18,984 |
Quote: ... so I guess this task is faulty, too, isn't it?
No one has any advice for me? |
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 12,857 |
Quote: ... so I guess this task is faulty, too, isn't it? No one has any advice for me?
Same advice as always. Either let it run until it recovers (or not) or kill the task. It's your computer, hence your decision. Nobody will be able to reliably tell you what this task will do on your computer just from 2 logfile lines. |
Joined: 18 Dec 15 Posts: 1824 Credit: 119,090,926 RAC: 18,984 |
Quote: Nobody will be able to reliably tell you what this task will do on your computer just from 2 logfile lines.
Yes, of course :-) I was just hoping that someone could clearly judge the meaning of "Y out of bounds - ISR_Handler::makeISR(..): s' out of bounds", maybe in the sense that this task is definitely faulty, ... |
©2025 CERN