Message boards : Theory Application : New Version v300.05

maeax

Message 41491 - Posted: 7 Feb 2020, 10:47:49 UTC - in response to Message 41489.  

In the past it was running on an IBM Z-series or a Cray ;-))

Erich56
Message 41492 - Posted: 7 Feb 2020, 17:34:34 UTC - in response to Message 41490.  

The decision is up to you.
17 h is far away from the 100 h limit, so you may give it 1-2 more days and check the log from time to time to see whether the task recovers.
Nobody can guarantee that it will succeed but you will get more familiar with sherpa's output and this will also be a success.
What caught my eye just now is that I am having the same problem on two other Sherpas, too.
Console F2 contains a line that says "...697 d ... left", and on another task even "2253 d ... left". I don't know whether this can be taken at face value or not.

In general I can say that I keep having a lot of trouble with Sherpa tasks. I think I will check every new task to see whether it's a Sherpa and, if so, abort it immediately, thus avoiding hours and days of unnecessary CPU time. In my experience, a large share of Sherpas do not work well. Their code should be rewritten or abandoned.

maeax
Message 41495 - Posted: 8 Feb 2020, 2:04:42 UTC - in response to Message 41492.  

Erich56 wrote:
In general I can say that I keep having a lot of trouble with Sherpa tasks. I think I will check every new task to see whether it's a Sherpa and, if so, abort it immediately, thus avoiding hours and days of unnecessary CPU time. In my experience, a large share of Sherpas do not work well. Their code should be rewritten or abandoned.

We need more information about the problems with Sherpa, so every running task is useful. Sometimes that means wasting our computers' time.
It must be possible to get in contact with the programmers of Sherpa from CERN.

Henry Nebrensky
Message 41496 - Posted: 8 Feb 2020, 12:18:47 UTC - in response to Message 41492.  

Erich56 wrote:
Console F2 contains a line that says "...697 d ... left", and on another task even "2253 d ... left". I don't know whether this can be taken at face value or not.
My experience is that successful Sherpas - even long-runners - do actually finish as predicted (to within a minute!). So I've been aborting any tasks where the time left exceeds 100 days, a state usually reached in a couple of hours. The reported time elapsed should be increasing in line with the clock on the wall, and the time left decreasing in step.
The other failure mode I've seen recently is that the Sherpa (native, in my case) keeps running at 100% of a CPU but stops writing to the runRivet.log - I've been aborting those after ~24hrs of non-feedback.
(I do save the runRivet.log just before aborting, but there doesn't seem to be anywhere sensible to upload them to. I doubt that filling these fora with 300kB files helps anything.)
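The rule of thumb above can be written down as a small check. This is only a sketch under stated assumptions: it expects readings of (hours elapsed, days left) that were copied by hand from console F2, it does not parse Sherpa's real output, and the function name `judge_sherpa` and its return labels are made up for illustration. The 100-day threshold is the one mentioned in the post.

```python
from typing import List, Tuple

# Threshold from the post above: abort when the predicted time left exceeds 100 days.
MAX_DAYS_LEFT = 100.0


def judge_sherpa(samples: List[Tuple[float, float]]) -> str:
    """Classify a task from (hours_elapsed, days_left) readings, oldest first.

    The readings are assumed to be noted down by hand from console F2;
    nothing here reads Sherpa's log files.
    """
    if len(samples) < 2:
        return "wait"  # a single reading says nothing about the trend
    latest_days_left = samples[-1][1]
    if latest_days_left > MAX_DAYS_LEFT:
        return "abort"  # prediction is beyond 100 days
    # A healthy task shows the time left shrinking from reading to reading.
    decreasing = all(later[1] < earlier[1]
                     for earlier, later in zip(samples, samples[1:]))
    return "keep" if decreasing else "watch"


if __name__ == "__main__":
    print(judge_sherpa([(1.0, 0.9), (2.0, 0.85), (3.0, 0.8)]))
    print(judge_sherpa([(1.5, 300.0), (20.0, 1000.0)]))
```

The "watch" label covers the in-between case where the estimate is still below 100 days but not yet falling; what to do then remains a personal decision.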

Erich56 wrote:
In general I can say that I keep having a lot of trouble with Sherpa tasks.
I did suggest that they be given their own sub-project precisely because they need so much extra baby-sitting, but there doesn't seem to have been any follow-up to the project's request for comments.

Erich56
Message 41497 - Posted: 8 Feb 2020, 13:08:03 UTC - in response to Message 41496.  

I wrote:
In general I can say that I keep having a lot of trouble with Sherpa tasks.
Henry Nebrensky wrote:
I did suggest that they be given their own sub-project precisely because they need so much extra baby-sitting, but there doesn't seem to have been any follow-up to the project's request for comments.
I fully agree that Sherpas deserve their own subproject.
There are so many problems related to Sherpas that we crunchers should be able to choose not to download them.

A few minutes ago, I aborted a task which had run for 10 days 3 hours 50 minutes, so it was almost 4 hours beyond the deadline anyway. I guess it got hung up in some kind of loop; from the stderr, however, I am not able to see what exactly the problem was (I am not expert enough):
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260819418

Then I aborted another one which had been running for 3 days 3 hours and in F2 was showing this "inaccurate rotation" thing which I had mentioned in an earlier posting here.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=262030907 - however, stderr does not show much :-(

And then there was one that had been running for 7 days: https://lhcathome.cern.ch/lhcathome/result.php?resultid=261292850 - again, stderr doesn't show much.

At any rate, for the time being, I'll abort all Sherpas. They are too faulty :-( Crunching them is a waste of CPU.

Crystal Pellet
Volunteer moderator
Volunteer tester
Message 41501 - Posted: 9 Feb 2020, 9:56:03 UTC - in response to Message 41497.  

Erich56 wrote:
At any rate, for the time being, I'll abort all Sherpas. They are too faulty :-( Crunching them is a waste of CPU.
Total Sherpa attempts: 11294
6025 success
2549 failure
2720 lost

Of the 2141 different Sherpa parameter types, 645 were unsuccessful.
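For a quick feel for these figures, the shares can be worked out directly. The sketch below uses only the numbers quoted above; the helper name `percent` is made up for illustration, and what exactly counts as an "unsuccessful" parameter type is not defined in the post.

```python
# Figures quoted in the post above.
attempts = 11294
outcomes = {"success": 6025, "failure": 2549, "lost": 2720}
param_types = 2141
param_types_unsuccessful = 645

# The three outcome counts add up to the total number of attempts.
assert sum(outcomes.values()) == attempts


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, in percent, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


if __name__ == "__main__":
    for name, count in outcomes.items():
        print(f"{name:8s} {count:6d}  {percent(count, attempts):5.1f} %")
    print("unsuccessful parameter types:",
          percent(param_types_unsuccessful, param_types), "%")
```

In round numbers that is about 53% success, 23% failure and 24% lost, with roughly 30% of the parameter types unsuccessful.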

Erich56
Message 41505 - Posted: 9 Feb 2020, 15:47:19 UTC - in response to Message 41501.  

Rather high failure rate, isn't it?

Erich56
Message 41528 - Posted: 11 Feb 2020, 12:57:00 UTC - in response to Message 41496.  

Henry Nebrensky wrote:
... My experience is that successful Sherpas - even long-runners - do actually finish as predicted (to within a minute!). So I've been aborting any tasks where the time left exceeds 100 days, a state usually reached in a couple of hours. The reported time elapsed should be increasing in line with the clock on the wall, and the time left decreasing in step.
On one of my systems a Sherpa started about 1 1/2 hours ago. Under console 2 there is a line (among others) starting with "integration time" - this time is in line with the wall-clock time; however, the time left is INcreasing (NOT DEcreasing) - right now it shows around 300 days :-(
So my guess is that this Sherpa is faulty too and I should kill it, right?

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Message 41531 - Posted: 11 Feb 2020, 13:45:45 UTC - in response to Message 41528.  

I don't know the method sherpa uses to calculate the remaining time but it is most likely a statistical forecast.
Common to most of those statistical forecasts is a high uncertainty right at the beginning of the calculation, when only a few values are available.
The forecasts become much more reliable when >10% of the output is available or >10% of the estimated time is over.

Since Theory's watchdog is configured to stop the task after 100h it makes sense to look into it after 10h.
1.5h appears to be too early.
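The arithmetic behind "look into it after 10h" is simply 10% of the 100 h watchdog limit. A minimal sketch, with made-up function names and the two figures taken from the post:

```python
WATCHDOG_LIMIT_H = 100.0  # Theory watchdog limit mentioned in the post
MIN_FRACTION = 0.10       # "more than 10% of the estimated time is over"


def earliest_useful_check(limit_h: float = WATCHDOG_LIMIT_H,
                          fraction: float = MIN_FRACTION) -> float:
    """Hours after which the forecast is worth a first serious look."""
    return limit_h * fraction


def too_early(elapsed_h: float) -> bool:
    """True if the task has not yet run long enough for the forecast to settle."""
    return elapsed_h < earliest_useful_check()


if __name__ == "__main__":
    print(earliest_useful_check())
    print(too_early(1.5))
```

Using the watchdog limit as the "estimated time" is an assumption made here for the worst case; a task expected to finish sooner would reach its own 10% mark earlier.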

Erich56
Message 41532 - Posted: 11 Feb 2020, 14:10:27 UTC - in response to Message 41531.  

Thanks for your comments!
With these Sherpas, it's indeed not easy :-)
As I mentioned earlier, my intention now, anyway, is to kill any Sherpa right away. But in this case, for some reason I wanted to give it a chance. However, I am not too confident about it.

Also, it's kind of hard to keep track of newly started Sherpas on several machines running Theory.
That's why I was thinking of some kind of script which would kill a Sherpa immediately after it starts. However, I am not good enough at writing scripts, and such a special one is probably not that easy to write; at least that's my guess.

Henry Nebrensky
Message 41534 - Posted: 11 Feb 2020, 16:49:49 UTC - in response to Message 41531.  

Hi,
computezrmle wrote:
Since Theory's watchdog is configured to stop the task after 100h it makes sense to look into it after 10h.
But the watchdog is there precisely to catch faulty tasks. On my machines the bigger successful Sherpas take a bit over a day, albeit with a tail of long-runners into the 4-7 day range (but those still reported consistent time predictions).
I agree you initially need to wait a bit for the estimate to converge, but by ~2hrs into an expected ~20hr job a prediction of ~200days would be out by two orders of magnitude, rather than by a few percent.
computezrmle wrote:
1.5h appears to be too early.
I do try to give them longer than that (at least 2-5hrs) in case they do recover, but it's a personal decision. Also, I tend to be more likely to give questionable tasks the benefit of the doubt if I expect to be able to check on them again in the near future, whereas if it's the last chance to do anything before, say, a long weekend, then I'll be more brutal.
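The "two orders of magnitude" remark can be checked with a one-line calculation. The sketch below uses the approximate figures from the post (a ~200-day prediction for an expected ~20 h job); the function name is made up for illustration.

```python
import math


def orders_of_magnitude_off(predicted_days: float, expected_hours: float) -> float:
    """log10 of the ratio between the predicted and the expected run time."""
    predicted_hours = predicted_days * 24.0
    return math.log10(predicted_hours / expected_hours)


if __name__ == "__main__":
    # 200 days is 4800 h, i.e. 240 times the expected 20 h.
    print(round(orders_of_magnitude_off(200.0, 20.0), 2))
```

A factor of 240 is a little over two orders of magnitude, far outside anything that early-forecast noise of a few percent would explain.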

Erich56
Message 41537 - Posted: 12 Feb 2020, 6:20:58 UTC - in response to Message 41528.  

Henry Nebrensky wrote:
... My experience is that successful Sherpas - even long-runners - do actually finish as predicted (to within a minute!). So I've been aborting any tasks where the time left exceeds 100 days, a state usually reached in a couple of hours. The reported time elapsed should be increasing in line with the clock on the wall, and the time left decreasing in step.
I wrote:
On one of my systems a Sherpa started about 1 1/2 hours ago. Under console 2 there is a line (among others) starting with "integration time" - this time is in line with the wall-clock time; however, the time left is INcreasing (NOT DEcreasing) - right now it shows around 300 days :-(
So my guess is that this Sherpa is faulty too and I should kill it, right?
This morning console 2 showed more than 1000 days under "time left" - so I killed it immediately. Sherpa tasks are really junk - sorry to say this :-(

Again my question to the experts here: is there a way, by a script or something else, by which a Sherpa task can be aborted right after it starts? I would definitely need something like this. I am not able to check back every few hours, at different times, on several machines, whether a Sherpa is running.

Crystal Pellet
Volunteer moderator
Volunteer tester
Message 41547 - Posted: 12 Feb 2020, 15:18:18 UTC - in response to Message 41537.  
Last modified: 12 Feb 2020, 15:39:41 UTC

Erich56 wrote:
Again my question to the experts here: is there a way, by a script or something else, by which a Sherpa task can be aborted right after it starts? I would definitely need something like this. I am not able to check back every few hours, at different times, on several machines, whether a Sherpa is running.
I'm not an expert, but for Windows you could use this, if you have BOINC in the default installation path.
It will abort Sherpa tasks even before they start if you have a few Theory tasks 'Ready to start' in your queue.

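@echo off
rem Run this from the LHC@home project folder: the *.run files are searched in the current directory.

:START

rem findstr switches: /I ignore case, /L literal search, /M print only the names of matching files.
For /F "Delims=" %%g In ('findstr /I /L /M /C:" sherpa " *.run 2^>Nul') Do (
  rem %%~ng is the file name without its extension; suffixes _0, _1 and _2 cover resends.
  "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_0 abort
  "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_1 abort
  "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_2 abort
)

rem Wait 60 seconds, then scan again; the loop runs until the window is closed.
timeout /t 60 /NOBREAK >NUL
goto START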
Put the AbortSherpa.bat in the lhc project folder and start it from there. It will loop forever.
It tries to abort versions 0, 1 and 2 of a Sherpa task, because I don't want to make it too complicated by figuring out whether it was a resend.
For the versions not present BOINC will report an error like: GUI RPC error: no such result
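For those not on Windows, the same idea can be sketched in Python. This is an illustration only, not a tested replacement for the batch file: the function names and the `dry_run` switch are made up, the `*.run` file layout and the `Theory_<name>_<n>` task naming are taken from the batch file above, and the `boinccmd` location is the default path mentioned in the post. It makes a single pass instead of looping every 60 seconds, and by default it only prints the commands.

```python
import subprocess
from pathlib import Path
from typing import List

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome"
# Default path from the post; adjust it for your own installation.
BOINCCMD = r"C:\Program Files\BOINC\boinccmd"


def sherpa_jobs(project_dir: Path) -> List[str]:
    """Base names of *.run files whose text contains ' sherpa ' (any case)."""
    names = []
    for run_file in sorted(project_dir.glob("*.run")):
        text = run_file.read_text(errors="ignore")
        if " sherpa " in text.lower():
            names.append(run_file.stem)
    return names


def abort_commands(job: str) -> List[List[str]]:
    """One boinccmd call per task-name suffix _0, _1, _2, as in the batch file."""
    return [[BOINCCMD, "--task", PROJECT_URL, f"Theory_{job}_{n}", "abort"]
            for n in range(3)]


def abort_all(project_dir: Path, dry_run: bool = True) -> None:
    """Print (dry run) or execute the abort commands for every Sherpa job found."""
    for job in sherpa_jobs(project_dir):
        for cmd in abort_commands(job):
            if dry_run:
                print(" ".join(cmd))
            else:
                # Errors for task versions that do not exist are expected.
                subprocess.run(cmd, check=False)


if __name__ == "__main__":
    abort_all(Path.cwd())  # dry run: only prints what would be executed
```

Calling `abort_all(Path.cwd(), dry_run=False)` from the project folder would actually issue the aborts; checking the printed commands first seems the safer order.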

Erich56
Message 41549 - Posted: 12 Feb 2020, 18:33:03 UTC - in response to Message 41547.  

Many thanks, C.P., for the script.

Just one question, to be sure to do the right thing:
You use "C:\Program Files\BOINC\boinccmd". In my German version of Windows this folder is called "Programme", so should I replace "Program Files" with "Programme"? (From what I think I've read somewhere, though, every language version of Windows internally uses "Program Files", in which case I should leave "Program Files" unchanged. Is this correct?)

Crystal Pellet
Volunteer moderator
Volunteer tester
Message 41550 - Posted: 12 Feb 2020, 18:53:07 UTC - in response to Message 41549.  

I think you have to change Program Files to Programme to be able to find boinccmd.exe.
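One way to avoid guessing the folder name is to ask Windows itself: it sets the `ProgramFiles` environment variable to the real path of that folder (in a batch file this would be `%ProgramFiles%`). The sketch below assumes BOINC sits in a `BOINC` subfolder there, as in the default installation mentioned earlier; the function name is made up for illustration.

```python
import os
from typing import Mapping


def boinccmd_path(env: Mapping[str, str] = os.environ) -> str:
    """Build the boinccmd path from the ProgramFiles variable, with a fallback."""
    program_files = env.get("ProgramFiles", r"C:\Program Files")
    return program_files + r"\BOINC\boinccmd.exe"


if __name__ == "__main__":
    print(boinccmd_path())
```

If BOINC was installed somewhere else, the path still has to be set by hand.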

Ray Murray
Volunteer moderator
Message 41555 - Posted: 12 Feb 2020, 21:21:25 UTC

I like to give most Sherpas the benefit of the doubt, letting them run long enough for the estimated time remaining to settle down (the one in F2, that is; you can safely ignore the BOINC estimate). If it is decreasing, then things are mostly going well. If it is increasing, or has already reached hundreds of days, or the task shows Inaccurate Rotation, then there is clearly a problem and I bin it. (I still haven't figured out how to do that gracefully, so I just abort them.) Obviously, viewing every job that comes in isn't practical for those with multiple machines and many cores, but I only have 3 hosts with a total of 10 cores between them.
I extended the allowed time to 10 days to let this one run, as it looked OK with a realistic estimate, and it finished in 9 days 10 1/2 hrs.

Erich56
Message 41557 - Posted: 13 Feb 2020, 7:31:08 UTC - in response to Message 41555.  

What I am seeing now on console_2 with a Sherpa that's been running for 1 day 10 hours:

Y out of bounds
ISR_Handler::makeISR(..): s' out of bounds


So I guess this task is faulty, too, isn't it?

Erich56
Message 41561 - Posted: 14 Feb 2020, 8:41:22 UTC - in response to Message 41557.  

I wrote:
... so I guess this task is faulty, too, isn't it?
No one has any advice for me?

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Message 41563 - Posted: 14 Feb 2020, 9:14:18 UTC - in response to Message 41561.  

Erich56 wrote:
... so I guess this task is faulty, too, isn't it?
No one has any advice for me?

Same advice as always.
Either let it run until it recovers (or not) or kill the task.
It's your computer, hence your decision.

Nobody will be able to reliably tell you what this task will do on your computer just from 2 logfile lines.

Erich56
Message 41564 - Posted: 14 Feb 2020, 12:39:09 UTC - in response to Message 41563.  

computezrmle wrote:
Nobody will be able to reliably tell you what this task will do on your computer just from 2 logfile lines.
Yes, of course :-)
I was just hoping that someone could clearly judge the meaning of "Y out of bounds - ISR_Handler::makeISR(..): s' out of bounds", perhaps in the sense that this task is definitely faulty, ...