Thread 'New Version v300.05'

Author	Message
maeax Joined: 2 May 07 Posts: 2285 Credit: 178,823,324 RAC: 1,886	Message 41491 - Posted: 7 Feb 2020, 10:47:49 UTC - in response to Message 41489.
In the past it was running on an IBM Z-series or a Cray ;-))

Erich56 Joined: 18 Dec 15 Posts: 1947 Credit: 158,524,775 RAC: 87,778	Message 41492 - Posted: 7 Feb 2020, 17:34:34 UTC - in response to Message 41490.
Quote: The decision is up to you. 17 h is far away from the 100 h limit, so you may give it 1-2 more days and check the log from time to time to see whether the task recovers. Nobody can guarantee that it will succeed but you will get more familiar with sherpa's output and this will also be a success.
What caught my eye just now is that I am having the same problem on two other Sherpas, too. Console F2 contains a line where it says "...697 d ... left", and on another task even "2253 d ... left". I don't know whether this can/must be taken at face value or not.
In general I can say that I keep having a lot of trouble with Sherpa tasks. I guess I will decide to check any new task as to whether it's Sherpa, and if so, I'll abort it immediately, thus avoiding hours and days of unnecessary CPU time. In my experience, a large share of Sherpas do not work well. Their code should be re-written or abandoned.

maeax Joined: 2 May 07 Posts: 2285 Credit: 178,823,324 RAC: 1,886	Message 41495 - Posted: 8 Feb 2020, 2:04:42 UTC - in response to Message 41492.
Quote: In general I can say that I keep having a lot of trouble with Sherpa tasks. I guess I will decide to check any new task as to whether it's Sherpa, and if so, I'll abort it immediately, thus avoiding hours and days of unnecessary CPU time. In my experience, a large share of Sherpas do not work well. Their code should be re-written or abandoned.
We need more information about the problems with Sherpa, so every running task is useful. Wasting our computers' time is sometimes just how it is. It must be possible to get in contact with the programmer of Sherpa from CERN.

Henry Nebrensky Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 0	Message 41496 - Posted: 8 Feb 2020, 12:18:47 UTC - in response to Message 41492.
Quote: Console F2 contains a line where it says "...697 d ... left", and on another task even "2253 d ... left". I don't know whether this can/must be taken at face value or not.
My experience is that successful Sherpas - even long-runners - do actually finish as predicted (to within a minute!). So I've been aborting any tasks where the time left exceeds 100 days, a state usually reached in a couple of hours. The reported time elapsed should be increasing in line with the clock on the wall, and the time left decreasing in step.
The other failure mode I've seen recently is that the Sherpa (native, in my case) keeps running at 100% of a CPU but stops writing to the runRivet.log - I've been aborting those after ~24hrs of non-feedback.
(I do save the runRivet.log just before aborting, but there doesn't seem to be anywhere sensible to upload them to. I doubt that filling these fora with 300kB files helps anything.)
Quote: In general I can say that I keep having a lot of trouble with Sherpa tasks.
I did suggest that they be given their own sub-project precisely because they need so much extra baby-sitting, but there doesn't seem to have been any follow-up to the project's request for comments.
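
The two rules of thumb in this post can be written down as a small check. What follows is only a sketch in Python: the function and parameter names are hypothetical, both inputs are assumed to be read by hand from console F2 and from the timestamp of runRivet.log (no log parsing is attempted), and the thresholds are simply the ones mentioned in the post (100 days of "time left", roughly 24 hours without log output), not project policy.

```python
def should_abort(time_left_days, hours_since_last_log_write,
                 max_time_left_days=100.0, max_silent_hours=24.0):
    """Return a reason string if the task looks stuck, otherwise None.

    time_left_days: the "time left" estimate, read manually from console F2.
    hours_since_last_log_write: how long runRivet.log has been silent.
    The default thresholds are a volunteer's rules of thumb (assumption).
    """
    if time_left_days > max_time_left_days:
        return "time left exceeds %g days" % max_time_left_days
    if hours_since_last_log_write >= max_silent_hours:
        return "no log output for %g hours" % max_silent_hours
    return None


# Example: the 697-day estimate quoted earlier in the thread.
reason = should_abort(time_left_days=697, hours_since_last_log_write=0)
```

Keeping the thresholds as parameters makes it easy to be more lenient (for example, waiting longer before a task is judged) without touching the logic.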

Erich56 Joined: 18 Dec 15 Posts: 1947 Credit: 158,524,775 RAC: 87,778	Message 41497 - Posted: 8 Feb 2020, 13:08:03 UTC - in response to Message 41496.
Quote: In general I can say that I keep having a lot of trouble with Sherpa tasks.
Quote: I did suggest that they be given their own sub-project precisely because they need so much extra baby-sitting, but there doesn't seem to have been any follow-up to the project's request for comments.
I fully agree that Sherpas deserve their own subproject. There are so many problems related to Sherpas that we crunchers should be able to choose not to download them.
A few minutes ago, I aborted a task which had run for 10 days 3 hours 50 minutes, so it was almost 4 hours beyond the deadline anyway. I guess it got hung up in some kind of loop; from the stderr, however, I am not able to see what exactly the problem was (I am not expert enough): https://lhcathome.cern.ch/lhcathome/result.php?resultid=260819418
Then I aborted another one which had been running for 3 days 3 hours and in F2 was showing this "inaccurate rotation" thing which I had mentioned in an earlier posting here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=262030907 - however, stderr does not show much :-(
And then there was one running for 7 days: https://lhcathome.cern.ch/lhcathome/result.php?resultid=261292850 - again, stderr doesn't show much either.
At any rate, for the time being, I'll abort all Sherpas. They are too faulty :-( Crunching them is a waste of CPU.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1521 Credit: 10,013,911 RAC: 1,240	Message 41501 - Posted: 9 Feb 2020, 9:56:03 UTC - in response to Message 41497.
Quote: At any rate, for the time being, I'll abort all Sherpas. They are too faulty :-( Crunching them is a waste of CPU.
Several Sherpa attempts: 11294
  6025 success
  2549 failure
  2720 lost
From the 2141 different Sherpa parameter types 645 were unsuccessful.
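
To put these counts into proportion, the shares can be worked out directly. The short Python sketch below uses only the numbers given in the post; the variable and function names are made up for illustration.

```python
# Counts copied from the post above.
attempts = {"success": 6025, "failure": 2549, "lost": 2720}
total = sum(attempts.values())  # adds up to the 11294 attempts quoted


def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


rates = {outcome: share(count, total) for outcome, count in attempts.items()}
# rates == {"success": 53.3, "failure": 22.6, "lost": 24.1}

# Share of the 2141 parameter types that were unsuccessful.
unsuccessful_types = share(645, 2141)  # 30.1
```

So a bit over half of the attempts succeeded, and roughly three in ten parameter types were unsuccessful.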

Erich56 Joined: 18 Dec 15 Posts: 1947 Credit: 158,524,775 RAC: 87,778	Message 41505 - Posted: 9 Feb 2020, 15:47:19 UTC - in response to Message 41501.
Rather high failure rate, isn't it?

Erich56 Joined: 18 Dec 15 Posts: 1947 Credit: 158,524,775 RAC: 87,778	Message 41528 - Posted: 11 Feb 2020, 12:57:00 UTC - in response to Message 41496.
Henry Nebrensky wrote: ... My experience is that successful Sherpas - even long-runners - do actually finish as predicted (to within a minute!). So I've been aborting any tasks where the time left exceeds 100 days, a state usually reached in a couple of hours. The reported time elapsed should be increasing in line with the clock on the wall, and the time left decreasing in step.
On one of my systems a Sherpa started about 1 1/2 hours ago. Under console 2 there is a line (among others) starting with "integration time" - this time is in line with the clock time on the wall; however, the time left is INcreasing (NOT DEcreasing) - right now it shows around 300 days :-(
So my guess is that this Sherpa is faulty too and I should kill it, right?

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 6,584	Message 41531 - Posted: 11 Feb 2020, 13:45:45 UTC - in response to Message 41528.
I don't know the method sherpa uses to calculate the remaining time but it is most likely a statistical forecast. Common to most such forecasts is a high uncertainty right at the beginning of the calculation, when only a few values are available. The forecasts become much more reliable when >10% of the output is available or >10% of the estimated time is over.
Since Theory's watchdog is configured to stop the task after 100h it makes sense to look into it after 10h. 1.5h appears to be too early.

Erich56 Joined: 18 Dec 15 Posts: 1947 Credit: 158,524,775 RAC: 87,778	Message 41532 - Posted: 11 Feb 2020, 14:10:27 UTC - in response to Message 41531.
Thanks for your comments! With these Sherpas, it's indeed not easy :-)
As I mentioned earlier, my intention now, anyway, is to kill any Sherpa right away. But in this case, for some reason I wanted to give it a chance. However, I am not too confident anyway.
Also, it's kind of hard to keep track of newly started Sherpas on several machines running Theory. That's why I was thinking of some kind of script which would kill a Sherpa immediately after it starts. However, I am not good enough at writing scripts, and such a special one is definitely not that easy to write; at least that's what I guess.

Henry Nebrensky Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 0	Message 41534 - Posted: 11 Feb 2020, 16:49:49 UTC - in response to Message 41531.
Hi,
Quote: Since Theory's watchdog is configured to stop the task after 100h it makes sense to look into it after 10h.
But the watchdog is there precisely to catch faulty tasks. On my machines the bigger successful Sherpas take a bit over a day, albeit with a tail of long-runners into the 4-7 day range (but those still reported consistent time predictions). I agree you initially need to wait a bit for the estimate to converge but by ~2hrs into an expected ~20hr job a prediction of ~200days would be out by two orders of magnitude, rather than some %.
Quote: 1.5h appears to be too early.
I do try to give them longer than that (at least 2-5hrs) in case they do recover but it's a personal decision. Also, I tend to be more likely to give questionable tasks the benefit of the doubt if I expect to be able to check on them again in the near future, whereas if it's the last chance to do anything before a long weekend, say, then I'll be more brutal.
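
The "two orders of magnitude" remark in this post can be checked with a line of arithmetic. The Python sketch below uses only the round numbers from the post (~200 days predicted, ~20 hours expected); the function name is hypothetical.

```python
import math


def orders_of_magnitude_off(predicted_hours, expected_hours):
    """Base-10 logarithm of the ratio between prediction and expectation."""
    return math.log10(predicted_hours / expected_hours)


predicted = 200 * 24  # ~200 days, expressed in hours
expected = 20         # ~20 hours for a typical bigger Sherpa, per the post
off_by = orders_of_magnitude_off(predicted, expected)  # a little over 2
```

The ratio is 240, so the estimate is off by somewhat more than two orders of magnitude, which is far outside the early-forecast uncertainty of "some %".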

Erich56 Joined: 18 Dec 15 Posts: 1947 Credit: 158,524,775 RAC: 87,778	Message 41537 - Posted: 12 Feb 2020, 6:20:58 UTC - in response to Message 41528.
Henry Nebrensky wrote: ... My experience is that successful Sherpas - even long-runners - do actually finish as predicted (to within a minute!). So I've been aborting any tasks where the time left exceeds 100 days, a state usually reached in a couple of hours. The reported time elapsed should be increasing in line with the clock on the wall, and the time left decreasing in step.
I wrote: On one of my systems a Sherpa started about 1 1/2 hours ago. Under console 2 there is a line (among others) starting with "integration time" - this time is in line with the clock time on the wall; however, the time left is INcreasing (NOT DEcreasing) - right now it shows around 300 days :-( So my guess is that this Sherpa is faulty too and I should kill it, right?
This morning console 2 showed more than 1000 days under "time left" - so I killed it immediately. Sherpa tasks are really junk - sorry to say this :-(
Again my question to the experts here: is there a way, by a script or something else, by which a Sherpa task can be aborted right after start? I would definitely need something like this. I am not able to check back every few hours, at different times, on several machines, whether a Sherpa is running.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1521 Credit: 10,013,911 RAC: 1,240	Message 41547 - Posted: 12 Feb 2020, 15:18:18 UTC - in response to Message 41537. Last modified: 12 Feb 2020, 15:39:41 UTC
Quote: Again my question to the experts here: is there a way, by a script or something else, by which a Sherpa task can be aborted right after start? I would definitely need something like this. I am not able to check back every few hours, at different times, on several machines, whether a Sherpa is running.
I'm not an expert, but for Windows you could use this, if you have BOINC in the default installation path. It will abort Sherpa-tasks even before the start when you have a few Theory-tasks 'Ready to Start' in your queue.
    REM Find every *.run file that mentions " sherpa " and abort the matching Theory task.
    :START
    For /F "Delims=" %%g In ('FindStr/ILMC:" sherpa " *.run 2^>Nul') Do (
    "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_0 abort
    "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_1 abort
    "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome Theory_%%~ng_2 abort
    )
    REM Wait 60 seconds, then scan again.
    timeout /t 60 /NOBREAK >NUL
    goto START
    exit
Put the AbortSherpa.bat in the lhc project folder and start it from there. It will loop forever.
It tries to abort version 0, 1 and 2 of a Sherpa-task, because I don't want to make it too complicated to figure out whether it was a resend. For the versions not present BOINC will report an error like: GUI RPC error: no such result
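
For hosts where a batch file is not an option, the same idea can be sketched in Python. This is an illustration rather than a tested replacement: it assumes, as the batch file does, that the project folder holds one *.run file per queued Theory task and that a Sherpa job's file contains the word " sherpa " surrounded by spaces. The function names are hypothetical, `boinccmd` is assumed to be on the PATH (pass a full path otherwise), and whether `boinccmd` can authenticate against the running client from your working directory is not checked here.

```python
import subprocess
from pathlib import Path

PROJECT_URL = "https://lhcathome.cern.ch/lhcathome"  # same URL as in the batch file


def sherpa_task_names(project_dir, resends=(0, 1, 2)):
    """Return candidate task names for every *.run file mentioning ' sherpa '."""
    names = []
    for run_file in sorted(Path(project_dir).glob("*.run")):
        text = run_file.read_text(errors="ignore")
        if " sherpa " in text.lower():  # case-insensitive, like FindStr /I
            for n in resends:
                # Mirrors Theory_%%~ng_0, _1 and _2 from the batch file.
                names.append("Theory_%s_%d" % (run_file.stem, n))
    return names


def abort_commands(project_dir, boinccmd="boinccmd"):
    """Build one 'boinccmd --task URL NAME abort' command per candidate name."""
    return [[boinccmd, "--task", PROJECT_URL, name, "abort"]
            for name in sherpa_task_names(project_dir)]


def abort_sherpas(project_dir, boinccmd="boinccmd"):
    """Run the abort commands once; call this periodically to mimic the loop."""
    for cmd in abort_commands(project_dir, boinccmd):
        # Resend numbers that do not exist are expected to fail, so the
        # exit code is deliberately ignored.
        subprocess.run(cmd, check=False)
```

Unlike the batch file, this runs a single pass; calling `abort_sherpas` from a scheduled job every minute or so would give the same effect as the endless loop.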

Erich56 Joined: 18 Dec 15 Posts: 1947 Credit: 158,524,775 RAC: 87,778	Message 41549 - Posted: 12 Feb 2020, 18:33:03 UTC - in response to Message 41547.
Many thanks, C.P., for the script.
Just one question, to be sure to do the right thing: you say "C:\Program Files\BOINC\boinccmd" - in my German version of Windows this folder is called "Programme" - so should I replace "Program Files" with "Programme"? (From what I think I've read somewhere, though, any language version of Windows internally uses "Program Files" - in which case I should leave "Program Files" unchanged - is this correct?)

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1521 Credit: 10,013,911 RAC: 1,240	Message 41550 - Posted: 12 Feb 2020, 18:53:07 UTC - in response to Message 41549.
I think you have to change Program Files into Programme to be able to find boinccmd.exe.

Ray Murray Volunteer moderator Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 41555 - Posted: 12 Feb 2020, 21:21:25 UTC
I like to give most Sherpas the benefit of the doubt, letting them run sufficiently long for the estimated time remaining (in F2, that is; you can safely ignore the BOINC estimate) to have settled down. If it is Decreasing then mostly things are going well. If it is Increasing, or has already gotten to hundreds of days, or shows Inaccurate Rotation, then there is clearly a problem and I bin it. (I still haven't figured out how to do that gracefully so I just Abort them.)
Obviously such babysitting, viewing every job that comes in, isn't practical for those with multiple machines and many cores, but I only have 3 hosts with a total of 10 cores between them.
I extended the allowed time to 10 days to let this one run as it looked OK with a realistic estimate, and it finished in 9 days 10+1/2 hrs.

Erich56 Joined: 18 Dec 15 Posts: 1947 Credit: 158,524,775 RAC: 87,778	Message 41557 - Posted: 13 Feb 2020, 7:31:08 UTC - in response to Message 41555.
What I am having now on console_2 with a Sherpa that's been running for 1 day 10 hours:
    Y out of bounds
    ISR_Handler::makeISR(..): s' out of bounds
so I guess this task is faulty, too, isn't it?

Erich56 Joined: 18 Dec 15 Posts: 1947 Credit: 158,524,775 RAC: 87,778	Message 41561 - Posted: 14 Feb 2020, 8:41:22 UTC - in response to Message 41557.
Quote: ... so I guess this task is faulty, too, isn't it?
No one has any advice for me?

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2724 Credit: 299,002,782 RAC: 6,584	Message 41563 - Posted: 14 Feb 2020, 9:14:18 UTC - in response to Message 41561.
Quote: ... so I guess this task is faulty, too, isn't it? No one has any advice for me?
Same advice as always. Either let it run until it recovers (or not) or kill the task. It's your computer, hence your decision.
Nobody will be able to reliably tell you what this task will do on your computer just from 2 logfile lines.

Erich56 Joined: 18 Dec 15 Posts: 1947 Credit: 158,524,775 RAC: 87,778	Message 41564 - Posted: 14 Feb 2020, 12:39:09 UTC - in response to Message 41563.
Quote: Nobody will be able to reliably tell you what this task will do on your computer just from 2 logfile lines.
Yes, of course :-) I was just hoping that there is a chance that someone could clearly judge the meaning of "Y out of bounds - ISR_Handler::makeISR(..): s' out of bounds", maybe in a sense that this task is definitely faulty, ...