Thread '(Native) Theory - Sherpa looooooong runners'

Author	Message
computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 304,217,443 RAC: 115,477	Message 38460 - Posted: 27 Mar 2019, 7:32:42 UTC - in response to Message 38459.
[quote]So we should abort long running tasks? I can do that. What is the cut-off line? 3 hours? 5? Or?[/quote]
There's no reason to abort a task just because it's a longrunner, even if it takes a couple of days to complete. It's Theory's vbox version that makes us believe that tasks must end within 18h. The native version gives you the opportunity to check the app's logfile and see if there is any progress.
tail -fn500 /your_local_boinc_dir/slots/x/cernvm/shared/runRivet.log
The only reason to abort a longrunner is if you shut down or reboot your computer every few hours and a task needs more than that time. In this case it would be better to run Theory inside a VM, as it preserves its state.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,101,515 RAC: 1,464	Message 38462 - Posted: 27 Mar 2019, 8:30:33 UTC - in response to Message 38450.
[quote][quote]2 days 5 hours runtime - yes, we need some time for these "small" tasks.[/quote]
Good point. Modifying my watchdog script to abort sherpa if it's configured for more than 2K events. Or maybe the limit should be 4K events?[/quote]
There is no one-to-one relationship between the number of events and the run time. I've noticed that the combination of beam "ee" and "sherpa" fails rather often. If you want to avoid sherpa long-runners with your watchdog, just abort all sherpas at the beginning. It's up to you. It's your time, your machine and your electricity. Someone else will do the job.
[quote]Or maybe allow the user to select the limit.[/quote]
This is impossible. You cannot choose the kind of Theory generator, let alone the number of events.

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 38464 - Posted: 27 Mar 2019, 11:10:53 UTC - in response to Message 38460.
[quote]The native version gives you the opportunity to check the app's logfile and see if there is any progress.
tail -fn500 /your_local_boinc_dir/slots/x/cernvm/shared/runRivet.log[/quote]
That's manual work most volunteers won't want to do, especially if they run native Theory on more than 1 or 2 hosts. Hence the need for a watchdog script to automate the chore of checking for progress.

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 38466 - Posted: 27 Mar 2019, 12:00:23 UTC - in response to Message 38462.
[quote][quote][quote]2 days 5 hours runtime - yes, we need some time for these "small" tasks.[/quote]
Good point. Modifying my watchdog script to abort sherpa if it's configured for more than 2K events. Or maybe the limit should be 4K events?[/quote]
There is no one-to-one relationship between the number of events and the run time. I've noticed that the combination of beam "ee" and "sherpa" fails rather often. If you want to avoid sherpa long-runners with your watchdog, just abort all sherpas at the beginning. It's up to you. It's your time, your machine and your electricity. Someone else will do the job.[/quote]
And lots of those someone elses will run into the task deadline. What will be the response to their complaints... you may deselect native Theory and select Theory VBox instead?
[quote][quote]Or maybe allow the user to select the limit.[/quote]
This is impossible. You cannot choose the kind of Theory generator, let alone the number of events.[/quote]
I know we cannot choose the generator and number of events. I know those parameters are set by the server and are immutable. I meant that when the task starts, the script compares the job's target events to the user's selected limit, and if the target events exceed the user's limit, the script aborts the task. It does not try to substitute the user's events limit for the target events value sent by the server.
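Editor's sketch of the comparison described in the post above: read the job header line from runRivet.log and check the generator and target events against a user-chosen limit. This is not bronco's actual watchdog script. The function names are hypothetical, and the field positions (generator, version, tune, events, seed, counted from the end of the bracketed list) are an assumption inferred from the header lines quoted in this thread, not from any documented format. Aborting the task itself is deliberately left out.

```python
import re
from typing import Optional

# Matches a job header such as (copied from this thread):
# ===> [runRivet] Mon Mar 25 21:09:50 UTC 2019 [boinc pp winclusive 7000 -,-,10 - sherpa 1.2.3 default 2000 34]
HEADER_RE = re.compile(r"\[runRivet\].*\[boinc (?P<fields>[^\]]+)\]")


def parse_job_header(line: str) -> Optional[dict]:
    """Return generator, version and target events, or None if no header is found."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    fields = match.group("fields").split()
    if len(fields) < 5:
        return None
    # ASSUMPTION: the last five fields are generator, version, tune, events, seed.
    try:
        events = int(fields[-2])
    except ValueError:
        return None
    return {"generator": fields[-5], "version": fields[-4], "events": events}


def exceeds_limit(line: str, generator: str, max_events: int) -> bool:
    """True if the header names `generator` and asks for more than `max_events` events."""
    job = parse_job_header(line)
    return job is not None and job["generator"] == generator and job["events"] > max_events
```

As noted earlier in the thread, the number of events is not a reliable predictor of run time, so a limit like this is a blunt filter at best.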

Greger Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0	Message 38471 - Posted: 27 Mar 2019, 15:57:35 UTC
I'm following a few native tasks, and one of them runs sherpa. Its runtime was at 2 days 1 hour yesterday. Today I saw that it had timed out. We might need to extend the time limit for these sherpas. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=109636098

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 304,217,443 RAC: 115,477	Message 38486 - Posted: 29 Mar 2019, 11:39:53 UTC
https://lhcathome.cern.ch/lhcathome/result.php?resultid=220001720
CPU time now >2d 9h
Due date was yesterday.
[pre]===> [runRivet] Mon Mar 25 21:09:50 UTC 2019 [boinc pp winclusive 7000 -,-,10 - sherpa 1.2.3 default 2000 34][/pre]
Typical logfile entry:
[pre]ISR_Handler::MakeISR(..): s' out of bounds.
s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
Channel_Elements::DiceYForward(0.035562635397831,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-8.05275,}): Y out of bounds !
ymin, ymax vs. y : -1.66823 1.66823 vs. -1.66823
Setting y to lower bound ymin=-1.66823[/pre]

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,101,515 RAC: 1,464	Message 38495 - Posted: 30 Mar 2019, 9:53:14 UTC - in response to Message 38486. Last modified: 30 Mar 2019, 10:33:15 UTC
Time to kick this one out: https://lhcathome.cern.ch/lhcathome/result.php?resultid=220223274
===> [runRivet] Fri Mar 29 16:23:55 UTC 2019 [boinc pp winclusive 7000 -,-,10 - sherpa 1.4.5 default 1000 36]
boinc pp winclusive 7000 -,-,10 - sherpa 1.4.5 default: 19 attempts, all 19 unsuccessful.
With every sherpa version from 1.2.2p up to 2.2.5, the job above has had no success, only losses or failures.
Channel_Elements::GenerateYBackward(1.4427957388279e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,8.04076}): Y out of bounds !
Setting y to upper bound ymax=10
Channel_Elements::GenerateYForward(1.0823267357558e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,8.52864}): Y out of bounds !
Setting y to lower bound ymin=-10
ISR_Handler::MakeISR(..): s' out of bounds.
ISR_Handler::MakeISR(..): s' out of bounds.
ISR_Handler::MakeISR(..): s' out of bounds.
ISR_Handler::MakeISR(..): s' out of bounds.
ISR_Handler::MakeISR(..): s' out of bounds.
Channel_Elements::GenerateYBackward(2.1679207931566e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,-1.56701}): Y out of bounds !
Setting y to upper bound ymax=10
Channel_Elements::GenerateYBackward(4.031082361288e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,1.46371}): Y out of bounds !
Setting y to upper bound ymax=10
Channel_Elements::GenerateYBackward(5.3198662953565e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,-9.58728}): Y out of bounds !
Setting y to upper bound ymax=10
Channel_Elements::GenerateYForward(1.8783382701291e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,4.04184}): Y out of bounds !
Setting y to lower bound ymin=-10
ISR_Handler::MakeISR(..): s' out of bounds.
Channel_Elements::GenerateYForward(1.30712e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,-0.622555}): Y out of bounds !
Setting y to lower bound ymin=-10
ISR_Handler::MakeISR(..): s' out of bounds.
Channel_Elements::GenerateYBackward(3.6462283122869e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,0.84522}): Y out of bounds !
Setting y to upper bound ymax=10
Channel_Elements::GenerateYBackward(1.2477413842638e-08,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,7.2587}): Y out of bounds !
Setting y to upper bound ymax=9.0996728598133
ISR_Handler::MakeISR(..): s' out of bounds.
ISR_Handler::MakeISR(..): s' out of bounds.
Channel_Elements::GenerateYForward(5.9744231820348e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,3.65001}): Y out of bounds !
Setting y to lower bound ymin=-10
ISR_Handler::MakeISR(..): s' out of bounds.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 304,217,443 RAC: 115,477	Message 39140 - Posted: 17 Jun 2019, 16:08:54 UTC
This task is running for more than 2 days: https://lhcathome.cern.ch/lhcathome/result.php?resultid=233367127
Remaining integration time is continuously increasing. Will cancel it now.
[pre]===> [runRivet] Sat Jun 15 15:46:52 UTC 2019 [boinc ee zhad 133 - - sherpa 2.2.5 default 2000 69]
Display update finished (0 histograms, 0 events).
9.89891e+14 pb +- ( 2.46843e+14 pb = 24.9364 % ) 365700000 ( 365700130 -> 99.9 % ) integration time: ( 23h 50m 32s elapsed / 617d 2h 50m 23s left ) [15:48:49]
9.89837e+14 pb +- ( 2.4683e+14 pb = 24.9364 % ) 365720000 ( 365720130 -> 99.9 % ) integration time: ( 23h 50m 37s elapsed / 617d 3h 40m 9s left ) [15:49:00]
9.89783e+14 pb +- ( 2.46816e+14 pb = 24.9364 % ) 365740000 ( 365740130 -> 99.9 % ) integration time: ( 23h 50m 42s elapsed / 617d 4h 30m 33s left ) [15:49:11]
9.89729e+14 pb +- ( 2.46803e+14 pb = 24.9364 % ) 365760000 ( 365760130 -> 99.9 % ) integration time: ( 23h 50m 47s elapsed / 617d 5h 21m 2s left ) [15:49:23]
9.89675e+14 pb +- ( 2.46789e+14 pb = 24.9364 % ) 365780000 ( 365780130 -> 99.9 % ) integration time: ( 23h 50m 52s elapsed / 617d 6h 9m 40s left ) [15:49:33]
9.89621e+14 pb +- ( 2.46776e+14 pb = 24.9364 % ) 365800000 ( 365800130 -> 99.9 % ) integration time: ( 23h 50m 56s elapsed / 617d 6h 58m 43s left ) [15:49:42]
Updating display...
Display update finished (0 histograms, 0 events).
9.89567e+14 pb +- ( 2.46762e+14 pb = 24.9364 % ) 365820000 ( 365820130 -> 99.9 % ) integration time: ( 23h 51m 1s elapsed / 617d 7h 46m 31s left ) [15:49:51]
9.89512e+14 pb +- ( 2.46749e+14 pb = 24.9364 % ) 365840000 ( 365840130 -> 99.9 % ) integration time: ( 23h 51m 6s elapsed / 617d 8h 34m 13s left ) [15:49:59]
9.89458e+14 pb +- ( 2.46735e+14 pb = 24.9364 % ) 365860000 ( 365860130 -> 99.9 % ) integration time: ( 23h 51m 10s elapsed / 617d 9h 23m 47s left ) [15:50:10]
9.89404e+14 pb +- ( 2.46722e+14 pb = 24.9364 % ) 365880000 ( 365880130 -> 99.9 % ) integration time: ( 23h 51m 15s elapsed / 617d 10h 15m 19s left ) [15:50:23]
9.8935e+14 pb +- ( 2.46708e+14 pb = 24.9364 % ) 365900000 ( 365900130 -> 99.9 % ) integration time: ( 23h 51m 20s elapsed / 617d 11h 4m 40s left ) [15:50:32][/pre]
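Editor's sketch of how a watchdog could automate the check made by hand in the post above: extract the "left" estimate from each "integration time" entry and flag the task when the estimate keeps growing. The function names and the five-sample threshold are hypothetical choices for illustration. The line format is assumed from the log excerpts in this thread only.

```python
import re

# Matches e.g. "integration time: ( 23h 50m 32s elapsed / 617d 2h 50m 23s left )"
TIME_RE = re.compile(
    r"integration time:\s*\(\s*(?P<elapsed>[^/]+?)\s+elapsed\s*/\s*(?P<left>.+?)\s+left\s*\)"
)
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(text: str) -> int:
    """Convert a duration such as '617d 2h 50m 23s' to seconds."""
    return sum(int(value) * UNIT_SECONDS[unit]
               for value, unit in re.findall(r"(\d+)\s*([dhms])", text))


def remaining_times(log_text: str) -> list:
    """Return every 'left' estimate found in the log, in seconds, in log order."""
    return [to_seconds(m.group("left")) for m in TIME_RE.finditer(log_text)]


def estimate_is_growing(log_text: str, min_samples: int = 5) -> bool:
    """True if the last `min_samples` estimates are strictly increasing.

    ASSUMPTION: five consecutive increases is an arbitrary illustrative threshold.
    """
    left = remaining_times(log_text)
    if len(left) < min_samples:
        return False
    recent = left[-min_samples:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A growing estimate over a handful of samples is only a hint. Some jobs run for days and still finish, so a real watchdog would probably want a much longer observation window before aborting anything.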

Laurence Project administrator Project developer Joined: 20 Jun 14 Posts: 431 Credit: 256,248 RAC: 59	Message 39193 - Posted: 26 Jun 2019, 14:32:43 UTC - in response to Message 39140. Last modified: 26 Jun 2019, 14:33:09 UTC
[quote]This task is running for more than 2 days: https://lhcathome.cern.ch/lhcathome/result.php?resultid=233367127
Remaining integration time is continuously increasing. Will cancel it now.
[pre]===> [runRivet] Sat Jun 15 15:46:52 UTC 2019 [boinc ee zhad 133 - - sherpa 2.2.5 default 2000 69][/pre][/quote]
Thanks. I managed to find this job and ran it myself. I have the output and can detect that it is a long runner. What do you think should be the limit for long running jobs?

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 304,217,443 RAC: 115,477	Message 39202 - Posted: 27 Jun 2019, 12:28:13 UTC - in response to Message 39193.
[quote]What do you think should be the limit for long running jobs?[/quote]
Hard to say. From time to time there are jobs that run a couple of days and finish successfully. In some cases you find jobs in mcplots that have not a single success but are all lost. Cancel them? What if the limits are just a bit too strict? (Remember the Vogons!)
In this special case I cancelled it because the "estimated runtime left" was more than 617 days and increasing.

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 39210 - Posted: 27 Jun 2019, 19:23:21 UTC - in response to Message 39202.
[quote]What if the limits are just a bit too strict?[/quote]
I assume we're talking about problems with TheoryN, not Theory VBox.
1) Set task duration to 4 days (on my hosts it seems 95% finish in < 3 days)
2) Set (deadline + days of grace) = 15
3) Allow users a way to extend the task duration if they wish.
4) Allow graceful shutdown
I realize 4 is tricky and might take some time to implement. It may even be impossible. Do 1, 2 & 3 for now. Maybe 4 later.

Laurence Project administrator Project developer Joined: 20 Jun 14 Posts: 431 Credit: 256,248 RAC: 59	Message 39221 - Posted: 28 Jun 2019, 9:05:21 UTC - in response to Message 39210.
I have just added some logging and will try to investigate the issues in more detail. There are three categories: looping jobs (infinite time); slow jobs that are reasonable; slow jobs that take too long.
Will look at the results after the weekend once we have some data.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 304,217,443 RAC: 115,477	Message 39222 - Posted: 28 Jun 2019, 14:06:43 UTC
https://lhcathome.cern.ch/lhcathome/result.php?resultid=234031232
More than 951d left. Will cancel it now.
[boinc ee zhad 22 - - sherpa 2.2.5 default 2000 72]
[pre]Display update finished (0 histograms, 0 events).
6.82996e+15 pb +- ( 1.38502e+15 pb = 20.2786 % ) 834080000 ( 834080795 -> 99.9 % ) integration time: ( 2d 7h 38m elapsed / 951d 4h 43m 15s left ) [13:53:19]
Poincare::Poincare(): Inaccurate rotation { a = (0.242668,-0.156307,-0.957434) b = (0,0,1) a' = (0.0476847,0.16094,0.985812) -> rel. dev. (inf,inf,-0.0141884) m_ct = -0.957434 m_st = -0.288652 m_n = (0,-8.26615e-07,1.3495e-07) }
Poincare::Poincare(): Inaccurate rotation { a = (0.242668,-0.156307,-0.957434) b = (0,0,1) a' = (0.0476847,0.16094,0.985812) -> rel. dev. (inf,inf,-0.0141884) m_ct = -0.957434 m_st = -0.288652 m_n = (0,-8.26615e-07,1.3495e-07) }
Poincare::Poincare(): Inaccurate rotation { a = (0.16653,-0.245943,-0.954872) b = (0,0,1) a' = (0.133856,0.247181,0.959679) -> rel. dev. (inf,inf,-0.0403208) m_ct = -0.954872 m_st = -0.297019 m_n = (0,-1.34884e-06,3.47416e-07) }
Poincare::Poincare(): Inaccurate rotation { a = (0.16653,-0.245943,-0.954872) b = (0,0,1) a' = (0.133856,0.247181,0.959679) -> rel. dev. (inf,inf,-0.0403208) m_ct = -0.954872 m_st = -0.297019 m_n = (0,-1.34884e-06,3.47416e-07) }
6.82979e+15 pb +- ( 1.38498e+15 pb = 20.2786 % ) 834100000 ( 834100795 -> 99.9 % ) integration time: ( 2d 7h 38m 5s elapsed / 951d 5h 17m 47s left ) [13:53:30]
6.82963e+15 pb +- ( 1.38495e+15 pb = 20.2786 % ) 834120000 ( 834120795 -> 99.9 % ) integration time: ( 2d 7h 38m 10s elapsed / 951d 5h 52m 2s left ) [13:53:39][/pre]

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 39223 - Posted: 28 Jun 2019, 18:06:39 UTC - in response to Message 39221.
[quote]I have just added some logging and will try to investigate the issues in more detail. There are three categories: looping jobs (infinite time); slow jobs that are reasonable; slow jobs that take too long.
Will look at the results after the weekend once we have some data.[/quote]
Sounds good. Is the additional logging directed to output files on our hosts or to files on the server we cannot access?

Laurence Project administrator Project developer Joined: 20 Jun 14 Posts: 431 Credit: 256,248 RAC: 59	Message 39235 - Posted: 1 Jul 2019, 9:04:05 UTC - in response to Message 39223. Last modified: 1 Jul 2019, 9:04:21 UTC
[quote]Sounds good. Is the additional logging directed to output files on our hosts or to files on the server we cannot access?[/quote]
It is in the standard error output of the VM jobs. The job type is printed.

Laurence Project administrator Project developer Joined: 20 Jun 14 Posts: 431 Credit: 256,248 RAC: 59	Message 39242 - Posted: 1 Jul 2019, 15:03:33 UTC - in response to Message 39222.
I have a message back from the Theory team saying that for a few weeks now there has been a filter that stops particular looping jobs from being sent. However, I have nevertheless investigated for myself and here are the results:
Over the past week there have been over 14K successful jobs on Native Theory and only ~10 have been problematic. I ordered the results by disk usage and elapsed time. The largest disk usage of a successful job was 44.45 MB. Of the six failures with higher disk usage, 4 were long runners aborted by users and 2 exceeded the disk limit (200 MB). Reducing the disk limit to 50 MB should catch the problematic jobs.

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 39277 - Posted: 4 Jul 2019, 15:36:43 UTC - in response to Message 39242.
[quote]I have a message back from the Theory team saying that for a few weeks now there has been a filter that stops particular looping jobs from being sent. However, I have nevertheless investigated for myself and here are the results:
Over the past week there have been over 14K successful jobs on Native Theory and only ~10 have been problematic. I ordered the results by disk usage and elapsed time. The largest disk usage of a successful job was 44.45 MB. Of the six failures with higher disk usage, 4 were long runners aborted by users and 2 exceeded the disk limit (200 MB). Reducing the disk limit to 50 MB should catch the problematic jobs.[/quote]
The filter seems to be working well. I've caught a few sherpas recently and some have been long runners, but they finish successfully. The forever loopers seem to be gone. Thank you, Laurence :-)

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 39278 - Posted: 4 Jul 2019, 15:58:59 UTC - in response to Message 39277.
[quote]The filter seems to be working well. I've caught a few sherpas recently and some have been long runners, but they finish successfully. The forever loopers seem to be gone.[/quote]
That is good enough for me. I don't mind long ones, since my machines run 24/7 anyway. But the age of the universe is probably longer than LHC will run.

bronco Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 39279 - Posted: 4 Jul 2019, 16:17:31 UTC - in response to Message 39278.
[quote][quote]The filter seems to be working well. I've caught a few sherpas recently and some have been long runners, but they finish successfully. The forever loopers seem to be gone.[/quote]
That is good enough for me. I don't mind long ones, since my machines run 24/7 anyway. But the age of the universe is probably longer than LHC will run.[/quote]
The filter has obsoleted major portions of my watchdog script, at least for this workflow. Next workflow maybe not so much. Toying with the notion of turning my watchdog into a "Sherpa hunter" which rejects all the easy jobs and crunches only sherpas with a bad rep from McPlots.
<gloat> Pythias are like sixtrack. Anybody can do 'em. </gloat>

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,101,515 RAC: 1,464	Message 39283 - Posted: 4 Jul 2019, 20:34:37 UTC Last modified: 5 Jul 2019, 14:55:12 UTC
===> [runRivet] Thu Jul 4 15:08:56 CEST 2019 [boinc pp jets 7000 600 - sherpa 2.2.2 default 34000 72]
. . .
15:08:56 +0200 2019-07-04 [INFO] New Job Starting in slot1
15:08:56 +0200 2019-07-04 [INFO] Condor JobID: 501528.68 in slot1
15:09:01 +0200 2019-07-04 [INFO] MCPlots JobID: 50471019 in slot1
16:24:27 +0200 2019-07-05 [INFO] Job finished in slot1 with 0.
Btw, this was not a native one; it was the last job, within a Windows VM, of task https://lhcathome.cern.ch/lhcathome/result.php?resultid=236627529#