(Native) Theory - Sherpa looooooong runners
Author | Message |
---|---|
Send message Joined: 15 Jun 08 Posts: 2473 Credit: 245,701,514 RAC: 151,110 |
So we should abort long running tasks? I can do that. What is the cut-off line? 3 hours? 5? Or? There's no reason to abort any task just because it's a longrunner, even if it takes a couple of days to complete. It's Theory's vbox version that makes us believe that tasks must end within 18h. The native version gives you the opportunity to check the app's logfile and see if there is any progress: tail -fn500 /your_local_boinc_dir/slots/x/cernvm/shared/runRivet.log The only reason to abort a longrunner is if you shut down/reboot your computer every few hours and a task needs more than that time. In this case it would be better to run Theory inside a VM, as it preserves its state. |
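For anyone who would rather script the log check than run `tail` by hand, here is a minimal Python sketch that does the same job as `tail -n500` (without the follow mode). The function name `tail_lines` is made up for illustration, and the log path is whatever your own BOINC slot directory happens to be.

```python
from collections import deque


def tail_lines(log_path, n=500):
    """Return the last n lines of a text file, similar to `tail -n`."""
    # deque with maxlen keeps only the newest n lines while the file is read,
    # so memory use stays small even for very large logs.
    with open(log_path, "r", errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]
```

Calling `tail_lines(path_to_runRivet_log, 20)` would give the 20 most recent log lines as a list of strings, ready to be searched for signs of progress.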
Send message Joined: 14 Jan 10 Posts: 1346 Credit: 9,076,274 RAC: 8,350 |
There is no one to one relationship between the number of events and the run time. I've noticed that the combination beam "ee" and "sherpa" rather often fails. 2 days 5 hours runtime - yes, we need some time for these "small" tasks. Good point. Modifying my watchdog script to abort sherpa if it's configured for more than 2K events. Or maybe the limit should be 4K events? If you want to avoid sherpa long-runners with your watchdog, just abort all sherpas at the beginning. It's up to you. It's your time, your machine and your electricity. Someone else will do the job. Or maybe allow the user to select the limit. This is impossible. You cannot choose the kind of Theory generator, let alone the number of events. |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
The native version gives you the opportunity to check the app's logfile and see if there is any progress. That's manual work most volunteers won't want to do especially if they run native Theory on more than 1 or 2 hosts. Hence the need for a watchdog script to automate the chore of checking for progress. |
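One simple check such a watchdog could automate is "has the log been written to recently?". The sketch below is only an illustration of that idea, not the actual watchdog script discussed in this thread; the function names and the one-hour default are assumptions.

```python
import os
import time


def seconds_since_update(log_path, now=None):
    """Seconds elapsed since the log file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)


def looks_stalled(log_path, max_idle_seconds=3600, now=None):
    """True if the log has not been touched for longer than max_idle_seconds."""
    return seconds_since_update(log_path, now) > max_idle_seconds
```

Note that this only catches tasks that have gone silent. A job that loops while still writing to its log would pass this check and needs a different test.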
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
There is no one to one relationship between the number of events and the run time. I've noticed that the combination beam "ee" and "sherpa" rather often fails. 2 days 5 hours runtime - yes, we need some time for these "small" tasks. Good point. Modifying my watchdog script to abort sherpa if it's configured for more than 2K events. Or maybe the limit should be 4K events? And lots of those someone elses will run into the task deadline. What will be the response to their complaints... you may deselect native Theory and select Theory VBox instead? Or maybe allow the user to select the limit. This is impossible. You cannot choose the kind of Theory generator, let alone the number of events. I know we cannot choose the generator and number of events. I know those parameters are set by the server and they are immutable. I meant that when the task starts, the script compares the job's target events to the user's selected limit, and if the target exceeds the user's limit the script aborts the task. It does not try to substitute the user's events limit for the target events value sent by the server. |
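The comparison described above could look roughly like the Python sketch below. It assumes the bracketed `[boinc ...]` line in runRivet.log ends with generator, version, tune, number of events and seed, which matches the log excerpts posted in this thread but is not a documented format. The function names and the 2000-event default are hypothetical.

```python
import re

# Matches the bracketed job description, e.g.
# [boinc pp winclusive 7000 -,-,10 - sherpa 1.2.3 default 2000 34]
HEADER_RE = re.compile(r"\[boinc\s+([^\]]+)\]")


def parse_job_header(line):
    """Extract generator, version and target events from a runRivet header.

    Assumption: the last five fields are generator, version, tune,
    number of events and seed. Returns None if the line does not fit.
    """
    match = HEADER_RE.search(line)
    if match is None:
        return None
    fields = match.group(1).split()
    if len(fields) < 5:
        return None
    try:
        events = int(fields[-2])
    except ValueError:
        return None
    return {"generator": fields[-5], "version": fields[-4], "events": events}


def exceeds_event_limit(line, generator="sherpa", max_events=2000):
    """True if the job uses the given generator and targets more events than allowed."""
    job = parse_job_header(line)
    if job is None:
        return False
    return job["generator"] == generator and job["events"] > max_events
```

The script only reads the server's values and decides whether to abort; it never tries to change them.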
Send message Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0 |
I follow a few native tasks and one of them runs sherpa. Its runtime was at 2 days 1 hour yesterday. Today I saw that it had timed out. We might need to extend the time allowed for these sherpas. https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=109636098 |
Send message Joined: 15 Jun 08 Posts: 2473 Credit: 245,701,514 RAC: 151,110 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=220001720 CPU time now >2d 9h Due date was yesterday. ===> [runRivet] Mon Mar 25 21:09:50 UTC 2019 [boinc pp winclusive 7000 -,-,10 - sherpa 1.2.3 default 2000 34] Typical logfile entry: ISR_Handler::MakeISR(..): s' out of bounds. s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049 Channel_Elements::DiceYForward(0.035562635397831,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-8.05275,}): Y out of bounds ! ymin, ymax vs. y : -1.66823 1.66823 vs. -1.66823 Setting y to lower bound ymin=-1.66823 |
Send message Joined: 14 Jan 10 Posts: 1346 Credit: 9,076,274 RAC: 8,350 |
Time to kick this one out https://lhcathome.cern.ch/lhcathome/result.php?resultid=220223274 ===> [runRivet] Fri Mar 29 16:23:55 UTC 2019 [boinc pp winclusive 7000 -,-,10 - sherpa 1.4.5 default 1000 36] boinc pp winclusive 7000 -,-,10 - sherpa 1.4.5 default -- 19 attempts and 19 times unsuccessful. The job above with all sherpa versions from 1.2.2p up to 2.2.5 no success only losses or failures. Channel_Elements::GenerateYBackward(1.4427957388279e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,8.04076}): Y out of bounds ! Setting y to upper bound ymax=10 Channel_Elements::GenerateYForward(1.0823267357558e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,8.52864}): Y out of bounds ! Setting y to lower bound ymin=-10 ISR_Handler::MakeISR(..): s' out of bounds. ISR_Handler::MakeISR(..): s' out of bounds. ISR_Handler::MakeISR(..): s' out of bounds. ISR_Handler::MakeISR(..): s' out of bounds. ISR_Handler::MakeISR(..): s' out of bounds. Channel_Elements::GenerateYBackward(2.1679207931566e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,-1.56701}): Y out of bounds ! Setting y to upper bound ymax=10 Channel_Elements::GenerateYBackward(4.031082361288e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,1.46371}): Y out of bounds ! Setting y to upper bound ymax=10 Channel_Elements::GenerateYBackward(5.3198662953565e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,-9.58728}): Y out of bounds ! Setting y to upper bound ymax=10 Channel_Elements::GenerateYForward(1.8783382701291e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,4.04184}): Y out of bounds ! Setting y to lower bound ymin=-10 ISR_Handler::MakeISR(..): s' out of bounds. Channel_Elements::GenerateYForward(1.30712e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,-0.622555}): Y out of bounds ! Setting y to lower bound ymin=-10 ISR_Handler::MakeISR(..): s' out of bounds. Channel_Elements::GenerateYBackward(3.6462283122869e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,0.84522}): Y out of bounds ! 
Setting y to upper bound ymax=10 Channel_Elements::GenerateYBackward(1.2477413842638e-08,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,7.2587}): Y out of bounds ! Setting y to upper bound ymax=9.0996728598133 ISR_Handler::MakeISR(..): s' out of bounds. ISR_Handler::MakeISR(..): s' out of bounds. Channel_Elements::GenerateYForward(5.9744231820348e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,3.65001}): Y out of bounds ! Setting y to lower bound ymin=-10 ISR_Handler::MakeISR(..): s' out of bounds. |
Send message Joined: 15 Jun 08 Posts: 2473 Credit: 245,701,514 RAC: 151,110 |
This task is running for more than 2 days: https://lhcathome.cern.ch/lhcathome/result.php?resultid=233367127 Remaining integration time is continuously increasing. Will cancel it now. ===> [runRivet] Sat Jun 15 15:46:52 UTC 2019 [boinc ee zhad 133 - - sherpa 2.2.5 default 2000 69] Display update finished (0 histograms, 0 events). 9.89891e+14 pb +- ( 2.46843e+14 pb = 24.9364 % ) 365700000 ( 365700130 -> 99.9 % ) integration time: ( 23h 50m 32s elapsed / 617d 2h 50m 23s left ) [15:48:49] 9.89837e+14 pb +- ( 2.4683e+14 pb = 24.9364 % ) 365720000 ( 365720130 -> 99.9 % ) integration time: ( 23h 50m 37s elapsed / 617d 3h 40m 9s left ) [15:49:00] 9.89783e+14 pb +- ( 2.46816e+14 pb = 24.9364 % ) 365740000 ( 365740130 -> 99.9 % ) integration time: ( 23h 50m 42s elapsed / 617d 4h 30m 33s left ) [15:49:11] 9.89729e+14 pb +- ( 2.46803e+14 pb = 24.9364 % ) 365760000 ( 365760130 -> 99.9 % ) integration time: ( 23h 50m 47s elapsed / 617d 5h 21m 2s left ) [15:49:23] 9.89675e+14 pb +- ( 2.46789e+14 pb = 24.9364 % ) 365780000 ( 365780130 -> 99.9 % ) integration time: ( 23h 50m 52s elapsed / 617d 6h 9m 40s left ) [15:49:33] 9.89621e+14 pb +- ( 2.46776e+14 pb = 24.9364 % ) 365800000 ( 365800130 -> 99.9 % ) integration time: ( 23h 50m 56s elapsed / 617d 6h 58m 43s left ) [15:49:42] Updating display... Display update finished (0 histograms, 0 events). 
9.89567e+14 pb +- ( 2.46762e+14 pb = 24.9364 % ) 365820000 ( 365820130 -> 99.9 % ) integration time: ( 23h 51m 1s elapsed / 617d 7h 46m 31s left ) [15:49:51] 9.89512e+14 pb +- ( 2.46749e+14 pb = 24.9364 % ) 365840000 ( 365840130 -> 99.9 % ) integration time: ( 23h 51m 6s elapsed / 617d 8h 34m 13s left ) [15:49:59] 9.89458e+14 pb +- ( 2.46735e+14 pb = 24.9364 % ) 365860000 ( 365860130 -> 99.9 % ) integration time: ( 23h 51m 10s elapsed / 617d 9h 23m 47s left ) [15:50:10] 9.89404e+14 pb +- ( 2.46722e+14 pb = 24.9364 % ) 365880000 ( 365880130 -> 99.9 % ) integration time: ( 23h 51m 15s elapsed / 617d 10h 15m 19s left ) [15:50:23] 9.8935e+14 pb +- ( 2.46708e+14 pb = 24.9364 % ) 365900000 ( 365900130 -> 99.9 % ) integration time: ( 23h 51m 20s elapsed / 617d 11h 4m 40s left ) [15:50:32] |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
This task is running for more than 2 days: Thanks. I managed to find this job and ran it myself. I have the output and can detect that it is a long runner. What do you think should be the limit for long running jobs? |
Send message Joined: 15 Jun 08 Posts: 2473 Credit: 245,701,514 RAC: 151,110 |
What do you think should be the limit for long running jobs? Hard to say. From time to time there are jobs that run a couple of days and finish successfully. In some cases you find jobs in mcplots that have not a single success but are all lost. Cancel them? What if the limits are just a bit too strict? (Remember the Vogons!) In this special case I cancelled it because the "estimated runtime left" was more than 617 days and increasing. |
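The "time left is huge and still growing" criterion can be checked mechanically from the `integration time: ( ... elapsed / ... left )` lines that sherpa writes. The following is a sketch under assumptions: the function names, the five-sample window and the 30-day threshold are made up for illustration and are not a project-endorsed limit.

```python
import re

TIME_RE = re.compile(
    r"integration time:\s*\(\s*(.*?)\s*elapsed\s*/\s*(.*?)\s*left\s*\)"
)
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a duration such as '617d 2h 50m 23s' to seconds."""
    return sum(
        int(number) * UNIT_SECONDS[unit]
        for number, unit in re.findall(r"(\d+)\s*([dhms])", text)
    )


def remaining_times(log_text):
    """All 'time left' estimates found in the log text, in seconds, in order."""
    return [to_seconds(left) for _elapsed, left in TIME_RE.findall(log_text)]


def remaining_time_is_growing(log_text, min_samples=5, min_days_left=30):
    """True if the last few estimates rise strictly and exceed min_days_left."""
    left = remaining_times(log_text)
    if len(left) < min_samples:
        return False
    recent = left[-min_samples:]
    rising = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return rising and recent[-1] > min_days_left * 86400
```

A job whose estimate is long but shrinking would not be flagged, which leaves room for the legitimate multi-day runners mentioned above.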
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
What if the limits are just a bit too strict? I assume we're speaking of problems with TheoryN, not Theory VBox. 1) Set task duration to 4 days (on my hosts it seems 95% finish in < 3 days). 2) Set (deadline + days of grace) = 15. 3) Allow users a way to extend the task duration if they wish. 4) Allow graceful shutdown. I realize 4 is tricky and might take some time to implement. It may even be impossible. Do 1, 2 & 3 for now. Maybe 4 later. |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
I have just added some logging and will try to investigate the issues in more detail. There are three categories:
|
Send message Joined: 15 Jun 08 Posts: 2473 Credit: 245,701,514 RAC: 151,110 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=234031232 More than 951d left. Will cancel it now. [boinc ee zhad 22 - - sherpa 2.2.5 default 2000 72] Display update finished (0 histograms, 0 events). 6.82996e+15 pb +- ( 1.38502e+15 pb = 20.2786 % ) 834080000 ( 834080795 -> 99.9 % ) integration time: ( 2d 7h 38m elapsed / 951d 4h 43m 15s left ) [13:53:19] Poincare::Poincare(): Inaccurate rotation { a = (0.242668,-0.156307,-0.957434) b = (0,0,1) a' = (0.0476847,0.16094,0.985812) -> rel. dev. (inf,inf,-0.0141884) m_ct = -0.957434 m_st = -0.288652 m_n = (0,-8.26615e-07,1.3495e-07) } Poincare::Poincare(): Inaccurate rotation { a = (0.242668,-0.156307,-0.957434) b = (0,0,1) a' = (0.0476847,0.16094,0.985812) -> rel. dev. (inf,inf,-0.0141884) m_ct = -0.957434 m_st = -0.288652 m_n = (0,-8.26615e-07,1.3495e-07) } Poincare::Poincare(): Inaccurate rotation { a = (0.16653,-0.245943,-0.954872) b = (0,0,1) a' = (0.133856,0.247181,0.959679) -> rel. dev. (inf,inf,-0.0403208) m_ct = -0.954872 m_st = -0.297019 m_n = (0,-1.34884e-06,3.47416e-07) } Poincare::Poincare(): Inaccurate rotation { a = (0.16653,-0.245943,-0.954872) b = (0,0,1) a' = (0.133856,0.247181,0.959679) -> rel. dev. (inf,inf,-0.0403208) m_ct = -0.954872 m_st = -0.297019 m_n = (0,-1.34884e-06,3.47416e-07) } 6.82979e+15 pb +- ( 1.38498e+15 pb = 20.2786 % ) 834100000 ( 834100795 -> 99.9 % ) integration time: ( 2d 7h 38m 5s elapsed / 951d 5h 17m 47s left ) [13:53:30] 6.82963e+15 pb +- ( 1.38495e+15 pb = 20.2786 % ) 834120000 ( 834120795 -> 99.9 % ) integration time: ( 2d 7h 38m 10s elapsed / 951d 5h 52m 2s left ) [13:53:39] |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
I have just added some logging and will try to investigate the issues in more detail. There are three categories: Sounds good. Is the additional logging directed to output files on our hosts or to files on the server we cannot access? |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
Sounds good. Is the additional logging directed to output files on our hosts or to files on the server we cannot access? It is in the standard error output (stderr) of the VM jobs. The job type is printed. |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
I have a message back from the Theory team saying that for the past few weeks a filter has been in place that stops particular looping jobs from being sent. However, I have nevertheless investigated for myself and here are the results: Over the past week there have been over 14K successful jobs on Native Theory and only ~10 have been problematic. I ordered the results by disk usage and elapsed time. The largest disk usage of a successful job was 44.45 MB. Of the six failures with higher disk usage, 4 were long runners aborted by users and 2 exceeded the disk limit (200 MB). Reducing the disk limit to 50 MB should catch the problematic jobs. |
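Volunteers who want to watch for the same symptom locally could measure a slot directory's disk usage and compare it to the proposed 50 MB figure. This is a rough sketch only: the function names are hypothetical, and it assumes 1 MB = 1,000,000 bytes, which may not match how the server counts.

```python
import os


def dir_size_mb(path):
    """Total size of all regular files under path, in MB (10**6 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue  # do not count symlink targets
            try:
                total += os.path.getsize(file_path)
            except OSError:
                pass  # file vanished while scanning; ignore it
    return total / 1_000_000


def over_disk_limit(path, limit_mb=50.0):
    """True if the directory tree uses more than limit_mb."""
    return dir_size_mb(path) > limit_mb
```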
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
I have a message back from the Theory team saying that since a few weeks there is a filter which stops sending particular looping jobs. However, I have nevertheless investigated for myself and here are the results: The filter seems to be working well. I've caught a few sherpas recently and some have been long runners but they finish successfully. The forever loopers seem to be gone. Thank you, Laurence :-) |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
The filter seems to be working well. I've caught a few sherpas recently and some have been long runners but they finish successfully. The forever loopers seem to be gone. That is good enough for me. I don't mind long ones, since my machines run 24/7 anyway. But the age of the universe is probably longer than LHC will run. |
Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
The filter seems to be working well. I've caught a few sherpas recently and some have been long runners but they finish successfully. The forever loopers seem to be gone. The filter has obsoleted major portions of my watchdog script, at least for this workflow. Next workflow maybe not so much. Toying with the notion of turning my watchdog into a "Sherpa hunter" which rejects all the easy jobs and crunches only sherpas with a bad rep from McPlots. <gloat> Pythias are like sixtrack. Anybody can do 'em. </gloat> |
Send message Joined: 14 Jan 10 Posts: 1346 Credit: 9,076,274 RAC: 8,350 |
===> [runRivet] Thu Jul 4 15:08:56 CEST 2019 [boinc pp jets 7000 600 - sherpa 2.2.2 default 34000 72] . . . 15:08:56 +0200 2019-07-04 [INFO] New Job Starting in slot1 15:08:56 +0200 2019-07-04 [INFO] Condor JobID: 501528.68 in slot1 15:09:01 +0200 2019-07-04 [INFO] MCPlots JobID: 50471019 in slot1 16:24:27 +0200 2019-07-05 [INFO] Job finished in slot1 with 0. Btw, this was not a native task: it was the last job of a task run inside a Windows VM, https://lhcathome.cern.ch/lhcathome/result.php?resultid=236627529# |
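To read the job's runtime off the two `[INFO]` timestamps above, a small Python sketch is enough. The format string is inferred from the excerpt and the function names are illustrative.

```python
from datetime import datetime

# Timestamp layout as it appears in the log excerpt: "15:08:56 +0200 2019-07-04"
STAMP_FORMAT = "%H:%M:%S %z %Y-%m-%d"


def parse_stamp(text):
    """Parse a timestamp like '15:08:56 +0200 2019-07-04' (timezone-aware)."""
    return datetime.strptime(text, STAMP_FORMAT)


def job_runtime(start_text, end_text):
    """Wall-clock time between the job start and job finish stamps."""
    return parse_stamp(end_text) - parse_stamp(start_text)
```

For the job above, `job_runtime("15:08:56 +0200 2019-07-04", "16:24:27 +0200 2019-07-05")` works out to 1 day, 1 hour, 15 minutes and 31 seconds for 34000 sherpa events.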
©2024 CERN