Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners

computezrmle
Joined: 15 Jun 08
Posts: 1141
Credit: 56,176,329
RAC: 96,457
Message 38460 - Posted: 27 Mar 2019, 7:32:42 UTC - in response to Message 38459.  

So we should abort long running tasks? I can do that. What is the cut-off line? 3 hours? 5? Or?

There's no reason to abort any task just because it's a longrunner.
Even if it takes a couple of days to complete.

It's Theory's vbox version that makes us believe that tasks must end within 18h.

The native version gives you the opportunity to check the app's logfile and see if there is any progress.
tail -fn500 /your_local_boinc_dir/slots/x/cernvm/shared/runRivet.log


The only reason to abort a longrunner is if you shut down/reboot your computer every few hours and a task needs more than that time.
In this case it would be better to run Theory inside a VM as it preserves its state.
ID: 38460
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 753
Credit: 6,033,177
RAC: 1,095
Message 38462 - Posted: 27 Mar 2019, 8:30:33 UTC - in response to Message 38450.  

2 days 5 hours runtime - yes, we need some time for these "small" tasks.
Good point. Modifying my watchdog script to abort sherpa if it's configured for more than 2K events. Or maybe the limit should be 4K events?
There is no one-to-one relationship between the number of events and the run time. I've noticed that the combination of beam "ee" and "sherpa" fails rather often.
If you want to avoid sherpa long-runners with your watchdog, just abort all sherpas at the beginning. It's up to you. It's your time, your machine and your electricity.

Someone else will do the job.

Or maybe allow the user to select the limit.
This is impossible. You cannot choose the kind of Theory generator, let alone the number of events.
ID: 38462
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,251,600
RAC: 9,814
Message 38464 - Posted: 27 Mar 2019, 11:10:53 UTC - in response to Message 38460.  

The native version gives you the opportunity to check the app's logfile and see if there is any progress.
tail -fn500 /your_local_boinc_dir/slots/x/cernvm/shared/runRivet.log

That's manual work most volunteers won't want to do, especially if they run native Theory on more than 1 or 2 hosts. Hence the need for a watchdog script to automate the chore of checking for progress.
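As a rough illustration of what the simplest automated check could look like, here is a minimal Python sketch. It is not the watchdog script mentioned in this thread; the function name `log_is_stale` and the idle threshold are made up for the example. It only tests whether runRivet.log has been written to recently, so it would catch a task that has gone silent but not one that keeps logging without ever converging.

```python
import os
import time


def log_is_stale(log_path, max_idle_seconds, now=None):
    """Return True if the log file has not been modified for more than
    max_idle_seconds. 'now' can be passed in for testing; by default the
    current time is used. (Hypothetical helper, not part of BOINC.)"""
    if now is None:
        now = time.time()
    last_write = os.path.getmtime(log_path)
    return (now - last_write) > max_idle_seconds
```

It would be called with the path from the tail command above, for example `log_is_stale("/your_local_boinc_dir/slots/x/cernvm/shared/runRivet.log", 3600)`, once per slot and per host.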
ID: 38464
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,251,600
RAC: 9,814
Message 38466 - Posted: 27 Mar 2019, 12:00:23 UTC - in response to Message 38462.  

2 days 5 hours runtime - yes, we need some time for these "small" tasks.
Good point. Modifying my watchdog script to abort sherpa if it's configured for more than 2K events. Or maybe the limit should be 4K events?
There is no one-to-one relationship between the number of events and the run time. I've noticed that the combination of beam "ee" and "sherpa" fails rather often.
If you want to avoid sherpa long-runners with your watchdog, just abort all sherpas at the beginning. It's up to you. It's your time, your machine and your electricity.

Someone else will do the job.

And lots of those someone elses will run into the task deadline. What will be the response to their complaints... you may deselect native Theory and select Theory VBox instead?

Or maybe allow the user to select the limit.
This is impossible. You cannot choose the kind of Theory generator, let alone the number of events.

I know we cannot choose the generator and number of events. I know those parameters are set by the server and they are immutable.
I meant that when the task starts, the script compares the job's target events to the user's selected limit, and if the target exceeds the user's limit, the script aborts the task. It does not try to substitute the user's events limit for the target events value sent by the server.
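A minimal Python sketch of that comparison, under some loud assumptions: it reads the "===> [runRivet] ... [boinc ...]" header line that appears in the logs posted in this thread, and it assumes the second-to-last field is the event count and the fifth-from-last is the generator name. That field layout is inferred from the posted examples only, not from any documentation. The names `parse_job_header` and `should_abort` are hypothetical.

```python
import re

# Matches the bracketed job description at the end of a runRivet header line.
HEADER_RE = re.compile(r"\[runRivet\].*\[(boinc [^\]]+)\]")


def parse_job_header(line):
    """Extract generator, version and event count from a header line.
    Field positions are an assumption based on the examples in this thread.
    Returns None if the line does not look like a job header."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    fields = match.group(1).split()
    if len(fields) < 5:
        return None
    try:
        events = int(fields[-2])  # assumed: second-to-last field = events
    except ValueError:
        return None
    return {"generator": fields[-5], "version": fields[-4], "events": events}


def should_abort(line, user_limit, generator="sherpa"):
    """True if the job uses the given generator and its target event count
    exceeds the user's limit. The server's value is only read, never changed."""
    job = parse_job_header(line)
    return (
        job is not None
        and job["generator"] == generator
        and job["events"] > user_limit
    )
```

The actual abort would be a separate step left to the watchdog; as noted above, the event count is not a reliable predictor of run time, so a limit like this will also reject jobs that would have finished.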
ID: 38466
Gunde

Joined: 9 Jan 15
Posts: 37
Credit: 274,364,137
RAC: 424,551
Message 38471 - Posted: 27 Mar 2019, 15:57:35 UTC

I follow a few native tasks and one of them runs sherpa. Its runtime was at 2 days 1 hour yesterday. Today I saw it had timed out.

We might need to extend the time limit for these sherpas.

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=109636098
ID: 38471
computezrmle
Joined: 15 Jun 08
Posts: 1141
Credit: 56,176,329
RAC: 96,457
Message 38486 - Posted: 29 Mar 2019, 11:39:53 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=220001720
CPU time now >2d 9h
Due date was yesterday.

===> [runRivet] Mon Mar 25 21:09:50 UTC 2019 [boinc pp winclusive 7000 -,-,10 - sherpa 1.2.3 default 2000 34]


Typical logfile entry:
ISR_Handler::MakeISR(..): s' out of bounds.
  s'_{min}, s'_{max 1,2} vs. s': 0.0049, 49000000, 49000000 vs. 0.0049
Channel_Elements::DiceYForward(0.035562635397831,{-8.98847e+307,0,-8.98847e+307,0,0,},{-10,10,-8.05275,}):  Y out of bounds ! 
   ymin, ymax vs. y : -1.66823 1.66823 vs. -1.66823
Setting y to lower bound  ymin=-1.66823
ID: 38486
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 753
Credit: 6,033,177
RAC: 1,095
Message 38495 - Posted: 30 Mar 2019, 9:53:14 UTC - in response to Message 38486.  
Last modified: 30 Mar 2019, 10:33:15 UTC

Time to kick this one out https://lhcathome.cern.ch/lhcathome/result.php?resultid=220223274
===> [runRivet] Fri Mar 29 16:23:55 UTC 2019 [boinc pp winclusive 7000 -,-,10 - sherpa 1.4.5 default 1000 36]

boinc pp winclusive 7000 -,-,10 - sherpa 1.4.5 default -- 19 attempts and 19 times unsuccessful.
The job above has had no success with any sherpa version from 1.2.2p up to 2.2.5, only losses or failures.

Channel_Elements::GenerateYBackward(1.4427957388279e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,8.04076}):  Y out of bounds ! 
Setting y to upper bound ymax=10
Channel_Elements::GenerateYForward(1.0823267357558e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,8.52864}):  Y out of bounds ! 
Setting y to lower bound  ymin=-10
ISR_Handler::MakeISR(..): s' out of bounds.
ISR_Handler::MakeISR(..): s' out of bounds.
ISR_Handler::MakeISR(..): s' out of bounds.
ISR_Handler::MakeISR(..): s' out of bounds.
ISR_Handler::MakeISR(..): s' out of bounds.
Channel_Elements::GenerateYBackward(2.1679207931566e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,-1.56701}):  Y out of bounds ! 
Setting y to upper bound ymax=10
Channel_Elements::GenerateYBackward(4.031082361288e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,1.46371}):  Y out of bounds ! 
Setting y to upper bound ymax=10
Channel_Elements::GenerateYBackward(5.3198662953565e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,-9.58728}):  Y out of bounds ! 
Setting y to upper bound ymax=10
Channel_Elements::GenerateYForward(1.8783382701291e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,4.04184}):  Y out of bounds ! 
Setting y to lower bound  ymin=-10
ISR_Handler::MakeISR(..): s' out of bounds.
Channel_Elements::GenerateYForward(1.30712e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,-0.622555}):  Y out of bounds ! 
Setting y to lower bound  ymin=-10
ISR_Handler::MakeISR(..): s' out of bounds.
Channel_Elements::GenerateYBackward(3.6462283122869e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,0.84522}):  Y out of bounds ! 
Setting y to upper bound ymax=10
Channel_Elements::GenerateYBackward(1.2477413842638e-08,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,7.2587}):  Y out of bounds ! 
Setting y to upper bound ymax=9.0996728598133
ISR_Handler::MakeISR(..): s' out of bounds.
ISR_Handler::MakeISR(..): s' out of bounds.
Channel_Elements::GenerateYForward(5.9744231820348e-10,{-8.98847e+307,0,-8.98847e+307,0,0},{-10,10,3.65001}):  Y out of bounds ! 
Setting y to lower bound  ymin=-10
ISR_Handler::MakeISR(..): s' out of bounds.
ID: 38495
computezrmle
Joined: 15 Jun 08
Posts: 1141
Credit: 56,176,329
RAC: 96,457
Message 39140 - Posted: 17 Jun 2019, 16:08:54 UTC

This task has been running for more than 2 days:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=233367127

Remaining integration time is continuously increasing.
Will cancel it now.

===> [runRivet] Sat Jun 15 15:46:52 UTC 2019 [boinc ee zhad 133 - - sherpa 2.2.5 default 2000 69]

Display update finished (0 histograms, 0 events).
9.89891e+14 pb +- ( 2.46843e+14 pb = 24.9364 % ) 365700000 ( 365700130 -> 99.9 % )
integration time:  ( 23h 50m 32s elapsed / 617d 2h 50m 23s left ) [15:48:49]   
9.89837e+14 pb +- ( 2.4683e+14 pb = 24.9364 % ) 365720000 ( 365720130 -> 99.9 % )
integration time:  ( 23h 50m 37s elapsed / 617d 3h 40m 9s left ) [15:49:00]   
9.89783e+14 pb +- ( 2.46816e+14 pb = 24.9364 % ) 365740000 ( 365740130 -> 99.9 % )
integration time:  ( 23h 50m 42s elapsed / 617d 4h 30m 33s left ) [15:49:11]   
9.89729e+14 pb +- ( 2.46803e+14 pb = 24.9364 % ) 365760000 ( 365760130 -> 99.9 % )
integration time:  ( 23h 50m 47s elapsed / 617d 5h 21m 2s left ) [15:49:23]   
9.89675e+14 pb +- ( 2.46789e+14 pb = 24.9364 % ) 365780000 ( 365780130 -> 99.9 % )
integration time:  ( 23h 50m 52s elapsed / 617d 6h 9m 40s left ) [15:49:33]   
9.89621e+14 pb +- ( 2.46776e+14 pb = 24.9364 % ) 365800000 ( 365800130 -> 99.9 % )
integration time:  ( 23h 50m 56s elapsed / 617d 6h 58m 43s left ) [15:49:42]   
Updating display...
Display update finished (0 histograms, 0 events).
9.89567e+14 pb +- ( 2.46762e+14 pb = 24.9364 % ) 365820000 ( 365820130 -> 99.9 % )
integration time:  ( 23h 51m 1s elapsed / 617d 7h 46m 31s left ) [15:49:51]   
9.89512e+14 pb +- ( 2.46749e+14 pb = 24.9364 % ) 365840000 ( 365840130 -> 99.9 % )
integration time:  ( 23h 51m 6s elapsed / 617d 8h 34m 13s left ) [15:49:59]   
9.89458e+14 pb +- ( 2.46735e+14 pb = 24.9364 % ) 365860000 ( 365860130 -> 99.9 % )
integration time:  ( 23h 51m 10s elapsed / 617d 9h 23m 47s left ) [15:50:10]   
9.89404e+14 pb +- ( 2.46722e+14 pb = 24.9364 % ) 365880000 ( 365880130 -> 99.9 % )
integration time:  ( 23h 51m 15s elapsed / 617d 10h 15m 19s left ) [15:50:23]   
9.8935e+14 pb +- ( 2.46708e+14 pb = 24.9364 % ) 365900000 ( 365900130 -> 99.9 % )
integration time:  ( 23h 51m 20s elapsed / 617d 11h 4m 40s left ) [15:50:32]
ID: 39140
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 303
Credit: 232,584
RAC: 454
Message 39193 - Posted: 26 Jun 2019, 14:32:43 UTC - in response to Message 39140.  
Last modified: 26 Jun 2019, 14:33:09 UTC

This task has been running for more than 2 days:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=233367127

Remaining integration time is continuously increasing.
Will cancel it now.

===> [runRivet] Sat Jun 15 15:46:52 UTC 2019 [boinc ee zhad 133 - - sherpa 2.2.5 default 2000 69]


Thanks. I managed to find this job and ran it myself. I have the output and can detect that it is a long runner. What do you think should be the limit for long running jobs?
ID: 39193
computezrmle
Joined: 15 Jun 08
Posts: 1141
Credit: 56,176,329
RAC: 96,457
Message 39202 - Posted: 27 Jun 2019, 12:28:13 UTC - in response to Message 39193.  

What do you think should be the limit for long running jobs?

Hard to say.

From time to time there are jobs that run a couple of days and finish successfully.

In some cases you find jobs in mcplots that do not have a single success; all attempts are lost.
Cancel them?
What if the limits are just a bit too strict?
(Remember the Vogons!)


In this special case I cancelled it because the "estimated runtime left" was more than 617 days and increasing.
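The criterion used here (remaining time very large and still growing) could be checked automatically. The following Python sketch is only an illustration under assumptions: it parses the "integration time: ( ... elapsed / ... left )" lines exactly as they appear in the log excerpts posted in this thread, and the names `to_seconds`, `remaining_times` and `looks_runaway`, as well as the sample count and threshold, are invented for the example. Because some jobs run for days and still finish, any threshold risks being "a bit too strict".

```python
import re

# Matches Sherpa progress lines as they appear in the posted excerpts, e.g.
# "integration time:  ( 23h 50m 32s elapsed / 617d 2h 50m 23s left ) [15:48:49]"
TIME_RE = re.compile(
    r"integration time:\s*\(\s*(.+?)\s+elapsed\s*/\s*(.+?)\s+left\s*\)"
)
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a duration such as '617d 2h 50m 23s' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*([dhms])", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


def remaining_times(log_lines):
    """Return the 'time left' values (in seconds) found in the log lines."""
    found = []
    for line in log_lines:
        match = TIME_RE.search(line)
        if match:
            found.append(to_seconds(match.group(2)))
    return found


def looks_runaway(log_lines, min_left_seconds, min_samples=3):
    """True if the last min_samples estimates are strictly increasing and
    the newest one exceeds min_left_seconds (both values are assumptions)."""
    left = remaining_times(log_lines)
    if len(left) < min_samples:
        return False
    recent = left[-min_samples:]
    increasing = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return increasing and recent[-1] > min_left_seconds
```

Fed with the last few hundred lines of runRivet.log and a threshold of, say, 30 days, this would flag the task above while leaving alone a job whose estimate is shrinking.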
ID: 39202
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,251,600
RAC: 9,814
Message 39210 - Posted: 27 Jun 2019, 19:23:21 UTC - in response to Message 39202.  

What if the limits are just a bit too strict?


I assume we're speaking about problems with TheoryN, not Theory VBox.

1) Set task duration to 4 days (on my hosts it seems 95% finish in < 3 days)
2) Set (deadline + days of grace) = 15
3) Allow users a way to extend the task duration if they wish.
4) Allow graceful shutdown

I realize 4 is tricky and might take some time to implement. It may even be impossible. Do 1, 2 & 3 for now. Maybe 4 later.
ID: 39210
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 303
Credit: 232,584
RAC: 454
Message 39221 - Posted: 28 Jun 2019, 9:05:21 UTC - in response to Message 39210.  

I have just added some logging and will try to investigate the issues in more detail. There are three categories:

    * looping jobs (infinite time)
    * slow jobs that are reasonable
    * slow jobs that take too long


Will look at the results after the weekend once we have some data.

ID: 39221
computezrmle
Joined: 15 Jun 08
Posts: 1141
Credit: 56,176,329
RAC: 96,457
Message 39222 - Posted: 28 Jun 2019, 14:06:43 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=234031232
More than 951d left.
Will cancel it now.

[boinc ee zhad 22 - - sherpa 2.2.5 default 2000 72]
Display update finished (0 histograms, 0 events).
6.82996e+15 pb +- ( 1.38502e+15 pb = 20.2786 % ) 834080000 ( 834080795 -> 99.9 % )
integration time:  ( 2d 7h 38m elapsed / 951d 4h 43m 15s left ) [13:53:19]   
Poincare::Poincare(): Inaccurate rotation {
  a    = (0.242668,-0.156307,-0.957434)
  b    = (0,0,1)
  a'   = (0.0476847,0.16094,0.985812) -> rel. dev. (inf,inf,-0.0141884)
  m_ct = -0.957434
  m_st = -0.288652
  m_n  = (0,-8.26615e-07,1.3495e-07)
}
Poincare::Poincare(): Inaccurate rotation {
  a    = (0.242668,-0.156307,-0.957434)
  b    = (0,0,1)
  a'   = (0.0476847,0.16094,0.985812) -> rel. dev. (inf,inf,-0.0141884)
  m_ct = -0.957434
  m_st = -0.288652
  m_n  = (0,-8.26615e-07,1.3495e-07)
}
Poincare::Poincare(): Inaccurate rotation {
  a    = (0.16653,-0.245943,-0.954872)
  b    = (0,0,1)
  a'   = (0.133856,0.247181,0.959679) -> rel. dev. (inf,inf,-0.0403208)
  m_ct = -0.954872
  m_st = -0.297019
  m_n  = (0,-1.34884e-06,3.47416e-07)
}
Poincare::Poincare(): Inaccurate rotation {
  a    = (0.16653,-0.245943,-0.954872)
  b    = (0,0,1)
  a'   = (0.133856,0.247181,0.959679) -> rel. dev. (inf,inf,-0.0403208)
  m_ct = -0.954872
  m_st = -0.297019
  m_n  = (0,-1.34884e-06,3.47416e-07)
}
6.82979e+15 pb +- ( 1.38498e+15 pb = 20.2786 % ) 834100000 ( 834100795 -> 99.9 % )
integration time:  ( 2d 7h 38m 5s elapsed / 951d 5h 17m 47s left ) [13:53:30]   
6.82963e+15 pb +- ( 1.38495e+15 pb = 20.2786 % ) 834120000 ( 834120795 -> 99.9 % )
integration time:  ( 2d 7h 38m 10s elapsed / 951d 5h 52m 2s left ) [13:53:39]
ID: 39222
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,251,600
RAC: 9,814
Message 39223 - Posted: 28 Jun 2019, 18:06:39 UTC - in response to Message 39221.  

I have just added some logging and will try to investigate the issues in more detail. There are three categories:

    * looping jobs (infinite time)
    * slow jobs that are reasonable
    * slow jobs that take too long


Will look at the results after the weekend once we have some data.



Sounds good. Is the additional logging directed to output files on our hosts or to files on the server we cannot access?
ID: 39223
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 303
Credit: 232,584
RAC: 454
Message 39235 - Posted: 1 Jul 2019, 9:04:05 UTC - in response to Message 39223.  
Last modified: 1 Jul 2019, 9:04:21 UTC

Sounds good. Is the additional logging directed to output files on our hosts or to files on the server we cannot access?

It is in the standard error output (stderr) of the VM jobs. The job type is printed.
ID: 39235
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 303
Credit: 232,584
RAC: 454
Message 39242 - Posted: 1 Jul 2019, 15:03:33 UTC - in response to Message 39222.  

I have a message back from the Theory team saying that for a few weeks now there has been a filter which stops particular looping jobs from being sent. However, I have nevertheless investigated for myself and here are the results:

Over the past week there have been over 14K successful jobs on Native Theory and only ~ 10 have been problematic. I ordered the results by disk usage and elapsed time.
The largest disk usage of a successful job was 44.45 MB. Of the six failures that were higher, 4 were long runners aborted by users and 2 hit the disk limit (200MB).
Reducing the disk limit to 50MB should catch the problematic jobs.
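For volunteers who want to apply the same heuristic locally, a small Python sketch follows. It is an assumption-laden illustration, not how the server-side limit is enforced: it sums the file sizes under a directory (for example a BOINC slot directory) and compares the total against a limit, treating 1 MB as 1024 * 1024 bytes, which may differ from how the project counts. The function names are hypothetical.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files below root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # File vanished between listing and stat; ignore it.
                pass
    return total


def exceeds_disk_limit(root, limit_mb=50):
    """True if the directory uses more than limit_mb.
    Assumption: 1 MB = 1024 * 1024 bytes."""
    return directory_size_bytes(root) > limit_mb * 1024 * 1024
```

The 50 MB default mirrors the figure proposed above; with a largest successful job at 44.45 MB the margin is small, so a local check like this is better used for logging than for aborting.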
ID: 39242
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,251,600
RAC: 9,814
Message 39277 - Posted: 4 Jul 2019, 15:36:43 UTC - in response to Message 39242.  

I have a message back from the Theory team saying that for a few weeks now there has been a filter which stops particular looping jobs from being sent. However, I have nevertheless investigated for myself and here are the results:

Over the past week there have been over 14K successful jobs on Native Theory and only ~ 10 have been problematic. I ordered the results by disk usage and elapsed time.
The largest disk usage of a successful job was 44.45 MB. Of the six failures that were higher, 4 were long runners aborted by users and 2 hit the disk limit (200MB).
Reducing the disk limit to 50MB should catch the problematic jobs.


The filter seems to be working well. I've caught a few sherpas recently and some have been long runners but they finish successfully. The forever loopers seem to be gone.
Thank you, Laurence :-)
ID: 39277
Jim1348

Joined: 15 Nov 14
Posts: 330
Credit: 10,802,450
RAC: 18,174
Message 39278 - Posted: 4 Jul 2019, 15:58:59 UTC - in response to Message 39277.  

The filter seems to be working well. I've caught a few sherpas recently and some have been long runners but they finish successfully. The forever loopers seem to be gone.

That is good enough for me. I don't mind long ones, since my machines run 24/7 anyway. But the age of the universe is probably longer than LHC will run.
ID: 39278
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,251,600
RAC: 9,814
Message 39279 - Posted: 4 Jul 2019, 16:17:31 UTC - in response to Message 39278.  

The filter seems to be working well. I've caught a few sherpas recently and some have been long runners but they finish successfully. The forever loopers seem to be gone.

That is good enough for me. I don't mind long ones, since my machines run 24/7 anyway. But the age of the universe is probably longer than LHC will run.


The filter has obsoleted major portions of my watchdog script, at least for this workflow. Next workflow maybe not so much.
Toying with the notion of turning my watchdog into "Sherpa hunter" which rejects all the easy jobs and crunches only sherpas with a bad rep from McPlots.
<gloat>
Pythias are like sixtrack. Anybody can do 'em.
</gloat>
ID: 39279
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 753
Credit: 6,033,177
RAC: 1,095
Message 39283 - Posted: 4 Jul 2019, 20:34:37 UTC
Last modified: 5 Jul 2019, 14:55:12 UTC

===> [runRivet] Thu Jul 4 15:08:56 CEST 2019 [boinc pp jets 7000 600 - sherpa 2.2.2 default 34000 72]
.
.
.
15:08:56 +0200 2019-07-04 [INFO] New Job Starting in slot1
15:08:56 +0200 2019-07-04 [INFO] Condor JobID: 501528.68 in slot1
15:09:01 +0200 2019-07-04 [INFO] MCPlots JobID: 50471019 in slot1
16:24:27 +0200 2019-07-05 [INFO] Job finished in slot1 with 0.

Btw, this was not a native task. It was the last job of a task run inside a Windows VM: https://lhcathome.cern.ch/lhcathome/result.php?resultid=236627529#
ID: 39283


©2019 CERN