Message boards : Theory Application : Theory's endless looping
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 34617 - Posted: 13 Mar 2018, 19:28:36 UTC

Another one here
11:11:14 +0100 2018-03-13 [INFO] New Job Starting in slot1
11:11:15 +0100 2018-03-13 [INFO] Condor JobID: 29893.172 in slot1
11:11:20 +0100 2018-03-13 [INFO] MCPlots JobID: 43001477 in slot1

===> [runRivet] Tue Mar 13 11:11:15 CET 2018 [boinc pp top-mc 7000 - - sherpa 2.1.0 default 100000 128]

Quite a few:

Exception_Handler::SignalHandler: Signal (6) caught.
Cannot continue.
Exception_Handler::GenerateStackTrace(..): Generating stack trace

Followed by 9 hours of:

Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
until I manually reset the VM through VBox.
ID: 34617 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2090
Credit: 158,974,568
RAC: 124,490
Message 34621 - Posted: 14 Mar 2018, 5:44:15 UTC - in response to Message 34611.  


3. Thanks to all our volunteers! And tomorrow we pass 4 TRILLION EVENTS for Theory !!!!

Ben


Total number of generated events: 4000 billions

Exactly reached today!
Thank you, Ben, for your work, from us volunteers.
ID: 34621 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1276
Credit: 8,481,858
RAC: 1,977
Message 35028 - Posted: 18 Apr 2018, 5:46:30 UTC

This morning (April 18th) still running: ===> [runRivet] Tue Apr 17 08:34:51 CEST 2018 [boinc ppbar jets 1960 17 - sherpa 1.4.0 default 100000 158]
.
.
.
2400 events processed
Event 2500 ( 3m 27s elapsed / 2h 14m 41s left ) -> ETA: Tue Apr 17 11:57
2500 events processed
Updating display...
Display update finished (37 histograms, 2000 events).
Event 2600 ( 3m 39s elapsed / 2h 16m 56s left ) -> ETA: Tue Apr 17 11:59
2600 events processed
Error in Splitting_Tools::ConstructKinematics(kt = -nan, z = 0.9843, y = 0.00280905).
Event 2700 ( 3m 45s elapsed / 2h 15m 24s left ) -> ETA: Tue Apr 17 11:58
2700 events processed
Event 2800 ( 3m 50s elapsed / 2h 13m 34s left ) -> ETA: Tue Apr 17 11:56
2800 events processed
Updating display...
Display update finished (37 histograms, 2000 events).
Updating display...
Display update finished (37 histograms, 2000 events).
Updating display...
Display update finished (37 histograms, 2000 events).
Updating display...
Display update finished (37 histograms, 2000 events).
etc etc etc
ID: 35028 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2402
Credit: 225,606,766
RAC: 121,595
Message 35047 - Posted: 19 Apr 2018, 11:10:44 UTC

Got an endless looping Sherpa.
current local time: 2018-04-19 13:06 +0200

===> [runRivet] Wed Apr 18 23:29:16 CEST 2018 [boinc pp jets 7000 20,-,460 - sherpa 1.4.3 default 100000 158]


Display update finished (0 histograms, 0 events).
5.04459 pb +- ( 0.0294021 pb = 0.582843 % ) 450000 ( 1906011 -> 23.2 % )
integration time:  ( 1h 15m 46s elapsed / 0s left )   
2_4__j__j__j__j__j__j__NQ_0-4 : 5.04459 pb +- ( 0.0294021 pb = 0.582843 % )  exp. eff: 0.0246074 %
  reduce max for 2_4__j__j__j__j__j__j__NQ_0-4 to 0.892277 ( eps = 0.001 ) 
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative       : Signal_Processes
Perturbative       : Hard_Decays
Perturbative       : Jet_Evolution:CSS
Perturbative       : Lepton_FS_QED_Corrections:Photons
Perturbative       : Multiple_Interactions:Amisic
Perturbative       : Minimum_Bias:Off
Hadronization      : Beam_Remnants
Hadronization      : Hadronization:Ahadic
Hadronization      : Hadron_Decays
---------------------------------------------------------
#--------------------------------------------------------------------------
#                         FastJet release 3.0.3
#                 M. Cacciari, G.P. Salam and G. Soyez                  
#     A software package for jet finding and analysis at colliders      
#                           http://fastjet.fr                           
#                                                                       
# Please cite EPJC72(2012)1896 [arXiv:1111.6097] if you use this package
# for scientific work and optionally PLB641(2006)57 [hep-ph/0512210].   
#								      	   
# FastJet is provided without warranty under the terms of the GNU GPLv2.
# It uses T. Chan's closest pair algorithm, S. Fortune's Voronoi code
# and 3rd party plugin jet algorithms. See COPYING file for details.
#--------------------------------------------------------------------------
  Event 100 ( 6s elapsed / 1h 41m 23s left ) -> ETA: Thu Apr 19 02:56  
100 events processed
dumping histograms...
  Event 200 ( 12s elapsed / 1h 41m 27s left ) -> ETA: Thu Apr 19 02:56  
200 events processed
dumping histograms...
  Event 300 ( 19s elapsed / 1h 47m 20s left ) -> ETA: Thu Apr 19 03:02  
300 events processed
dumping histograms...
Error in Splitting_Tools::ConstructKinematics(kt = -nan, z = 0.55258, y = 0.409048).
  Event 400 ( 25s elapsed / 1h 44m 59s left ) -> ETA: Thu Apr 19 03:00  
400 events processed
dumping histograms...
Updating display...



25500 events processed
Updating display...
Display update finished (9 histograms, 25000 events).
  Event 25600 ( 29m 32s elapsed / 1h 25m 52s left ) -> ETA: Thu Apr 19 03:10  
25600 events processed
  Event 25700 ( 29m 39s elapsed / 1h 25m 43s left ) -> ETA: Thu Apr 19 03:10  
25700 events processed
  Event 25800 ( 29m 45s elapsed / 1h 25m 35s left ) -> ETA: Thu Apr 19 03:10  
25800 events processed
  Event 25900 ( 29m 51s elapsed / 1h 25m 26s left ) -> ETA: Thu Apr 19 03:10  
25900 events processed
  Event 26000 ( 29m 59s elapsed / 1h 25m 20s left ) -> ETA: Thu Apr 19 03:10  
  XS = 5.26364e+06 pb +- ( 32593.3 pb = 0.61 % )  
26000 events processed
dumping histograms...
  Event 26100 ( 30m 6s elapsed / 1h 25m 14s left ) -> ETA: Thu Apr 19 03:10  
26100 events processed
  Event 26200 ( 30m 12s elapsed / 1h 25m 6s left ) -> ETA: Thu Apr 19 03:10  
26200 events processed
Updating display...
Display update finished (9 histograms, 26000 events).
Updating display...
Display update finished (9 histograms, 26000 events).
Updating display...
Display update finished (9 histograms, 26000 events).

etc ...
ID: 35047 · Report as offensive     Reply Quote
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 35080 - Posted: 23 Apr 2018, 21:13:34 UTC

16:29:48 +0200 2018-04-23 [INFO] Condor JobID: 56135.1 in slot1
16:29:53 +0200 2018-04-23 [INFO] MCPlots JobID: 43589835 in slot1

===> [runRivet] Mon Apr 23 16:29:48 CEST 2018 [boinc ee zhad 133 - - sherpa 1.4.0 default 100000 160]

Looks OK through all the initial setup stuff then gets stuck doing the actual events

Event 16300 ( 16m 37s elapsed / 1h 25m 21s left ) -> ETA: Mon Apr 23 18:49
16300 events processed
Event 16400 ( 16m 43s elapsed / 1h 25m 13s left ) -> ETA: Mon Apr 23 18:49
16400 events processed
Updating display...
Display update finished (55 histograms, 16000 events).
Updating display...
Display update finished (55 histograms, 16000 events).
Updating display...
Display update finished (55 histograms, 16000 events).
.......

I manually reset the VM through VBox so as not to waste the remaining 3+ hours until the self-termination cut-off.
ID: 35080 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2402
Credit: 225,606,766
RAC: 121,595
Message 35103 - Posted: 28 Apr 2018, 7:36:42 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=188132743

===> [runRivet] Fri Apr 27 22:19:48 CEST 2018 [boinc ppbar jets 1960 90 - sherpa 1.4.1 default 100000 162]


Display update finished (0 histograms, 0 events).
310.032 pb +- ( 2.91949 pb = 0.941674 % ) 430000 ( 938273 -> 47.7 % )
integration time:  ( 38m 11s elapsed / 1m 47s left )   
Updating display...
Display update finished (0 histograms, 0 events).
309.189 pb +- ( 2.84765 pb = 0.921006 % ) 440000 ( 959082 -> 47.8 % )
integration time:  ( 39m 4s elapsed / 53s left )   
Updating display...
Display update finished (0 histograms, 0 events).
309.549 pb +- ( 2.8056 pb = 0.906353 % ) 450000 ( 979971 -> 47.8 % )
integration time:  ( 39m 55s elapsed / 0s left )   
2_4__j__j__j__j__j__j__NQ_0-4 : 309.549 pb +- ( 2.8056 pb = 0.906353 % )  exp. eff: 0.0227298 %
  reduce max for 2_4__j__j__j__j__j__j__NQ_0-4 to 0.960342 ( eps = 0.001 ) 
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative       : Signal_Processes
Perturbative       : Hard_Decays
Perturbative       : Jet_Evolution:CSS
Perturbative       : Lepton_FS_QED_Corrections:Photons
Perturbative       : Multiple_Interactions:Amisic
Perturbative       : Minimum_Bias:Off
Hadronization      : Beam_Remnants
Hadronization      : Hadronization:Ahadic
Hadronization      : Hadron_Decays
---------------------------------------------------------
#--------------------------------------------------------------------------
#                         FastJet release 3.0.3
#                 M. Cacciari, G.P. Salam and G. Soyez                  
#     A software package for jet finding and analysis at colliders      
#                           http://fastjet.fr                           
#                                                                       
# Please cite EPJC72(2012)1896 [arXiv:1111.6097] if you use this package
# for scientific work and optionally PLB641(2006)57 [hep-ph/0512210].   
#								      	   
# FastJet is provided without warranty under the terms of the GNU GPLv2.
# It uses T. Chan's closest pair algorithm, S. Fortune's Voronoi code
# and 3rd party plugin jet algorithms. See COPYING file for details.
#--------------------------------------------------------------------------
#-------------------------------------------------------------------------
# You are running the CDF MidPoint plugin for FastJet                     
# This is based on an implementation provided by Joey Huston.             
# If you use this plugin, please cite                                     
#   G. C. Blazey et al., hep-ex/0005012.                                  
# in addition to the usual FastJet reference.                             
#-------------------------------------------------------------------------
Error in Splitting_Tools::ConstructKinematics(kt = -nan, z = 0.518482, y = 0.614622).
  Event 100 ( 13s elapsed / 3h 47m 6s left ) -> ETA: Sat Apr 28 03:09  
100 events processed
dumping histograms...
  Event 200 ( 24s elapsed / 3h 27m 20s left ) -> ETA: Sat Apr 28 02:49  
200 events processed
dumping histograms...
Updating display...
Display update finished (37 histograms, 200 events).
  Event 300 ( 39s elapsed / 3h 41m 23s left ) -> ETA: Sat Apr 28 03:04  
300 events processed


Display update finished (37 histograms, 41000 events).
  Event 41400 ( 1h 26m 34s elapsed / 2h 2m 33s left ) -> ETA: Sat Apr 28 02:51  
41400 events processed
  Event 41500 ( 1h 26m 47s elapsed / 2h 2m 20s left ) -> ETA: Sat Apr 28 02:51  
41500 events processed
  Event 41600 ( 1h 26m 58s elapsed / 2h 2m 6s left ) -> ETA: Sat Apr 28 02:51  
41600 events processed
Error in Splitting_Tools::ConstructKinematics(kt = -nan, z = 0.985119, y = 0.00521327).
  Event 41700 ( 1h 27m 9s elapsed / 2h 1m 51s left ) -> ETA: Sat Apr 28 02:51  
41700 events processed
  Event 41800 ( 1h 27m 21s elapsed / 2h 1m 38s left ) -> ETA: Sat Apr 28 02:51  
41800 events processed
Updating display...
Display update finished (37 histograms, 41000 events).
Error in Splitting_Tools::ConstructKinematics(kt = -nan, z = 0.427847, y = 0.591376).
  Event 41900 ( 1h 27m 37s elapsed / 2h 1m 29s left ) -> ETA: Sat Apr 28 02:51  
41900 events processed
Updating display...
Display update finished (37 histograms, 41000 events).
Updating display...
Display update finished (37 histograms, 41000 events).
Updating display...
Display update finished (37 histograms, 41000 events).
Updating display...
etc. etc. etc. (ETA: Sat Apr 28 02:51; NOW: Sat Apr 28 09:27)
ID: 35103 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2402
Credit: 225,606,766
RAC: 121,595
Message 35115 - Posted: 29 Apr 2018, 18:03:29 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=188554418
Shut it down after several idle hours.


===> [runRivet] Sun Apr 29 06:58:40 CEST 2018 [boinc ee zhad 206 - - sherpa 1.4.5 default 13000 166]

2.49119 pb +- ( 0.0124095 pb = 0.498135 % ) 300000 ( 347924 -> 86.5 % )
integration time:  ( 2m 57s(2m 40s) elapsed / 6s(5s) left )   
2.49366 pb +- ( 0.0122476 pb = 0.49115 % ) 310000 ( 359559 -> 86.5 % )
integration time:  ( 3m 2s(2m 45s) elapsed / 0s(0s) left )   
2_4__e-__e+__j__j__j__j : 2.49366 pb +- ( 0.0122476 pb = 0.49115 % )  exp. eff: 0.403783 %
  reduce max for 2_4__e-__e+__j__j__j__j to 0.793593 ( eps = 0.001 ) 
Process_Group::CalculateTotalXSec(): Calculate xs for '2_5__e-__e+__j__j__j__j__j' (Comix)
Starting the calculation. Lean back and enjoy ... .
  Exception_Handler::GenerateStackTrace(..): Generating stack trace 
  {
  }
  
  Exception_Handler::SignalHandler: Signal (6) caught. 
     Cannot continue.
  Exception_Handler::GenerateStackTrace(..): Generating stack trace 
  {
  }
  Exception_Handler::GenerateStackTrace(..): Generating stack trace 
  {
  }
  
etc etc etc (more than 30 times)

then for many hours:
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).

etc etc etc
ID: 35115 · Report as offensive     Reply Quote
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35120 - Posted: 30 Apr 2018, 15:20:32 UTC - in response to Message 35080.  

I manually reset the VM through VBox so as not to waste the remaining 3+ hours until the self-termination cut-off.


I have one of these misbehaving sherpa jobs too. How does one manually reset the VM through VBox? Does it have any advantage over aborting the task (by clicking the Abort button in boinc manager)? Aborting the task means you don't get any credits for it. Do you get credits if you reset the VM?

I've been playing with the VBoxManage command in a terminal but can't get the syntax right. Also, it looks like you need to be in the correct directory or something. Seems something like..

VBoxManage controlvm <uuid|vmname> reset


...ought to do it but I don't know where/how to get the uuid or vmname.
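
As far as I know, `VBoxManage list runningvms` prints one line per running VM in the form "name" {uuid}, which covers the missing piece above. A minimal Python sketch, assuming VBoxManage is on the PATH and is run as the same user that owns the BOINC VMs; the function names `parse_vm_list` and `running_vms` are made up for illustration:

```python
import re
import subprocess

# Matches one line of `VBoxManage list runningvms` output: "name" {uuid}
VM_LINE = re.compile(r'^"(?P<name>.*)"\s+\{(?P<uuid>[0-9a-fA-F-]+)\}\s*$')


def parse_vm_list(text):
    """Return a list of (name, uuid) tuples parsed from VBoxManage list output."""
    vms = []
    for line in text.splitlines():
        match = VM_LINE.match(line.strip())
        if match:
            vms.append((match.group("name"), match.group("uuid")))
    return vms


def running_vms():
    """Ask VirtualBox for the running VMs (assumes VBoxManage is on the PATH)."""
    result = subprocess.run(
        ["VBoxManage", "list", "runningvms"],
        capture_output=True, text=True, check=True,
    )
    return parse_vm_list(result.stdout)
```

Calling `running_vms()` returns the name and UUID pairs; either value should be usable in place of `<uuid|vmname>` in the controlvm command quoted above.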
ID: 35120 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2402
Credit: 225,606,766
RAC: 121,595
Message 35122 - Posted: 30 Apr 2018, 15:58:06 UTC - in response to Message 35120.  

If you are really sure (and you should triplecheck that!) that the scientific application got stuck in an endless loop and will not end before the 18h limit, run "touch path/to/your/BOINC_client/slots/x/shared/shutdown", with x=relevant_slot_number.

This will immediately kill the running tasks inside the VM but will do a graceful shutdown of the VM itself and will ensure you get the credits.
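
The same step can be scripted. A minimal Python sketch, assuming the slot layout from the touch command above (slots/x/shared under the BOINC data directory); `boinc_dir` and the function name `request_shutdown` are placeholders, not part of BOINC. It only creates the empty file and deliberately refuses to create missing directories, since a missing shared directory most likely means a wrong path or slot number:

```python
from pathlib import Path


def request_shutdown(boinc_dir, slot):
    """Create the empty 'shutdown' file in slots/<slot>/shared and return its path."""
    shared = Path(boinc_dir) / "slots" / str(slot) / "shared"
    if not shared.is_dir():
        # Refuse to create directories: a missing 'shared' means a wrong path or slot.
        raise FileNotFoundError(f"no shared directory at {shared}")
    marker = shared / "shutdown"
    marker.touch()  # same effect as the shell command `touch`
    return marker
```

For example, `request_shutdown("path/to/your/BOINC_client", 1)` would target slot 1.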
ID: 35122 · Report as offensive     Reply Quote
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35123 - Posted: 30 Apr 2018, 17:15:48 UTC - in response to Message 35122.  

Thanks for that.
To be honest I am only quite sure because I only doublechecked it. You see I don't know how to triplecheck it but I am willing to learn. I wouldn't want to kill a sherpa without being really sure.
ID: 35123 · Report as offensive     Reply Quote
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 35129 - Posted: 1 May 2018, 10:57:28 UTC - in response to Message 35120.  

If you Abort the Task you'll get No credit for it. If you reset the VM it will likely run to the 18hr cutoff so you'll probably get more for the individual task but the same pro-rata amount.

Rather than digging through slots etc., I do it through the GUI.
I don't look at EVERY job or task, but Tasks normally end a little after 12 hours, once the job in progress at that time has finished, so the first clue is a 13hr+ runtime. In View Console, press Alt+F2 and check that it is "Updating Display" but the number of events isn't increasing. If you can be bothered, leave it a few mins then check again to be sure. If there is any doubt, I look in View Graphics to see what kind of job is running. The looping only seems to affect A FEW Sherpas, so any other variety can be left. In "View Console" note the "localhost" number. In VBox, find the VM that shows that number +1 under Remote Desktop, right-click it and select Reset.
ID: 35129 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1276
Credit: 8,481,858
RAC: 1,977
Message 35133 - Posted: 1 May 2018, 20:16:54 UTC - in response to Message 35122.  

If you are really sure (and you should triplecheck that!) that the scientific application got stuck in an endless loop and will not end before the 18h limit, run "touch path/to/your/BOINC_client/slots/x/shared/shutdown", with x=relevant_slot_number.

This will immediately kill the running tasks inside the VM but will do a graceful shutdown of the VM itself and will ensure you get the credits.

For Windows users create a batch file like the one below and change the part D:\Boinc1 for your situation:

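@echo off
rem Ask for the slot number of the looping Theory task.
set "slotdir="
set /p "slotdir=In which slot-directory is the endless Theory task running you want to kill? "
rem Stop if nothing was entered, so no file is written to a wrong path.
if not defined slotdir exit /b 1
set "boincpath=D:\Boinc1\slots\%slotdir%\shared"
rem Create an empty file named "shutdown" in the shared directory.
copy /y NUL "%boincpath%\shutdown" >NUL
exit /b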
ID: 35133 · Report as offensive     Reply Quote
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35134 - Posted: 1 May 2018, 20:55:33 UTC - in response to Message 35129.  

Thanks Ray. Though I didn't say it earlier, I was looking for a method I can use in a terminal. The method computezrmle described will work nicely for me.

The reason I prefer that method is because if one wants to bother dealing with the loopers then the bother should be minimal and the benefit maximal. So I'm thinking about how to deal with loopers via a Python script that detects them and then deals with them in some so far undecided way. The reason I haven't decided how to deal with them is because they aren't always "dead" when they appear to be. Before I spend time on code that deals with them, I want to be sure I can code a way to detect them reliably or at least somewhat reliably. The degree of reliability might determine what the code does with them or whether it should do anything at all (which is to say maybe such a script is pointless).

I thought I had the detection part of the code nailed down but today discovered that I do not. The code does pretty much what has been described in this thread. Every 5 minutes it gets the running.log (same log we see when we highlight a Theory task and click Show graphics, etc) and checks if the number of processed events is increasing. Specifically, it looks at the "Display update finished..." lines. If the last 40 of those lines are all the same then the job is deemed to be looping, a zombie in other words.
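
The check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual script: the function name `looks_like_looper` is hypothetical, the log is assumed to be available as one string, and the regular expression is modelled on the "Display update finished (x histograms, y events)" lines pasted in this thread.

```python
import re

# Matches e.g. "Display update finished (37 histograms, 2000 events)."
DISPLAY_LINE = re.compile(r"Display update finished \((\d+) histograms, (\d+) events\)")


def looks_like_looper(log_text, window=40):
    """True if the last `window` display-update lines all report the same counts."""
    updates = DISPLAY_LINE.findall(log_text)
    if len(updates) < window:
        return False  # not enough evidence yet
    return len(set(updates[-window:])) == 1
```

One caveat visible in the logs earlier in this thread: during Sherpa's integration phase the display legitimately reports "0 histograms, 0 events" many times in a row, so a small window could flag a healthy job that simply has not started generating events yet.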

That seemed to be pretty reliable but today I discovered that if I allow a looper to continue then sometimes it resumes incrementing the processed events again and runs to normal completion. So the question is.... when is enough "proof" actually enough? I could change the 40 to a 60 or even 80 but that doesn't guarantee a job won't come back to life and achieve its events target (which seems to almost always be 100K events).

So now I am thinking the only sane way for the script to deal with what appears to be a looper is to let it continue and, if it's still not processing events when the task (the BOINC task) reaches the 18 hour limit, then save the log (or at least the run parameters, they're easily parsed out of the log) to a file and attach that file to an email to an LHC team member. If what Crystal Pellet has claimed in this thread (that the loopers don't get reported) is true then how else would the team know?
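
Parsing the run parameters out of the log could look like the sketch below. It is only an illustration: the function name `run_parameters` is hypothetical, and the token positions for generator and version are inferred from the "===> [runRivet] ..." lines pasted in this thread, not from any project documentation, so the raw tokens are returned as well.

```python
import re

# Matches e.g.
# "===> [runRivet] Mon Apr 23 16:29:48 CEST 2018 [boinc ee zhad 133 - - sherpa 1.4.0 default 100000 160]"
RUN_LINE = re.compile(r"===> \[runRivet\] (?P<started>.+?) \[(?P<params>[^\]]+)\]")


def run_parameters(log_text):
    """Return the start time and raw parameter tokens of the first runRivet line, or None."""
    match = RUN_LINE.search(log_text)
    if match is None:
        return None
    tokens = match.group("params").split()
    return {
        "started": match.group("started"),
        "tokens": tokens,
        # Positions inferred from the examples in this thread, not from project docs.
        "generator": tokens[6] if len(tokens) > 6 else None,
        "version": tokens[7] if len(tokens) > 7 else None,
    }
```

The returned dictionary is small enough to write to a text file next to the saved log.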

If I can work the script into something that serves a useful purpose to the project then I'll make it available for download. Comments appreciated from all.
ID: 35134 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2402
Credit: 225,606,766
RAC: 121,595
Message 35135 - Posted: 2 May 2018, 8:15:36 UTC

Is this log snippet useful?
Do you need other parts of the logs?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=188663763
===> [runRivet] Wed May  2 01:30:50 CEST 2018 [boinc ppbar uemb-soft 63 - - sherpa 2.1.0 default 1000 172]

integration time:  ( 6m 34s (6m 16s) elapsed / 13s (13s) left ) [01:44:43]   
1.05733e+09 pb +- ( 3.31997e+06 pb = 0.313996 % ) 310000 ( 673992 -> 45.4 % )
integration time:  ( 6m 47s (6m 28s) elapsed / 0s (0s) left ) [01:44:55]   
2_2__j__j__j__j : 1.05733e+09 pb +- ( 3.31997e+06 pb = 0.313996 % )  exp. eff: 0.406201 %
  reduce max for 2_2__j__j__j__j to 0.568866 ( eps = 0.001 ) 
Output_Phase::Output_Phase(): Set output interval 1000000000 events.
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative       : Signal_Processes
Perturbative       : Hard_Decays
Perturbative       : Jet_Evolution:CSS
Perturbative       : Lepton_FS_QED_Corrections:Photons
Perturbative       : Multiple_Interactions:Amisic
Perturbative       : Minimum_Bias:Off
Hadronization      : Beam_Remnants
Hadronization      : Hadronization:Ahadic
Hadronization      : Hadron_Decays
Analysis           : HepMC2
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...

etc etc etc
ID: 35135 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2402
Credit: 225,606,766
RAC: 121,595
Message 35136 - Posted: 2 May 2018, 8:44:17 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=188666962
This one has lots of errors like:
METS_Scale_Setter::SetScales(): Failed to determine \mu.

I didn't cancel it as its ETA is before the 18h limit.
ID: 35136 · Report as offensive     Reply Quote
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35140 - Posted: 2 May 2018, 20:18:54 UTC - in response to Message 35135.  

Is this log snippet useful?
Do you need other parts of the logs?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=188663763
===> [runRivet] Wed May  2 01:30:50 CEST 2018 [boinc ppbar uemb-soft 63 - - sherpa 2.1.0 default 1000 172]

integration time:  ( 6m 34s (6m 16s) elapsed / 13s (13s) left ) [01:44:43]   
1.05733e+09 pb +- ( 3.31997e+06 pb = 0.313996 % ) 310000 ( 67


Thank you but I have an ample number of logs saved to disk. If I need more I can easily obtain them from the web server running in the VM.

I think I have it figured out now. I accidentally changed the counter that counts the number of consecutive occurrences of the same "Display update finished (x histograms, y events)" line from a local var to a global var, which prevented it from being initialized to 0 before each count. As a result the script assumed "it's a looper" before the desired criteria were met. Not surprising, then, that those jobs eventually started processing events again.

With that error corrected I have restored confidence that the script can reliably detect loopers. I added a little to it to help me test its reliability easier and more thoroughly. If that goes well I will make a github project for it.
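
For anyone writing something similar, the counter problem mentioned in this post is the kind that a local variable avoids. A hypothetical sketch (the name `trailing_repeats` is made up, and this is not the actual script): the counter is created inside the function, so each check starts from zero instead of carrying over a value from the previous poll.

```python
def trailing_repeats(lines):
    """Count how many times the last line repeats consecutively at the end of `lines`."""
    count = 0  # local, so every call starts from zero
    for line in reversed(lines):
        if line != lines[-1]:
            break
        count += 1
    return count
```

Calling it twice on the same list gives the same answer both times, which is exactly what a module-level counter without a reset would not do.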
ID: 35140 · Report as offensive     Reply Quote
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 35141 - Posted: 2 May 2018, 20:40:22 UTC - in response to Message 35136.  

https://lhcathome.cern.ch/lhcathome/result.php?resultid=188666962
This one has lots of errors like:
METS_Scale_Setter::SetScales(): Failed to determine \mu.

I didn't cancel it as its ETA is before the 18h limit.


In the logs I've noticed numerous phrases that indicate errors, but it seems to me many of those errors are not fatal or critical. They sometimes cause a temporary glitch, but eventually the processed events count starts increasing again and the job reaches its target number of events. Again, sometimes. Other times the target is not reached.

So far I've been unable to deduce which error phrases indicate a fatal error (unrecoverable, fatal to the job) and which indicate just a temporary glitch. I think it would take years to deduce that from the logs so I won't waste time even trying. The project scientists and generator devs could likely tell me though I think they're more likely to tell me to go look at the sherpa code and figure it out for myself. Maybe someday. For now I'll stick with watching the processed events counter.
ID: 35141 · Report as offensive     Reply Quote
Peter Skands

Joined: 31 Jan 11
Posts: 12
Credit: 3,557,813
RAC: 0
Message 35142 - Posted: 3 May 2018, 1:07:28 UTC - in response to Message 35141.  

Hi all,

Thanks as always for your dedication and patience. As Crystal mentions, the looping issue only affects a few Sherpa jobs, and as I think I've stated elsewhere, we still want to run those Sherpa versions to be able to display comparisons with them - as long as they are being used actively by researchers in the community. In the major new update of the jobs we are planning to start sending out shortly, some of the oldest Sherpa versions that we don't deem are being used any more will be deprecated, and those also appear to be the most frequent loopers. I cannot promise that there will not still be *some* looping jobs happening in the newer versions. At least a partial consolation is that, as far as I know, you are at least still getting credits for them, even though I totally understand that it is frustrating that your CPU is basically idling during those jobs and not contributing to science.

The way the particle physics software development works is that once a version is publicly released, it is not changed any more, not even to fix bugs, even serious ones. Bug fixes and patches are of course developed and applied (and the presence of a serious bug can cause us to 'withdraw' a version or at least issue a strong recommendation not to use it), but patches and fixes go into future versions; past ones are never 'patched backwards'. The reason we do it that way is for reproducibility, which is extremely important in science, so that a given public version always produces the same results (even when this includes losing some jobs). That means that not only you guys, but also researchers who run these versions on their own computers, or on clusters, have no choice but to accept basically that there is an 'efficiency' of the jobs which is not 100% (but still rather close to 100% as far as I know).

I keep being impressed and amazed at what the volunteer community is capable of; the development of a script that some of you guys have been talking about, that automatically detects the looping run condition and gracefully shuts down the job, is extremely nice. Although we don't have a lot of manpower currently for upgrading the run software, I will try hard to see if we can incorporate such a trigger in our default job setups, so that this could be done automatically. This is really great work!

All the best,
Peter
ID: 35142 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2402
Credit: 225,606,766
RAC: 121,595
Message 35145 - Posted: 3 May 2018, 5:11:02 UTC - in response to Message 35142.  

Hi Peter,

Thanks for your detailed explanation.

As I understand, the logs of those jobs hitting the 18h limit are lost.
Thus error analysis (and improvements based on it) is nearly impossible, even if you have the resources.

In the past I posted a few logfile snippets to give your developers a chance but they were chosen just by looking at posts from other volunteers, e.g. CP.
If you need other parts of the logfile you may post it here.
ID: 35145 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2402
Credit: 225,606,766
RAC: 121,595
Message 35203 - Posted: 9 May 2018, 9:17:04 UTC

Another looping Sherpa:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=189954119

===> [runRivet] Wed May  9 07:28:49 CEST 2018 [boinc ppbar uemb-soft 53 - - sherpa 2.1.1 default 1000 178]

...

Updating display...
Display update finished (0 histograms, 0 events).
7.6288e+08 pb +- ( 2.60407e+06 pb = 0.341348 % ) 260000 ( 564306 -> 45.9 % )
integration time:  ( 5m 2s elapsed / 58s left ) [07:42:35]   
7.62494e+08 pb +- ( 2.50208e+06 pb = 0.328144 % ) 280000 ( 608421 -> 45.7 % )
integration time:  ( 5m 27s elapsed / 35s left ) [07:43:01]   
Updating display...
Display update finished (0 histograms, 0 events).
7.63022e+08 pb +- ( 2.40571e+06 pb = 0.315287 % ) 300000 ( 651789 -> 45.8 % )
integration time:  ( 5m 53s elapsed / 12s left ) [07:43:28]   
7.62597e+08 pb +- ( 2.35488e+06 pb = 0.308797 % ) 310000 ( 673633 -> 45.8 % )
integration time:  ( 6m 6s elapsed / 0s left ) [07:43:41]   
2_2__j__j__j__j : 7.62597e+08 pb +- ( 2.35488e+06 pb = 0.308797 % )  exp. eff: 0.267886 %
  reduce max for 2_2__j__j__j__j to 0.734463 ( eps = 0.001 ) 
Output_Phase::Output_Phase(): Set output interval 1000000000 events.
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative       : Signal_Processes
Perturbative       : Hard_Decays
Perturbative       : Jet_Evolution:CSS
Perturbative       : Lepton_FS_QED_Corrections:Photons
Perturbative       : Multiple_Interactions:Amisic
Perturbative       : Minimum_Bias:Off
Hadronization      : Beam_Remnants
Hadronization      : Hadronization:Ahadic
Hadronization      : Hadron_Decays
Analysis           : HepMC2
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).

...
etc etc etc
ID: 35203 · Report as offensive     Reply Quote


©2024 CERN