Theory's endless looping
Author | Message |
---|---|
Send message Joined: 14 Jan 10 Posts: 1274 Credit: 8,480,147 RAC: 2,155 |
The first one on LHC@home... ...and of course it's a sherpa: ===> [runRivet] Mon Nov 28 01:40:08 CET 2016 [boinc pp uemb-soft 53 - - sherpa 2.1.1 default 2000 736] . . . integration time: ( 10m 15s elapsed / 21s left ) [02:03:44] 7.54237e+08 pb +- ( 2.34247e+06 pb = 0.310575 % ) 310000 ( 662951 -> 46.4 % ) integration time: ( 10m 38s elapsed / 0s left ) [02:04:07] 2_2__j__j__j__j : 7.54237e+08 pb +- ( 2.34247e+06 pb = 0.310575 % ) exp. eff: 0.486253 % reduce max for 2_2__j__j__j__j to 0.693577 ( eps = 0.001 ) Output_Phase::Output_Phase(): Set output interval 1000000000 events. ---------------------------------------------------------- -- SHERPA generates events with the following structure -- ---------------------------------------------------------- Perturbative : Signal_Processes Perturbative : Hard_Decays Perturbative : Jet_Evolution:CSS Perturbative : Lepton_FS_QED_Corrections:Photons Perturbative : Multiple_Interactions:Amisic Perturbative : Minimum_Bias:Off Hadronization : Beam_Remnants Hadronization : Hadronization:Ahadic Hadronization : Hadron_Decays Analysis : HepMC2 Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). etc etc etc |
Send message Joined: 1 Sep 04 Posts: 139 Credit: 2,579 RAC: 0 |
Met Peter Skands at CERN last Thursday and Laurence and I told him about looping and failing Theory jobs that we see sometimes. He said that the suite of generator code needs updating and that this is work in progress, but takes a lot of effort. |
Send message Joined: 14 Jan 10 Posts: 1274 Credit: 8,480,147 RAC: 2,155 |
"Met Peter Skands at CERN last Thursday and Laurence and I told him about looping and failing Theory jobs that we see sometimes. He said that the suite of generator code needs updating and that this is work in progress, but takes a lot of effort." Thanks Ben. I know the team is looking after it. But for the 'new' crunchers over here, it's good to know that a Theory job within the VM may sometimes run endlessly and will only be stopped by the maximum BOINC task time of 18 hours. btw: Peter Up Above for a visit ... |
Send message Joined: 14 Jan 10 Posts: 1274 Credit: 8,480,147 RAC: 2,155 |
Another one: ===> [runRivet] Mon Dec 12 17:09:51 CET 2016 [boinc ee zhad 91.2 - - sherpa 1.4.5 default 80000 756] . . . Process_Group::CalculateTotalXSec(): Calculate xs for '2_5__e-__e+__j__j__j__j__j' (Comix) Starting the calculation. Lean back and enjoy ... . and then 32 times Exception_Handler::GenerateStackTrace(..): Generating stack trace { } Exception_Handler::SignalHandler: Signal (6) caught. Cannot continue. Exception_Handler::GenerateStackTrace(..): Generating stack trace { followed by endless Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). |
Send message Joined: 30 May 08 Posts: 93 Credit: 5,160,246 RAC: 0 |
Crystal Pellet wrote: But for the 'new' crunchers over here, it's good to know that sometimes a Theory-job within the VM may run endless and will only be stopped by the maximum BOINC-task time of 18 hours... I'm not sure this is relevant, but two of the Theory tasks I've completed recently (109863415 and 109863451) show a difference between the run time and CPU time of 9-10 hours. Are these examples of what Crystal Pellet is referring to? Would the project admins/scientists like to have these examples brought to their attention? Regards, MarkR |
Send message Joined: 1 Sep 04 Posts: 139 Credit: 2,579 RAC: 0 |
I will ask them to look at these issues. Thanks for the input! |
Send message Joined: 14 Jan 10 Posts: 1274 Credit: 8,480,147 RAC: 2,155 |
"... show a difference between the run time and CPU time of 9-10 hours. Are these examples of what Crystal Pellet is referring to?" In most of the endless loops I've seen, the job uses the normal CPU load, so a big difference between elapsed time and CPU time used must have another cause. With the first task you mentioned (109863415), the VM seems to have had a problem with the 6th job. I'm not sure what, but next time, when you have time for babysitting, you can access the log files inside the VM with BOINC Manager's 'Show graphics'. The second task you mentioned (109863451) seems to be a different problem, which I have also seen several times. The VM did not get a new job after the last result was uploaded. Normally the VM should be stopped after about 10 minutes of idling, but that does not always happen. |
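The distinction made above (a looping job keeps the CPU busy, an idle VM does not) can be checked from the two times shown on a task's result page. The following is only a rough sketch: the function names and the one-hour threshold are assumptions for illustration, not part of BOINC or the project.

```python
def cpu_gap_hours(run_time_s: float, cpu_time_s: float) -> float:
    """Hours of wall-clock time during which the task used no CPU."""
    return max(run_time_s - cpu_time_s, 0.0) / 3600.0


def looks_idle(run_time_s: float, cpu_time_s: float,
               min_gap_hours: float = 1.0) -> bool:
    """True when the gap is large enough to suggest an idle or waiting VM.

    The default threshold of one hour is an arbitrary illustrative choice.
    """
    return cpu_gap_hours(run_time_s, cpu_time_s) >= min_gap_hours


if __name__ == "__main__":
    # Hypothetical task: 15 h elapsed, 5.5 h of CPU time.
    print(cpu_gap_hours(15 * 3600, 5.5 * 3600))
    print(looks_idle(15 * 3600, 5.5 * 3600))
```

A large gap points toward the idle-VM case rather than a looping generator, which would show CPU time close to the elapsed time.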
Send message Joined: 14 Jan 10 Posts: 1274 Credit: 8,480,147 RAC: 2,155 |
Another one: ===> [runRivet] Sun Dec 18 14:41:25 CET 2016 [boinc ee zhad 197 - - sherpa 1.4.5 default 100000 752] . . . Process_Group::CalculateTotalXSec(): Calculate xs for '2_5__e-__e+__j__j__j__j__j' (Comix) Starting the calculation. Lean back and enjoy ... . and then 32 times Exception_Handler::GenerateStackTrace(..): Generating stack trace { } Exception_Handler::SignalHandler: Signal (6) caught. Cannot continue. Exception_Handler::GenerateStackTrace(..): Generating stack trace { followed by endless Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). |
Send message Joined: 14 Jan 10 Posts: 1274 Credit: 8,480,147 RAC: 2,155 |
I have seen another sherpa with 2000 events looping before. Another one: ===> [runRivet] Fri Dec 23 12:19:07 CET 2016 [boinc ppbar uemb-soft 53 - - sherpa 2.1.0 default 2000 766] . . . 7.60168e+08 pb +- ( 2.32687e+06 pb = 0.3061 % ) 300000 ( 682278 -> 43.4 % ) integration time: ( 9m 50s (9m 30s) elapsed / 19s (19s) left ) [12:40:54] 7.60399e+08 pb +- ( 2.28161e+06 pb = 0.300054 % ) 310000 ( 705049 -> 43.4 % ) integration time: ( 10m 11s (9m 51s) elapsed / 0s (0s) left ) [12:41:16] 2_2__j__j__j__j : 7.60399e+08 pb +- ( 2.28161e+06 pb = 0.300054 % ) exp. eff: 0.396514 % reduce max for 2_2__j__j__j__j to 0.564752 ( eps = 0.001 ) Output_Phase::Output_Phase(): Set output interval 1000000000 events. ---------------------------------------------------------- -- SHERPA generates events with the following structure -- ---------------------------------------------------------- Perturbative : Signal_Processes Perturbative : Hard_Decays Perturbative : Jet_Evolution:CSS Perturbative : Lepton_FS_QED_Corrections:Photons Perturbative : Multiple_Interactions:Amisic Perturbative : Minimum_Bias:Off Hadronization : Beam_Remnants Hadronization : Hadronization:Ahadic Hadronization : Hadron_Decays Analysis : HepMC2 Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). etc etc etc |
Send message Joined: 14 Jan 10 Posts: 1274 Credit: 8,480,147 RAC: 2,155 |
"I have seen another sherpa with 2000 events looping before. Another one:" ...and again: ===> [runRivet] Mon Jan 2 11:13:49 CET 2017 [boinc ppbar uemb-soft 53 - - sherpa 2.1.0 default 2000 774] |
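Since the same generator versions keep turning up in these reports, it can help to pull the generator, version and event count out of the `[runRivet]` header line when collecting cases. The sketch below is hedged: the field positions are inferred only from the headers quoted in this thread (the event count of 2000 matches the post above), and `JobHeader` and `parse_job_header` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Matches the bracketed job description that starts with "boinc".
HEADER_RE = re.compile(r"\[boinc\s+([^\]]+)\]")


class JobHeader(NamedTuple):
    generator: str
    version: str
    events: int


def parse_job_header(line: str) -> Optional[JobHeader]:
    """Extract generator, version and requested events from a runRivet header.

    Assumption: counted from the end, the fields are
    generator, version, tune, events, and one trailing number.
    """
    match = HEADER_RE.search(line)
    if match is None:
        return None
    fields = match.group(1).split()
    if len(fields) < 5:
        return None
    generator, version, _tune, events, _last = fields[-5:]
    if not events.isdigit():
        return None
    return JobHeader(generator, version, int(events))


if __name__ == "__main__":
    header = ("===> [runRivet] Mon Jan 2 11:13:49 CET 2017 "
              "[boinc ppbar uemb-soft 53 - - sherpa 2.1.0 default 2000 774]")
    print(parse_job_header(header))
```

Counting fields from the end keeps the sketch independent of how many beam and energy fields precede the generator name.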
Send message Joined: 14 Jan 10 Posts: 1274 Credit: 8,480,147 RAC: 2,155 |
===> [runRivet] Sun Mar 19 22:07:58 CET 2017 [boinc pp uemb-hard 900 - - pythia8 8.165 default-MBR 100000 832] . . . Pythia::next(): 65000 events have been generated 65000 events processed dumping histograms... 65100 events processed 65200 events processed 65300 events processed 65400 events processed 65500 events processed 65600 events processed 65700 events processed 65800 events processed 65900 events processed Updating display... Display update finished (6 histograms, 65000 events). Updating display... Display update finished (6 histograms, 65000 events). Updating display... Display update finished (6 histograms, 65000 events). Updating display... Display update finished (6 histograms, 65000 events). etc etc etc |
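The pattern in these logs (the display keeps updating, but the event count never changes) can be spotted mechanically. Here is a minimal sketch, assuming the log text has been copied out of the VM as a string; `stalled_updates`, `looks_looping` and the threshold are illustrative choices, not project tooling.

```python
import re

# Matches e.g. "Display update finished (6 histograms, 65000 events)."
UPDATE_RE = re.compile(
    r"Display update finished \((\d+) histograms, (\d+) events\)"
)


def stalled_updates(log_text: str) -> int:
    """Count the trailing display updates that all report the same event count."""
    counts = [int(m.group(2)) for m in UPDATE_RE.finditer(log_text)]
    if not counts:
        return 0
    last = counts[-1]
    run = 0
    for count in reversed(counts):
        if count != last:
            break
        run += 1
    return run


def looks_looping(log_text: str, threshold: int = 4) -> bool:
    """True when the last `threshold` updates show no progress (assumed cutoff)."""
    return stalled_updates(log_text) >= threshold


if __name__ == "__main__":
    sample = (
        "Display update finished (6 histograms, 64000 events). "
        + "Updating display... "
          "Display update finished (6 histograms, 65000 events). " * 4
    )
    print(stalled_updates(sample), looks_looping(sample))
```

Updates with 0 events also appear while sherpa is still integrating, so in practice the threshold would need to be generous enough to cover a normal start-up phase.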
Send message Joined: 22 Mar 17 Posts: 30 Credit: 360,676 RAC: 0 |
Looks like I have one of these. Condor JobID: 3087024.0 MCPlots JobID: 36498619 ===> [runRivet] Thu May 11 11:54:56 EEST 2017 [boinc ee zhad 200 - - sherpa 1.4.5 default 59000 890] 2.71157 pb +- ( 0.0134118 pb = 0.494614 % ) 310000 ( 365433 -> 84.9 % ) integration time: ( 2m 14s(2m 3s) elapsed / 0s(0s) left ) 2_4__e-__e+__j__j__j__j : 2.71157 pb +- ( 0.0134118 pb = 0.494614 % ) exp. eff: 0.375051 % reduce max for 2_4__e-__e+__j__j__j__j to 0.768368 ( eps = 0.001 ) Process_Group::CalculateTotalXSec(): Calculate xs for '2_5__e-__e+__j__j__j__j__j' (Comix) Starting the calculation. Lean back and enjoy ... . Exception_Handler::GenerateStackTrace(..): Generating stack trace { } Exception_Handler::SignalHandler: Signal (6) caught. Cannot continue. Exception_Handler::GenerateStackTrace(..): Generating stack trace { } Exception_Handler::GenerateStackTrace(..): Generating stack trace { } Repeated multiple times. Updating display... Display update finished (0 histograms, 0 events). Repeated once per minute or so. Isn't using any CPU any more. Guess I'll just reset the VM. |
Send message Joined: 15 Jun 08 Posts: 2401 Credit: 225,505,016 RAC: 125,081 |
Moved from here. I shut down the WU after 15 h of wall time via "touch shutdown" in the shared folder. At shutdown the Sherpa job had been running for nearly 13 h. https://lhcathome.cern.ch/lhcathome/result.php?resultid=145728371 2017-06-14 21:15:09 (28725): Guest Log: [INFO] New Job Starting in slot1 The currently running WU (on another host) is also a Theory task with a wall time of 11 h. After 12 successful jobs, a Sherpa job started 0.5 h ago and shows the same output: Updating display... https://lhcathome.cern.ch/lhcathome/result.php?resultid=145900136 2017-06-15 19:17:20 (20946): Guest Log: [INFO] New Job Starting in slot1 I will cancel the WU. |
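For anyone who prefers a script to the shell, the "touch shutdown" step described above amounts to creating an empty file named `shutdown` in the task's shared folder. This is only a sketch: the location of the shared folder differs per host and slot, so the path is left as an argument, and `request_shutdown` is a hypothetical helper name.

```python
from pathlib import Path


def request_shutdown(shared_dir: str) -> Path:
    """Create an empty 'shutdown' file in the given shared folder.

    Equivalent to running `touch shutdown` inside that folder.
    The folder path must be supplied by the caller (assumption:
    it is the shared folder of the running task's slot directory).
    """
    marker = Path(shared_dir) / "shutdown"
    marker.touch(exist_ok=True)
    return marker


if __name__ == "__main__":
    import tempfile

    # Demonstration against a temporary directory, not a real slot.
    with tempfile.TemporaryDirectory() as tmp:
        print(request_shutdown(tmp).name)
```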
Send message Joined: 14 Jan 10 Posts: 1274 Credit: 8,480,147 RAC: 2,155 |
Another one: ===> [runRivet] Mon Jun 19 09:06:07 CEST 2017 [boinc pp jets 7000 10 - sherpa 1.2.2p default 91000 960] . . . 43800 events processed Event 43900 ( 3h 45m 13s elapsed / 4h 1m 38s left ) -> ETA: Mon Jun 19 19:25 43900 events processed Display update finished (118 histograms, 43000 events). Event 44000 ( 3h 45m 36s elapsed / 4h 59s left ) -> ETA: Mon Jun 19 19:25 44000 events processed dumping histograms... Event 44100 ( 3h 46m elapsed / 4h 20s left ) -> ETA: Mon Jun 19 19:25 44100 events processed Updating display... Event 44200 ( 3h 46m 30s elapsed / 3h 59m 50s left ) -> ETA: Mon Jun 19 19:25 44200 events processed Display update finished (118 histograms, 44000 events). Updating display... Display update finished (118 histograms, 44000 events). Updating display... Display update finished (118 histograms, 44000 events). Updating display... Display update finished (118 histograms, 44000 events). Updating display... |
Send message Joined: 15 Jun 08 Posts: 2401 Credit: 225,505,016 RAC: 125,081 |
Again a sherpa longrunner with 0 output: https://lhcathome.cern.ch/lhcathome/result.php?resultid=147979345 Display update finished (0 histograms, 0 events). |
Send message Joined: 14 Jan 10 Posts: 1274 Credit: 8,480,147 RAC: 2,155 |
After the full optimization phase the job did not start processing events. It only shows "Updating display..." and has not processed any events for over 9 hours. ===> [runRivet] Tue Jun 27 01:02:43 CEST 2017 [boinc ppbar uemb-soft 53 - - sherpa 2.1.0 default 3000 928] . . . 7.62475e+08 pb +- ( 2.51417e+06 pb = 0.329738 % ) 280000 ( 590462 -> 47 % ) integration time: ( 37m 59s (34m 47s) elapsed / 4m 5s (3m 44s) left ) [02:23:11] Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). 7.62857e+08 pb +- ( 2.41552e+06 pb = 0.316641 % ) 300000 ( 632958 -> 47 % ) integration time: ( 40m 54s (37m 27s) elapsed / 1m 22s (1m 15s) left ) [02:26:06] Updating display... Display update finished (0 histograms, 0 events). 7.62407e+08 pb +- ( 2.3686e+06 pb = 0.310674 % ) 310000 ( 654232 -> 47 % ) integration time: ( 42m 22s (38m 48s) elapsed / 0s (0s) left ) [02:27:33] 2_2__j__j__j__j : 7.62407e+08 pb +- ( 2.3686e+06 pb = 0.310674 % ) exp. eff: 0.506582 % reduce max for 2_2__j__j__j__j to 0.673671 ( eps = 0.001 ) Output_Phase::Output_Phase(): Set output interval 1000000000 events. ---------------------------------------------------------- -- SHERPA generates events with the following structure -- ---------------------------------------------------------- Perturbative : Signal_Processes Perturbative : Hard_Decays Perturbative : Jet_Evolution:CSS Perturbative : Lepton_FS_QED_Corrections:Photons Perturbative : Multiple_Interactions:Amisic Perturbative : Minimum_Bias:Off Hadronization : Beam_Remnants Hadronization : Hadronization:Ahadic Hadronization : Hadron_Decays Analysis : HepMC2 Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). |
Send message Joined: 17 Sep 04 Posts: 99 Credit: 30,645,007 RAC: 2,147 |
Just want to be sure this is normal SHERPA output:
---------------------------------------------------------- Perturbative : Signal_Processes Perturbative : Hard_Decays Perturbative : Jet_Evolution:CSS Perturbative : Lepton_FS_QED_Corrections:Photons Perturbative : Multiple_Interactions:Amisic Perturbative : Minimum_Bias:Off Hadronization : Beam_Remnants Hadronization : Hadronization:Ahadic Hadronization : Hadron_Decays Analysis : HepMC2 Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events). Updating display... Display update finished (0 histograms, 0 events).
Regards, Bob P. |
Send message Joined: 1 Sep 04 Posts: 139 Credit: 2,579 RAC: 0 |
The Sherpa scientists have been contacted and are discussing the best way to deal with these cases. They are rare, but the system does try to "learn" from their occurrence and improve subsequent parameter choices, so they are not entirely wasted effort on your part. Thanks for all your input on this topic (and all your crunching!)... |
Send message Joined: 15 Jun 08 Posts: 2401 Credit: 225,505,016 RAC: 125,081 |
My currently running sherpa job shows huge numbers of the following error: METS_Scale_Setter::SetScales(): Failed to determine \mu. I'll let it run for the moment, as the number of processed events is slowly increasing, but I doubt that the WU will finish before the 18 h limit. current sherpa runtime: 309 s processed events: 22000 estimated runtime: 23.4 h Job data from stderr.txt: 2017-06-30 13:21:38 (7625): Guest Log: [INFO] New Job Starting in slot1 2017-06-30 13:21:38 (7625): Guest Log: [INFO] Condor JobID: 3831724.0 in slot1 2017-06-30 13:21:44 (7625): Guest Log: [INFO] MCPlots JobID: 37551391 in slot1 Any advice? [cancel|let it run] |
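The "cancel or let it run" question comes down to a linear extrapolation of the event rate against the 18 h task limit mentioned earlier in the thread. A rough sketch follows. The numbers in the example are assumptions: the post does not state the requested event count, and the runtime is read here as 309 minutes although the post writes "s"; with 100000 requested events that reading happens to reproduce the quoted 23.4 h.

```python
def estimated_total_hours(elapsed_hours: float, events_done: int,
                          events_requested: int) -> float:
    """Linear extrapolation of the total runtime from the events done so far."""
    if events_done <= 0:
        raise ValueError("no events processed yet, cannot extrapolate")
    return elapsed_hours * events_requested / events_done


def will_exceed_limit(elapsed_hours: float, events_done: int,
                      events_requested: int,
                      limit_hours: float = 18.0) -> bool:
    """True when the extrapolated runtime is longer than the task limit."""
    total = estimated_total_hours(elapsed_hours, events_done, events_requested)
    return total > limit_hours


if __name__ == "__main__":
    # Assumed inputs: 309 minutes elapsed, 22000 of 100000 events done.
    elapsed = 309 / 60
    print(round(estimated_total_hours(elapsed, 22000, 100000), 1))
    print(will_exceed_limit(elapsed, 22000, 100000))
```

The extrapolation assumes a constant event rate, which a job that is throwing errors may not keep.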
©2024 CERN