
Theory's endless looping


Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,020
RAC: 1,945
Message 27978 - Posted: 28 Nov 2016, 7:39:59 UTC

The first one on LHC@home... ...and of course it's a sherpa:

===> [runRivet] Mon Nov 28 01:40:08 CET 2016 [boinc pp uemb-soft 53 - - sherpa 2.1.1 default 2000 736]
.
.
.
integration time: ( 10m 15s elapsed / 21s left ) [02:03:44]
7.54237e+08 pb +- ( 2.34247e+06 pb = 0.310575 % ) 310000 ( 662951 -> 46.4 % )
integration time: ( 10m 38s elapsed / 0s left ) [02:04:07]
2_2__j__j__j__j : 7.54237e+08 pb +- ( 2.34247e+06 pb = 0.310575 % ) exp. eff: 0.486253 %
reduce max for 2_2__j__j__j__j to 0.693577 ( eps = 0.001 )
Output_Phase::Output_Phase(): Set output interval 1000000000 events.
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative : Signal_Processes
Perturbative : Hard_Decays
Perturbative : Jet_Evolution:CSS
Perturbative : Lepton_FS_QED_Corrections:Photons
Perturbative : Multiple_Interactions:Amisic
Perturbative : Minimum_Bias:Off
Hadronization : Beam_Remnants
Hadronization : Hadronization:Ahadic
Hadronization : Hadron_Decays
Analysis : HepMC2
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).

etc etc etc
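The pattern in the log above (display updates repeating with nothing else in between) can be checked mechanically. Here is a minimal Python sketch, assuming the job's console output is available as a list of text lines. The function name `trailing_idle_updates` is made up for illustration and is not part of the Theory application or BOINC.

```python
import re

# Matches lines such as: Display update finished (0 histograms, 0 events).
UPDATE_RE = re.compile(r"Display update finished \((\d+) histograms, (\d+) events\)\.")


def trailing_idle_updates(lines):
    """Count display updates at the end of the log with no other output between them."""
    idle = 0
    for line in lines:
        text = line.strip()
        # Blank lines and the "Updating display..." announcement carry no information.
        if not text or text == "Updating display...":
            continue
        if UPDATE_RE.search(text):
            idle += 1
        else:
            # Any other output (integration progress, processed events) resets the count.
            idle = 0
    return idle
```

A large count is only a hint that the job is stuck. Where to put the threshold is a judgment call, since a slow but healthy job can also print a few updates in a row.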

Ben Segal
Volunteer moderator
Project administrator
Joined: 1 Sep 04
Posts: 84
Credit: 2,579
RAC: 0
Message 27984 - Posted: 28 Nov 2016, 10:26:25 UTC - in response to Message 27978.

Met Peter Skands at CERN last Thursday and Laurence and I told him about looping and failing Theory jobs that we see sometimes. He said that the suite of generator code needs updating and that this is work in progress, but takes a lot of effort.

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,020
RAC: 1,945
Message 27985 - Posted: 28 Nov 2016, 11:08:38 UTC - in response to Message 27984.

Met Peter Skands at CERN last Thursday and Laurence and I told him about looping and failing Theory jobs that we see sometimes. He said that the suite of generator code needs updating and that this is work in progress, but takes a lot of effort.

Thanks Ben. I know the team is looking after it.
But for the 'new' crunchers over here, it's good to know that a Theory job within the VM may sometimes run endlessly and will only be stopped by the maximum BOINC task time of 18 hours.

btw: Peter Up Above for a visit ...

MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 512
Credit: 14,661,078
RAC: 3,859
Message 27998 - Posted: 29 Nov 2016, 12:39:59 UTC

https://indico.cern.ch/event/557400/

the things I do at 4:30 am

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,020
RAC: 1,945
Message 28097 - Posted: 12 Dec 2016, 16:59:46 UTC

Another one:

===> [runRivet] Mon Dec 12 17:09:51 CET 2016 [boinc ee zhad 91.2 - - sherpa 1.4.5 default 80000 756]
.
.
.
Process_Group::CalculateTotalXSec(): Calculate xs for '2_5__e-__e+__j__j__j__j__j' (Comix)
Starting the calculation. Lean back and enjoy ... .
and then 32 times
Exception_Handler::GenerateStackTrace(..): Generating stack trace
{
}

Exception_Handler::SignalHandler: Signal (6) caught.
Cannot continue.
Exception_Handler::GenerateStackTrace(..): Generating stack trace
{
followed by endless
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).

ritterm
Joined: 30 May 08
Posts: 69
Credit: 3,384,910
RAC: 0
Message 28099 - Posted: 13 Dec 2016, 14:34:09 UTC - in response to Message 27985.

Crystal Pellet wrote:
But for the 'new' crunchers over here, it's good to know that a Theory job within the VM may sometimes run endlessly and will only be stopped by the maximum BOINC task time of 18 hours...

I'm not sure this is relevant, but two of the Theory tasks I've completed recently (109863415 and 109863451) show a difference between the run time and CPU time of 9-10 hours. Are these examples of what Crystal Pellet is referring to? Would the project admins/scientists like to have these examples brought to their attention?

Regards,

MarkR

Ben Segal
Volunteer moderator
Project administrator
Joined: 1 Sep 04
Posts: 84
Credit: 2,579
RAC: 0
Message 28101 - Posted: 13 Dec 2016, 14:53:52 UTC - in response to Message 28099.

I will ask them to look at these issues.

Thanks for the input!

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,020
RAC: 1,945
Message 28102 - Posted: 13 Dec 2016, 18:05:08 UTC - in response to Message 28099.

... show a difference between the run time and CPU time of 9-10 hours. Are these examples of what Crystal Pellet is referring to?

In most of the endless loops I've seen, the job uses the normal CPU load, so a big difference between elapsed time and CPU time must have another cause.

With the first task you mentioned (109863415), the VM seems to have had a problem with the 6th job.
I'm not sure what it was, but next time, when you have time for babysitting, you can access the log files inside the VM with BOINC Manager's 'Show graphics'.

The second task you mentioned (109863451) seems to be a different problem, which I have also seen several times.
The VM did not get a new job after the last result was uploaded.
Normally the VM should be stopped after about 10 minutes of idling, but that does not always happen.
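To put a number on the run time versus CPU time comparison discussed here, a small Python sketch follows. The helper name `cpu_gap` and the example figures are hypothetical; the real values would come from the task's result page.

```python
def cpu_gap(run_time_s, cpu_time_s):
    """Return (idle gap in hours, CPU time as a fraction of run time)."""
    gap_hours = (run_time_s - cpu_time_s) / 3600
    efficiency = cpu_time_s / run_time_s
    return gap_hours, efficiency


# Hypothetical task: 15 h elapsed, 5.5 h of CPU time.
gap, eff = cpu_gap(15 * 3600, 5.5 * 3600)
print(f"idle gap: {gap:.1f} h, CPU efficiency: {eff:.0%}")
```

A low efficiency fits the idle-VM cases described above. A job that loops while still using the CPU would show a ratio close to 1.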

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,020
RAC: 1,945
Message 28141 - Posted: 18 Dec 2016, 14:33:39 UTC - in response to Message 28097.

Another one:

===> [runRivet] Sun Dec 18 14:41:25 CET 2016 [boinc ee zhad 197 - - sherpa 1.4.5 default 100000 752]
.
.
.
Process_Group::CalculateTotalXSec(): Calculate xs for '2_5__e-__e+__j__j__j__j__j' (Comix)
Starting the calculation. Lean back and enjoy ... .
and then 32 times
Exception_Handler::GenerateStackTrace(..): Generating stack trace
{
}

Exception_Handler::SignalHandler: Signal (6) caught.
Cannot continue.
Exception_Handler::GenerateStackTrace(..): Generating stack trace
{
followed by endless
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,020
RAC: 1,945
Message 28223 - Posted: 23 Dec 2016, 12:11:05 UTC

I have seen another sherpa with 2000 events looping before. Another one:

===> [runRivet] Fri Dec 23 12:19:07 CET 2016 [boinc ppbar uemb-soft 53 - - sherpa 2.1.0 default 2000 766]
.
.
.
7.60168e+08 pb +- ( 2.32687e+06 pb = 0.3061 % ) 300000 ( 682278 -> 43.4 % )
integration time: ( 9m 50s (9m 30s) elapsed / 19s (19s) left ) [12:40:54]
7.60399e+08 pb +- ( 2.28161e+06 pb = 0.300054 % ) 310000 ( 705049 -> 43.4 % )
integration time: ( 10m 11s (9m 51s) elapsed / 0s (0s) left ) [12:41:16]
2_2__j__j__j__j : 7.60399e+08 pb +- ( 2.28161e+06 pb = 0.300054 % ) exp. eff: 0.396514 %
reduce max for 2_2__j__j__j__j to 0.564752 ( eps = 0.001 )
Output_Phase::Output_Phase(): Set output interval 1000000000 events.
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative : Signal_Processes
Perturbative : Hard_Decays
Perturbative : Jet_Evolution:CSS
Perturbative : Lepton_FS_QED_Corrections:Photons
Perturbative : Multiple_Interactions:Amisic
Perturbative : Minimum_Bias:Off
Hadronization : Beam_Remnants
Hadronization : Hadronization:Ahadic
Hadronization : Hadron_Decays
Analysis : HepMC2
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).

etc etc etc

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,020
RAC: 1,945
Message 28306 - Posted: 2 Jan 2017, 11:02:35 UTC - in response to Message 28223.
Last modified: 7 Jan 2017, 11:08:43 UTC

I have seen another sherpa with 2000 events looping before. Another one:

===> [runRivet] Fri Dec 23 12:19:07 CET 2016 [boinc ppbar uemb-soft 53 - - sherpa 2.1.0 default 2000 766]


...and again:

===> [runRivet] Mon Jan 2 11:13:49 CET 2017 [boinc ppbar uemb-soft 53 - - sherpa 2.1.0 default 2000 774]
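For anyone collecting these reports, the `[runRivet]` header lines can be split into fields with a short Python sketch. The field positions used here (generator, version and event count, counted from the end of the bracket) are inferred from the examples in this thread and are an assumption, not a documented format. The function name is made up for illustration.

```python
import re

# Header shape as it appears in this thread:
# ===> [runRivet] <start time> [boinc ... <generator> <version> <tune> <events> <last field>]
HEADER_RE = re.compile(r"===> \[runRivet\] (?P<started>.+?) \[(?P<fields>[^\]]+)\]")


def parse_runrivet_header(line):
    """Return a dict for a runRivet header line, or None if the line does not match."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    fields = match.group("fields").split()
    if len(fields) < 5:
        return None
    return {
        "started": match.group("started"),
        "generator": fields[-5],   # assumed position, e.g. "sherpa" or "pythia8"
        "version": fields[-4],     # assumed position, e.g. "2.1.0"
        "events": int(fields[-2]), # assumed position, e.g. 2000
        "raw": fields,             # all fields, uninterpreted
    }
```

Grouping parsed headers by `generator` and `version` would make it easier to see which combinations show up repeatedly in the looping reports.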

Crystal Pellet
Volunteer moderator
Volunteer tester
Send message
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,020
RAC: 1,945
Message 29425 - Posted: 20 Mar 2017, 6:31:58 UTC

===> [runRivet] Sun Mar 19 22:07:58 CET 2017 [boinc pp uemb-hard 900 - - pythia8 8.165 default-MBR 100000 832]
.
.
.
Pythia::next(): 65000 events have been generated
65000 events processed
dumping histograms...
65100 events processed
65200 events processed
65300 events processed
65400 events processed
65500 events processed
65600 events processed
65700 events processed
65800 events processed
65900 events processed
Updating display...
Display update finished (6 histograms, 65000 events).
Updating display...
Display update finished (6 histograms, 65000 events).
Updating display...
Display update finished (6 histograms, 65000 events).
Updating display...
Display update finished (6 histograms, 65000 events).

etc etc etc

Juha
Joined: 22 Mar 17
Posts: 24
Credit: 231,948
RAC: 611
Message 30294 - Posted: 11 May 2017, 17:32:37 UTC

Looks like I have one of these.

Condor JobID: 3087024.0 MCPlots JobID: 36498619


===> [runRivet] Thu May 11 11:54:56 EEST 2017 [boinc ee zhad 200 - - sherpa 1.4.5 default 59000 890]


2.71157 pb +- ( 0.0134118 pb = 0.494614 % ) 310000 ( 365433 -> 84.9 % )
integration time: ( 2m 14s(2m 3s) elapsed / 0s(0s) left )
2_4__e-__e+__j__j__j__j : 2.71157 pb +- ( 0.0134118 pb = 0.494614 % ) exp. eff: 0.375051 %
reduce max for 2_4__e-__e+__j__j__j__j to 0.768368 ( eps = 0.001 )
Process_Group::CalculateTotalXSec(): Calculate xs for '2_5__e-__e+__j__j__j__j__j' (Comix)
Starting the calculation. Lean back and enjoy ... .
Exception_Handler::GenerateStackTrace(..): Generating stack trace
{
}
Exception_Handler::SignalHandler: Signal (6) caught.
Cannot continue.
Exception_Handler::GenerateStackTrace(..): Generating stack trace
{
}
Exception_Handler::GenerateStackTrace(..): Generating stack trace
{
}


Repeated multiple times.

Updating display...
Display update finished (0 histograms, 0 events).


Repeated once per minute or so.

Isn't using any CPU any more. Guess I'll just reset the VM.

computezrmle
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,142
RAC: 1,838
Message 30795 - Posted: 15 Jun 2017, 18:06:00 UTC

Moved from here.

I shut down the WU after 15 h walltime via "touch shutdown" in the shared folder.
At shutdown the Sherpa job was running for nearly 13 h.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=145728371

2017-06-14 21:15:09 (28725): Guest Log: [INFO] New Job Starting in slot1
2017-06-14 21:15:09 (28725): Guest Log: [INFO] Condor JobID: 3574607.0 in slot1
2017-06-14 21:15:14 (28725): Guest Log: [INFO] MCPlots JobID: 36820662 in slot1
.
.
.
2017-06-15 10:09:39 (28725): VM Completion File Detected.



The currently running WU (other host) is also a Theory with a walltime of 11 h.
After 12 successful jobs a Sherpa started 0.5 h ago and shows the same output:
Updating display...
Display update finished (0 histograms, 0 events).


https://lhcathome.cern.ch/lhcathome/result.php?resultid=145900136
2017-06-15 19:17:20 (20946): Guest Log: [INFO] New Job Starting in slot1
2017-06-15 19:17:20 (20946): Guest Log: [INFO] Condor JobID: 3548284.0 in slot1
2017-06-15 19:17:25 (20946): Guest Log: [INFO] MCPlots JobID: 35798910 in slot1


I will cancel the WU.
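The "touch shutdown" step mentioned in this post can also be scripted. Here is a minimal Python sketch, assuming you already know the path of the task's shared folder. That path depends on the BOINC installation and slot, so it is passed in as an argument rather than guessed. The helper name `request_shutdown` is hypothetical.

```python
from pathlib import Path


def request_shutdown(shared_dir):
    """Create an empty file named 'shutdown' in shared_dir, like `touch shutdown`."""
    marker = Path(shared_dir) / "shutdown"
    marker.touch()  # creates the file if missing, otherwise only updates its timestamp
    return marker
```

This only creates the marker file. How quickly the VM reacts to it is not covered by the sketch.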

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,020
RAC: 1,945
Message 30881 - Posted: 19 Jun 2017, 16:17:00 UTC

Another one:

===> [runRivet] Mon Jun 19 09:06:07 CEST 2017 [boinc pp jets 7000 10 - sherpa 1.2.2p default 91000 960]
.
.
.
43800 events processed
Event 43900 ( 3h 45m 13s elapsed / 4h 1m 38s left ) -> ETA: Mon Jun 19 19:25
43900 events processed
Display update finished (118 histograms, 43000 events).
Event 44000 ( 3h 45m 36s elapsed / 4h 59s left ) -> ETA: Mon Jun 19 19:25
44000 events processed
dumping histograms...
Event 44100 ( 3h 46m elapsed / 4h 20s left ) -> ETA: Mon Jun 19 19:25
44100 events processed
Updating display...
Event 44200 ( 3h 46m 30s elapsed / 3h 59m 50s left ) -> ETA: Mon Jun 19 19:25
44200 events processed
Display update finished (118 histograms, 44000 events).
Updating display...
Display update finished (118 histograms, 44000 events).
Updating display...
Display update finished (118 histograms, 44000 events).
Updating display...
Display update finished (118 histograms, 44000 events).
Updating display...
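The "left" figures in the progress lines above match a simple linear extrapolation from the events done so far, taking 91000 (from the header) as the total. That is an inference from the numbers in this log, not a statement about how Sherpa computes its ETA. A small Python sketch with made-up helper names:

```python
import re


def hms_to_seconds(text):
    """Convert a duration such as '3h 45m 13s' or '4h 59s' to seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([hms])", text))


def remaining_seconds(elapsed_s, events_done, events_total):
    """Linear extrapolation: the remaining events take as long per event as the finished ones."""
    return elapsed_s * (events_total - events_done) / events_done


# Event 43900 ( 3h 45m 13s elapsed / 4h 1m 38s left )
elapsed = hms_to_seconds("3h 45m 13s")
print(round(remaining_seconds(elapsed, 43900, 91000)), hms_to_seconds("4h 1m 38s"))
```

Adding the elapsed time and the extrapolated remainder gives a rough total for the job, which can then be compared with the 18-hour task limit mentioned earlier in the thread.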

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,142
RAC: 1,838
Message 31124 - Posted: 27 Jun 2017, 4:17:37 UTC

Again a sherpa longrunner with 0 output:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=147979345

Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...

2017-06-26 22:53:41 (11258): Guest Log: [INFO] Condor JobID: 3615916.0 in slot1
2017-06-26 22:53:47 (11258): Guest Log: [INFO] MCPlots JobID: 37338018 in slot1
2017-06-27 06:10:04 (11258): Powering off VM.
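When reporting a case like this, the job IDs and the time span can be pulled out of the stderr lines automatically. The Python sketch below assumes the line layout shown in this post (timestamp, process ID in parentheses, then the message). The function name `summarize` is made up for illustration.

```python
import re
from datetime import datetime

# Example line: 2017-06-26 22:53:41 (11258): Guest Log: [INFO] Condor JobID: 3615916.0 in slot1
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (.*)$")
JOBID_RE = re.compile(r"Guest Log: \[INFO\] (Condor|MCPlots) JobID: (\S+)")


def summarize(lines):
    """Return ({'Condor': id, 'MCPlots': id}, seconds between first and last timestamp)."""
    ids = {}
    stamps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines without a leading timestamp
        stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
        job = JOBID_RE.search(match.group(2))
        if job:
            ids[job.group(1)] = job.group(2)  # a later job overwrites an earlier one
    span = (stamps[-1] - stamps[0]).total_seconds() if stamps else 0.0
    return ids, span
```

For the three stderr lines quoted in this post, the span between the job start and the VM power-off works out to 26183 seconds, about 7 h 16 min.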

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 384
Credit: 2,997,020
RAC: 1,945
Message 31128 - Posted: 27 Jun 2017, 9:11:08 UTC

After the full optimization phase the job did not start processing events.
It only shows "Updating display..." and has not processed any events for over 9 hours.

===> [runRivet] Tue Jun 27 01:02:43 CEST 2017 [boinc ppbar uemb-soft 53 - - sherpa 2.1.0 default 3000 928]
.
.
.
7.62475e+08 pb +- ( 2.51417e+06 pb = 0.329738 % ) 280000 ( 590462 -> 47 % )
integration time: ( 37m 59s (34m 47s) elapsed / 4m 5s (3m 44s) left ) [02:23:11]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
7.62857e+08 pb +- ( 2.41552e+06 pb = 0.316641 % ) 300000 ( 632958 -> 47 % )
integration time: ( 40m 54s (37m 27s) elapsed / 1m 22s (1m 15s) left ) [02:26:06]
Updating display...
Display update finished (0 histograms, 0 events).
7.62407e+08 pb +- ( 2.3686e+06 pb = 0.310674 % ) 310000 ( 654232 -> 47 % )
integration time: ( 42m 22s (38m 48s) elapsed / 0s (0s) left ) [02:27:33]
2_2__j__j__j__j : 7.62407e+08 pb +- ( 2.3686e+06 pb = 0.310674 % ) exp. eff: 0.506582 %
reduce max for 2_2__j__j__j__j to 0.673671 ( eps = 0.001 )
Output_Phase::Output_Phase(): Set output interval 1000000000 events.
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative : Signal_Processes
Perturbative : Hard_Decays
Perturbative : Jet_Evolution:CSS
Perturbative : Lepton_FS_QED_Corrections:Photons
Perturbative : Multiple_Interactions:Amisic
Perturbative : Minimum_Bias:Off
Hadronization : Beam_Remnants
Hadronization : Hadronization:Ahadic
Hadronization : Hadron_Decays
Analysis : HepMC2
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).

rbpeake
Joined: 17 Sep 04
Posts: 55
Credit: 15,620,725
RAC: 3
Message 31168 - Posted: 29 Jun 2017, 0:11:44 UTC

Just want to be sure this is normal SHERPA output:

    -- SHERPA generates events with the following structure --
    ----------------------------------------------------------
    Perturbative : Signal_Processes
    Perturbative : Hard_Decays
    Perturbative : Jet_Evolution:CSS
    Perturbative : Lepton_FS_QED_Corrections:Photons
    Perturbative : Multiple_Interactions:Amisic
    Perturbative : Minimum_Bias:Off
    Hadronization : Beam_Remnants
    Hadronization : Hadronization:Ahadic
    Hadronization : Hadron_Decays
    Analysis : HepMC2
    Updating display...
    Display update finished (0 histograms, 0 events).
    Updating display...
    Display update finished (0 histograms, 0 events).
    Updating display...
    Display update finished (0 histograms, 0 events).
    Updating display...
    Display update finished (0 histograms, 0 events).
    Updating display...
    Display update finished (0 histograms, 0 events).
    Updating display...
    Display update finished (0 histograms, 0 events).
    Updating display...
    Display update finished (0 histograms, 0 events).


and so forth.

Thanks!
____________
Regards,
Bob P.

Ben Segal
Volunteer moderator
Project administrator
Joined: 1 Sep 04
Posts: 84
Credit: 2,579
RAC: 0
Message 31202 - Posted: 30 Jun 2017, 16:28:39 UTC - in response to Message 31168.

The Sherpa scientists have been contacted and are discussing the best way to deal with these cases. They are rare but the system does try to "learn" from their occurrences and improve subsequent parameter choices - so they are not entirely wasted effort on your part.

Thanks for all your inputs on this topic (and all your crunching!)...

computezrmle
Send message
Joined: 15 Jun 08
Posts: 347
Credit: 3,501,142
RAC: 1,838
Message 31203 - Posted: 30 Jun 2017, 17:18:00 UTC

My currently running sherpa shows huge numbers of the following error:

METS_Scale_Setter::SetScales(): Failed to determine \mu.

I'll let it run for the moment, as the number of processed events is slowly increasing, but I doubt that the WU will finish before the 18 h limit.
current sherpa runtime: 309 s
processed events: 22000
estimated runtime: 23.4 h

Job data from stderr.txt:
2017-06-30 13:21:38 (7625): Guest Log: [INFO] New Job Starting in slot1
2017-06-30 13:21:38 (7625): Guest Log: [INFO] Condor JobID: 3831724.0 in slot1
2017-06-30 13:21:44 (7625): Guest Log: [INFO] MCPlots JobID: 37551391 in slot1

Any advice? [cancel|let it run]
