Message boards : Theory Application : How long may Native-Theory-Tasks run

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,576,736
RAC: 5,548
Message 48150 - Posted: 30 May 2023, 9:09:45 UTC
Last modified: 30 May 2023, 9:10:03 UTC

I have opened my Native-Atlas-Clients for Native-Theory and see widely varying runtimes.

Runtimes from 00:20 to 02:45 hours seem to be fine, but sometimes I see runtimes of 20:00 hours or even more, sometimes with 99% CPU usage, sometimes with no CPU usage.

Can I see whether the tasks are alive and doing fine, or should I abort them if they run longer than XX:00 hours?


Supporting BOINC, a great concept !
ID: 48150
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2423
Credit: 227,369,392
RAC: 130,500
Message 48151 - Posted: 30 May 2023, 9:29:44 UTC - in response to Message 48150.  

Open this page:
http://mcplots-dev.cern.ch/production.php?view=control
Follow the link in the "coverage" column of the current revision (currently 2390):
http://mcplots-dev.cern.ch/production.php?view=revision&rev=2390

The page takes a while to load, so be patient (... more patient).
It includes a runtime histogram.

Theory native logs can be checked, e.g. for a task running in slot 0:
.../slots/0/cernvm/shared/runRivet.log

Runtimes can be between a few minutes and a couple of days.
Long runtimes don't necessarily indicate an error.
ID: 48151
maeax

Joined: 2 May 07
Posts: 2120
Credit: 159,924,350
RAC: 80,112
Message 48153 - Posted: 30 May 2023, 10:56:41 UTC - in response to Message 48150.  

Can I see whether the tasks are alive and doing fine, or should I abort them if they run longer than XX:00 hours?


I have one that has now been running for 6 days (41000 of 49000 events finished; the maximum is 10 days for Theory).
ID: 48153
Henry Nebrensky

Joined: 13 Jul 05
Posts: 167
Credit: 14,945,019
RAC: 255
Message 48154 - Posted: 30 May 2023, 12:38:05 UTC - in response to Message 48151.  

Theory native logs can be checked, e.g. for a task running in slot 0:
.../slots/0/cernvm/shared/runRivet.log

If you "head" the runRivet.log, it will tell you the code in use and how many events that specific task is to generate:
[boinc pp jets 8000 170,-,2960 - pythia8 8.301 dire-default 57000 482]
57k in this case. If you then "tail" the log you can see how far it's got and if it's making progress...
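As a rough illustration of the head/tail check described above, the following shell sketch prints the header line and the most recent progress line of one log. The function name `rivet_summary` is made up for this example, and the `events processed` pattern is an assumption based on the log lines quoted in this thread; not every generator prints such lines.

```shell
# Sketch only: show what a Theory task was asked to do and how far it has got.
# Usage: rivet_summary /path/to/runRivet.log   (the path is supplied by the caller)
rivet_summary() {
    log=$1
    # The first line names the generator and the number of events to produce.
    head -n 1 "$log"
    # Assumed progress format "<N> events processed"; print the newest such line.
    grep -E '[0-9]+ events processed' "$log" | tail -n 1
}
```

If the second line of output is missing, the log contains no progress line yet, which on its own does not prove that the task hangs.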
ID: 48154
Aurum
Joined: 12 Jun 18
Posts: 126
Credit: 53,906,164
RAC: 31,876
Message 48576 - Posted: 19 Sep 2023, 11:45:07 UTC - in response to Message 48154.  
Last modified: 19 Sep 2023, 11:51:57 UTC

Theory native logs can be checked, e.g. for a task running in slot 0:
.../slots/0/cernvm/shared/runRivet.log

If you "head" the runRivet.log, it will tell you the code in use and how many events that specific task is to generate:
[boinc pp jets 8000 170,-,2960 - pythia8 8.301 dire-default 57000 482]
57k in this case. If you then "tail" the log you can see how far it's got and if it's making progress...
Thanks for that explanation. That's a lot of work to expect from a BOINC user to decide if the WU will ever finish.
The problem is that I have many where the progress is reported as over 98%, yet the end of the WU's runRivet.log says it has completed 63000 out of 100000 events. That should be displayed to us as 63% progress and not 98.563%.
If progress were reported accurately, folks would let the tasks run. But when they see it seem to stall at over 98% for many hours, they assume something is wrong and abort the WU.
Hopefully CERN will fix this progress reporting bug soon. Expect many aborted tasks in the meantime.

The other problem is that these Theory tasks don't checkpoint. I for one am on Time-of-Use electric service and my electric rate increases 10x during peak hours. If I can't suspend and resume from a checkpoint the task will get aborted when I do a daily TOU shutdown.
ID: 48576
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2423
Credit: 227,369,392
RAC: 130,500
Message 48577 - Posted: 19 Sep 2023, 12:00:43 UTC - in response to Message 48576.  

This is not a bug, hence CERN will never "fix" this.

What you compare is BOINC's progress estimation with the logfile entries of a family of scientific apps.
Most of them, but not all, print the number of processed events to the logfile.


Since the majority of Theory tasks finish within a couple of hours or even faster, the best you can do is to be patient.
ID: 48577
Aurum
Joined: 12 Jun 18
Posts: 126
Credit: 53,906,164
RAC: 31,876
Message 48579 - Posted: 19 Sep 2023, 12:15:49 UTC - in response to Message 48577.  

It is a bug, a thoughtless, inconsiderate bug that could be fixed.
Patience would be idiotic and wasteful. You clearly did not understand my comments about wasting expensive electricity.
ID: 48579
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2423
Credit: 227,369,392
RAC: 130,500
Message 48580 - Posted: 19 Sep 2023, 12:31:50 UTC - in response to Message 48579.  

As said:
CERN will not solve this.

If you still think it is a bug, then clearly describe it and open an issue on GitHub.


Besides that, it would be easy to run a one-liner like this:
find /your/boinc/working/dir/slots -type f -name "runRivet.log" -mmin +180 |xargs -I {} ls -hal {}

This prints all candidates where Theory did not update runRivet.log within the last 180 min (=> might hang).
Now inspect just those candidates.

A few lines more and it tests the whole server farm from your desktop.
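One possible way to combine the one-liner with the inspection step is sketched below: for every stale log it prints the path and the last line that was written. The function name `stale_rivet_logs` is hypothetical, the slots directory is passed in by the caller, and `-mmin` is assumed to be supported by the local `find` (it is in GNU and BSD find, but it is not part of POSIX).

```shell
# Sketch only: list runRivet.log files not modified for more than N minutes
# and show the last line of each, so that possible hangs can be inspected.
# Usage: stale_rivet_logs /your/boinc/working/dir/slots [minutes]
stale_rivet_logs() {
    slots_dir=$1
    minutes=${2:-180}          # same 180 minute threshold as in the one-liner
    find "$slots_dir" -type f -name 'runRivet.log' -mmin +"$minutes" |
    while IFS= read -r log; do
        printf '%s\n' "== $log"
        tail -n 1 "$log"
    done
}
```

A log listed here is only a candidate. Tasks in a long setup phase can also leave the log untouched for hours.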
ID: 48580
Dark Angel
Joined: 7 Aug 11
Posts: 93
Credit: 21,874,936
RAC: 17,795
Message 49861 - Posted: 2 Apr 2024, 6:54:07 UTC
Last modified: 2 Apr 2024, 6:54:40 UTC

I have one Theory unit that's been running for three and a half days. I just left it to do its thing. Today I got curious, and this is in the runRivet.log:

HERWIGPP=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt
Run herwig++ 2.5.1 ...
generatorExecString = /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt/bin/Herwig++ read -r /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt/share/Herwig++/HerwigDefaults.rpo /shared/tmp/tmp.IPuslKhFRO/generator.params
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>> ThePEG - Toolkit for HEP Event Generation - version 1.7.1 <<<<<<<<<<
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
No more warnings of this kind will be reported.

There is nothing after this.

The work unit name is Theory_2743-2857700-30

This looks like a dead unit to me, but I'm hardly an expert. I've been very careful to not pause or restart the unit in any way, and I haven't been fiddling about with system installed packages or filesystems lately, so I don't know what might cause this. Other units are completing successfully while this one just sits there.

Should I just let it run or kill it?
ID: 49861
maeax

Joined: 2 May 07
Posts: 2120
Credit: 159,924,350
RAC: 80,112
Message 49863 - Posted: 2 Apr 2024, 7:28:10 UTC - in response to Message 49861.  

It's like fog. You can cancel it or wait for the hard stop after 10 days ;-)
ID: 49863
Dark Angel
Joined: 7 Aug 11
Posts: 93
Credit: 21,874,936
RAC: 17,795
Message 49864 - Posted: 2 Apr 2024, 7:32:42 UTC - in response to Message 49863.  

I don't follow.

Other Theory tasks I have running are processing events normally and showing them in their respective logs but this one is different.

Is this indicative of a failure and I should abort the task or is this just another normal variation I haven't happened to see before?
ID: 49864
maeax

Joined: 2 May 07
Posts: 2120
Credit: 159,924,350
RAC: 80,112
Message 49867 - Posted: 2 Apr 2024, 8:04:54 UTC - in response to Message 49864.  
Last modified: 2 Apr 2024, 8:06:14 UTC

http://mcplots-dev.cern.ch/production.php?view=revision&rev=2743
Theory has hundreds of working tasks with difficult working parameters.
mcplots-dev must be started anew from the default homepage because of the revision.
ID: 49867
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2423
Credit: 227,369,392
RAC: 130,500
Message 49868 - Posted: 2 Apr 2024, 8:47:36 UTC - in response to Message 49861.  

This works for most (but not all!) Theory native tasks:

1. Get the "last modified time" of runRivet.log.

2. Check the 1st line of runRivet.log.
There you find the starting time (here: Sun Mar 31 17:37:53 UTC 2024) and the number of events to be processed (here: 100000).
===> [runRivet] Sun Mar 31 17:37:53 UTC 2024 [boinc pp jets 8000 100 - pythia8 8.212 tune-AU2ct10 100000 34]

3. Locate the last line that looks something like this:
74100 events processed

4. Calculate the estimated remaining time from those values.
Ignore the BOINC progress estimation. It can't look into the logs.


If there are no "processed" lines at all or no new lines for many hours, then the task most likely got stuck.
=> abort it


Pitfalls:
- most but not all tasks run 100000 events
- certain tasks run through a very long setup phase to configure the environment.
=> you will see no "processed" lines for many hours, but then they appear rapidly
- in rare cases you get a scientific app that does not even print a single "processed" line.
Those logs look different to most regular ones.
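The four steps above could be scripted roughly as follows. This is a sketch under several assumptions: all function names are invented for this example, the event count is taken to be the second-to-last number inside the brackets of the first line, progress lines are assumed to contain "<N> events processed", the event rate is treated as constant, and `rivet_estimate` relies on GNU `date -d` and GNU `stat -c`. The pitfalls listed above apply unchanged.

```shell
# Step 2: number of events to process, e.g. "... tune-AU2ct10 100000 34]" -> 100000.
rivet_total_events() {
    head -n 1 "$1" | sed -E 's/.* ([0-9]+) [0-9]+\] *$/\1/'
}

# Step 3: events done so far, taken from the last "<N> events processed" line.
# Prints nothing if the log has no such line.
rivet_done_events() {
    grep -oE '[0-9]+ events processed' "$1" | tail -n 1 | cut -d ' ' -f 1
}

# Step 4: remaining seconds at a constant event rate:
#   remaining = elapsed * (total - done) / done
rivet_remaining_seconds() {
    elapsed=$1 total=$2 done_events=$3
    echo $(( elapsed * (total - done_events) / done_events ))
}

# Steps 1 to 4 combined for one log file (assumes GNU date and GNU stat).
rivet_estimate() {
    log=$1
    # Start time: the text between "[runRivet] " and the opening " [" of the task description.
    start_str=$(head -n 1 "$log" | sed -E 's/.*\[runRivet\] (.*) \[.*/\1/')
    start=$(date -u -d "$start_str" +%s) || return 1
    last=$(stat -c %Y "$log") || return 1      # step 1: last modified time
    total=$(rivet_total_events "$log")
    done_events=$(rivet_done_events "$log")
    if [ -z "$done_events" ] || [ "$done_events" -eq 0 ]; then
        echo "no 'events processed' line found; cannot estimate"
        return 1
    fi
    remaining=$(rivet_remaining_seconds $(( last - start )) "$total" "$done_events")
    echo "$done_events of $total events, roughly $(( remaining / 3600 )) h remaining"
}
```

Using the last modified time instead of the current time keeps the estimate tied to the moment the last progress line was written. If that time is already many hours old, the estimate is meaningless and the task should be checked for a hang instead.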
ID: 49868
Dark Angel
Joined: 7 Aug 11
Posts: 93
Credit: 21,874,936
RAC: 17,795
Message 49869 - Posted: 2 Apr 2024, 8:51:06 UTC - in response to Message 49868.  

This is the complete log file:

===> [runRivet] Fri Mar 29 15:11:45 UTC 2024 [boinc pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760 100000 30]

Setting environment...
INFO: uname:
Linux runc 6.5.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
INFO: /etc/redhat-release:
cat: /etc/redhat-release: No such file or directory

MCGENERATORS=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators
g++ = /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-8a51a/x86_64-centos7/bin/g++
g++ version = 11.2.0
RIVET=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt
YODA=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/yoda/1.9.10/x86_64-centos7-gcc11-opt
Rivet version = rivet v3.1.10
RIVET_ANALYSIS_PATH=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt/lib/Rivet:/shared/analyses
RIVET_DATA_PATH=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt/share/Rivet:/shared/analyses
GSL=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/GSL/2.7/x86_64-centos7-gcc11-opt
HEPMC=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/HepMC/2.06.11/x86_64-centos7-gcc11-opt
FASTJET=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/fastjet/3.4.1/x86_64-centos7-gcc11-opt
PYTHON=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/Python/3.9.12/x86_64-centos7-gcc11-opt

Input parameters:
mode=boinc
beam=pp
process=z1j
energy=8000
params=-
specific=-
generator=herwig++
version=2.5.1
tune=LHC-UE-EE-2-2760
nevts=100000
seed=30

Prepare temporary directories and files ...
workd=/shared
tmpd=/shared/tmp/tmp.IPuslKhFRO
tmp_params=/shared/tmp/tmp.IPuslKhFRO/generator.params
tmp_hepmc=/shared/tmp/tmp.IPuslKhFRO/generator.hepmc
tmp_yoda=/shared/tmp/tmp.IPuslKhFRO/generator.yoda
tmp_jobs=/shared/tmp/tmp.IPuslKhFRO/jobs.log
tmpd_flat=/shared/tmp/tmp.IPuslKhFRO/flat
tmpd_dump=/shared/tmp/tmp.IPuslKhFRO/dump
tmpd_rivetdb=/shared/tmp/tmp.IPuslKhFRO/rivetdb.map

Prepare Rivet parameters ...
Total histograms selected: 1
analysesNames=ATLAS_2019_I1744201
Total analyses selected: 1
analysesBaseNames=ATLAS_2019_I1744201
Total base analyses selected: 1

Unpack data histograms...
dataFiles =
/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt/share/Rivet/ATLAS_2019_I1744201.yoda.gz
output = /shared/tmp/tmp.IPuslKhFRO/flat
make: Entering directory `/shared/rivetvm'
g++ yoda2flat-split.cc -o yoda2flat-split.exe -Wfatal-errors -Wl,-rpath /cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/yoda/1.9.10/x86_64-centos7-gcc11-opt/lib `/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/yoda/1.9.10/x86_64-centos7-gcc11-opt/bin/yoda-config --cppflags --libs`
make: Leaving directory `/shared/rivetvm'

Total histograms unpacked=20 / selected=1
complete ./REF_ATLAS_2019_I1744201_d02-x01-y01.dat

Building rivetvm ...
make: Entering directory `/shared/rivetvm'
g++ rivetvm.cc -o rivetvm.exe -DNDEBUG -Wfatal-errors -Wl,-rpath /cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt/lib -Wl,-rpath /cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/HepMC/2.06.11/x86_64-centos7-gcc11-opt/lib `/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt/bin/rivet-config --cppflags --ldflags --libs` -lHepMC
make: Leaving directory `/shared/rivetvm'

Run herwig++ 2.5.1 and Rivet ...
generatorExecString = ./rungen.sh boinc pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760 100000 30 /shared/tmp/tmp.IPuslKhFRO/generator.hepmc
rivetExecString = /shared/rivetvm/rivetvm.exe -a ATLAS_2019_I1744201 -i /shared/tmp/tmp.IPuslKhFRO/generator.hepmc -o /shared/tmp/tmp.IPuslKhFRO/flat -H /shared/tmp/tmp.IPuslKhFRO/generator.yoda -d /shared/tmp/tmp.IPuslKhFRO/dump
INFO: (display) T4T_DISPLAY=
INFO: (display) datdir=/shared/tmp/tmp.IPuslKhFRO/dump
INFO: (display) vars=pp z1j 8000 - herwig++ 2.5.1 LHC-UE-EE-2-2760
INFO: display service switched off
===> [rungen] Fri Mar 29 15:11:56 UTC 2024 [boinc pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760 100000 30 /shared/tmp/tmp.IPuslKhFRO/generator.hepmc]

Setting environment for herwig++ 2.5.1 ...
tree = hepmc2.06.05
tag =

grep: /etc/redhat-release: No such file or directory
MCGENERATORS=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05
LCG_PLATFORM=x86_64-slc5-gcc43-opt
g++ = /shared/tmp/tmp.eldt4Q5K7G/g++
g++ version = 4.3.6
g++ orig = /cvmfs/sft.cern.ch/lcg/external/gcc/4.3.6/x86_64-slc5/bin/g++
AGILE=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/agile/1.4.0/x86_64-slc5-gcc43-opt
HEPMC=/cvmfs/sft.cern.ch/lcg/external/HepMC/2.06.05/x86_64-slc5-gcc43-opt
AGILE_GEN_PATH=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05
LHAPDF=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/lhapdf/5.8.9/x86_64-slc5-gcc43-opt

grep: /etc/redhat-release: No such file or directory
INFO: EL9/CC7 compat: herwig++ - added work-around for missing libraries:
-rwxr-xr-x 1 0 0 7504 Mar 29 15:11 empty.so
lrwxrwxrwx 1 0 0 8 Mar 29 15:11 libreadline.so.5 -> empty.so
lrwxrwxrwx 1 0 0 8 Mar 29 15:11 libtermcap.so.2 -> empty.so
/shared

Input parameters:
mode=boinc
beam=pp
process=z1j
energy=8000
params=-
specific=-
generator=herwig++
version=2.5.1
tune=LHC-UE-EE-2-2760
nevts=100000
seed=30
outfile=/shared/tmp/tmp.IPuslKhFRO/generator.hepmc

Prepare temporary directories and files ...
workd=/shared
tmpd=/shared/tmp/tmp.IPuslKhFRO
tmp_params=/shared/tmp/tmp.IPuslKhFRO/generator.params

Decoding parameters of generator...
pTmin = 0
pTmax = 8000
mHatMin = 0
mHatMax = 8000

processCode=z1j

beam1=p+
beam2=p+
beam energy = 4000.
INFO: steering file template = configuration/herwig++-z1j.params
INFO: cache is not active, CACHE=
Prepare herwig++ 2.5.1 parameters ...
=> /shared/tmp/tmp.IPuslKhFRO/generator.params :
# based on example from Herwig++ 2.4.2 distribution:
# share/Herwig++/TVT.in

# Run options:
cd /Herwig/Generators
set LHCGenerator:NumberOfEvents 100000
set LHCGenerator:RandomNumberGenerator:Seed 30
set LHCGenerator:DebugLevel 0
set LHCGenerator:PrintEvent 1
set LHCGenerator:MaxErrors 100000

# redirect all log output to stdout
set LHCGenerator:UseStdout true

# do output to a HepMC file
cd /Herwig/Generators
insert LHCGenerator:AnalysisHandlers 0 /Herwig/Analysis/HepMCFile
set /Herwig/Analysis/HepMCFile:PrintEvent 1000000
set /Herwig/Analysis/HepMCFile:Format GenEvent
set /Herwig/Analysis/HepMCFile:Filename /shared/tmp/tmp.IPuslKhFRO/generator.hepmc
# set /Herwig/Analysis/HepMCFile:Units GeV_mm


# Beam parameters:
set LHCGenerator:EventHandler:LuminosityFunction:Energy 8000
set LHCGenerator:EventHandler:BeamA /Herwig/Particles/p+
set LHCGenerator:EventHandler:BeamB /Herwig/Particles/p+
set LHCGenerator:MaxErrors -1


# Process setup
# Z+1jet production
cd /Herwig/MatrixElements
insert SimpleQCD:MatrixElements[0] MEZJet
DISABLEREADONLY
newdef MEZJet:ZDecay ChargedLeptons

## Set cuts
## Use this for hard leading-jets in a certain pT window
set /Herwig/Cuts/JetKtCut:MinKT 0*GeV # minimum jet pT
set /Herwig/Cuts/JetKtCut:MaxKT 8000*GeV # maximum jet pT
#
## Use this for a certain mHat window
#set /Herwig/Cuts/QCDCuts:MHatMin 0*GeV # minimum jet mHat
#set /Herwig/Cuts/QCDCuts:MHatMax 8000*GeV # maximum jet mHat


# Make particles with c*tau > 10 mm stable:
set /Herwig/Decays/DecayHandler:MaxLifeTime 10*mm
set /Herwig/Decays/DecayHandler:LifeTimeOption Average


# tune 'LHC-UE-EE-2-2760' parameters: -------------------
#%tuneFile%
# Based on LHC tune example from Herwig++ 2.5.1 distribution
# share/Herwig++/LHC-UE-EE-2.in

##################################################
# Override default MPI parameters
##################################################


# Colour reconnection settings
set /Herwig/Hadronization/ColourReconnector:ColourReconnection Yes
set /Herwig/Hadronization/ColourReconnector:ReconnectionProbability 0.55

# Colour Disrupt settings
set /Herwig/Partons/RemnantDecayer:colourDisrupt 0.15

# inverse hadron radius
set /Herwig/UnderlyingEvent/MPIHandler:InvRadius 1.1
## for \sqrt(s) = 2760 GeV
# Min KT parameter
set /Herwig/UnderlyingEvent/KtCut:MinKT 3.31
# This should always be 2*MinKT!!
set /Herwig/UnderlyingEvent/UECuts:MHatMin 6.62


# MPI model settings
set /Herwig/UnderlyingEvent/MPIHandler:softInt Yes
set /Herwig/UnderlyingEvent/MPIHandler:twoComp Yes
set /Herwig/UnderlyingEvent/MPIHandler:DLmode 3

# ---------------------------------------------


set /Herwig/UnderlyingEvent/MPIHandler:IdenticalToUE -1

# Run generator
cd /Herwig/Generators
run TVT LHCGenerator
--------------------------------------

HERWIGPP=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt
Run herwig++ 2.5.1 ...
generatorExecString = /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt/bin/Herwig++ read -r /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt/share/Herwig++/HerwigDefaults.rpo /shared/tmp/tmp.IPuslKhFRO/generator.params
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>> ThePEG - Toolkit for HEP Event Generation - version 1.7.1 <<<<<<<<<<
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
No more warnings of this kind will be reported.

It appears to have never got the first event running for some reason.
ID: 49869
maeax

Joined: 2 May 07
Posts: 2120
Credit: 159,924,350
RAC: 80,112
Message 49871 - Posted: 2 Apr 2024, 9:20:01 UTC - in response to Message 49869.  

[boinc pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760 100000 30]
Have you searched mcplots for this task?
Is it successful for other volunteers?
ID: 49871
Dark Angel
Joined: 7 Aug 11
Posts: 93
Credit: 21,874,936
RAC: 17,795
Message 49872 - Posted: 2 Apr 2024, 9:40:42 UTC - in response to Message 49871.  

Keyword: pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760 (matched 1 of 202704 rows)
run                                              events  attempts  success  failure  unknown
pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760  0       1         0        0        1

It appears nobody else has run this.
ID: 49872
Dark Angel
Joined: 7 Aug 11
Posts: 93
Credit: 21,874,936
RAC: 17,795
Message 49873 - Posted: 2 Apr 2024, 9:52:25 UTC

Ok, poking around I checked the stderr.txt and found this:

07:33:48 AEDT +11:00 2024-04-01: cranky-0.1.4: [INFO] Pausing container Theory_2743-2857700-30_0.

Apparently something DID cause it to pause at some point, and I do not have resume capability (wrong sudo version; I tried installing the latest version, but it crashed every unit that ran from then on, so I rolled it back). Since that makes it likely that I'm the one who broke it, I've aborted the unit.
ID: 49873
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2423
Credit: 227,369,392
RAC: 130,500
Message 49874 - Posted: 2 Apr 2024, 9:53:00 UTC - in response to Message 49869.  

A simple estimation:
It started Fri Mar 29 15:11:45 UTC 2024 and hasn't processed even a single event.
By now it should have processed >35000 events to be on track to finish before the 10-day deadline.


So, does it appear to hang?
Yes, of course it hangs.
=> cancel it.
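The arithmetic behind this estimate can be written down as a short shell sketch. The function name `events_needed_by_now` is invented here, and a constant event rate is assumed. Taking the posting time of this message (2 Apr 2024, 9:53:00 UTC) as "now", 326475 seconds have passed since the start, which is about 37.8% of the 10 days, so roughly 37786 of the 100000 events should be done, in line with the >35000 quoted above.

```shell
# Sketch only: how many events should be finished by now so that `total`
# events complete within the deadline, assuming a constant event rate.
# Arguments: start and now as seconds since the epoch, total events,
# and optionally the deadline in days (default 10).
events_needed_by_now() {
    start=$1 now=$2 total=$3 deadline_days=${4:-10}
    echo $(( (now - start) * total / (deadline_days * 86400) ))
}
```

With GNU date, which is an assumption about the local system, the epoch values can be obtained with `date -u -d 'Fri Mar 29 15:11:45 UTC 2024' +%s` for the start and `date -u +%s` for the current time.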
ID: 49874
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2423
Credit: 227,369,392
RAC: 130,500
Message 49875 - Posted: 2 Apr 2024, 10:01:51 UTC - in response to Message 49873.  

Right.
Your system can't use the modern cgroups v2 method (sudo is not recent enough), nor is it fully configured to use pause/resume with the old cgroups v1 based method.
The log entry just shows that cranky got a pause signal from BOINC.
ID: 49875
Dark Angel
Joined: 7 Aug 11
Posts: 93
Credit: 21,874,936
RAC: 17,795
Message 49876 - Posted: 2 Apr 2024, 10:03:24 UTC - in response to Message 49874.  

As I said, it's been aborted now.
I had only left it alone because I've had others that ran long but were otherwise working normally.
I don't make a habit of digging through workunit logs without cause.
ID: 49876



©2024 CERN