How long may Native-Theory-Tasks run
Author | Message |
---|---|
Send message Joined: 2 Sep 04 Posts: 455 Credit: 200,201,777 RAC: 46,519 |
I have opened my Native-Atlas clients for Native-Theory and see widely varying runtimes. Runtimes from 00:20 to 02:45 hours seem to be fine, but sometimes I see runtimes of 20:00 hours or even more, sometimes with 99% CPU usage, sometimes with none. Can I see whether the tasks are alive and doing fine, or should I abort them if they run longer than XX:00 hours? Supporting BOINC, a great concept! |
Send message Joined: 15 Jun 08 Posts: 2520 Credit: 251,911,354 RAC: 128,284 |
Open this page: http://mcplots-dev.cern.ch/production.php?view=control Follow the link in the "coverage" column of the current revision (currently 2390): http://mcplots-dev.cern.ch/production.php?view=revision&rev=2390 It takes a while to load, so be patient (... more patient). The page you get includes a runtime histogram. Theory native logs can be checked, e.g. for a task running in slot 0: .../slots/0/cernvm/shared/runRivet.log Runtimes can be between a few minutes and a couple of days. Long runtimes don't necessarily indicate an error. |
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,797,371 RAC: 18,407 |
Can I see, if the tasks are alive and doing fine or should I abort them if longer than XX:00 Hours ? I have one that has now been running for 6 days (41,000 of 49,000 events finished; the maximum is 10 days for Theory). |
Send message Joined: 13 Jul 05 Posts: 169 Credit: 15,000,737 RAC: 248 |
Theory native logs can be checked, e.g for a task running in slot 0 If you "head" the runRivet.log, it will tell you the code in use and how many events that specific task is to generate: [boinc pp jets 8000 170,-,2960 - pythia8 8.301 dire-default 57000 482] 57k in this case. If you then "tail" the log you can see how far it's got and if it's making progress... |
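For anyone who would rather script the "head" check, here is a minimal Python sketch. It assumes the bracketed block always carries the eleven fields that runRivet.log itself lists under "Input parameters" (mode, beam, process, energy, params, specific, generator, version, tune, nevts, seed). The names `parse_header` and `TaskHeader` are made up for this example and are not part of any Theory or BOINC tool.

```python
import re
from typing import NamedTuple

# Field order as printed on the "Input parameters" line of runRivet.log.
FIELDS = ("mode", "beam", "process", "energy", "params", "specific",
          "generator", "version", "tune", "nevts", "seed")


class TaskHeader(NamedTuple):
    generator: str
    version: str
    nevts: int


def parse_header(line: str) -> TaskHeader:
    """Pull generator, version and event count out of the first log line."""
    match = re.search(r"\[(boinc [^\]]*)\]", line)
    if match is None:
        raise ValueError("no [boinc ...] block found in line")
    values = match.group(1).split()
    if len(values) != len(FIELDS):
        # Assumption violated: this log uses a different header layout.
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    fields = dict(zip(FIELDS, values))
    return TaskHeader(fields["generator"], fields["version"], int(fields["nevts"]))


header = parse_header(
    "[boinc pp jets 8000 170,-,2960 - pythia8 8.301 dire-default 57000 482]"
)
print(header.generator, header.version, header.nevts)  # pythia8 8.301 57000
```

If a log uses a different header layout, the function raises `ValueError` rather than guessing which field is the event count.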
Send message Joined: 12 Jun 18 Posts: 126 Credit: 53,906,164 RAC: 0 |
Thanks for that explanation. That's a lot of work to expect from a BOINC user just to decide whether the WU will ever finish. Theory native logs can be checked, e.g for a task running in slot 0 The problem is that I have many where the progress is reported as over 98%, yet the end of the WU's runRivet.log says it has completed 63000 out of 100000 events. That should be displayed to us as 63% progress and not 98.563%. If progress were reported accurately, folks would let the tasks run. But when they see a task seem to stall at over 98% for many hours, they assume something is wrong and abort the WU. Hopefully CERN will fix this progress reporting bug soon. Expect many aborted tasks in the meantime. The other problem is that these Theory tasks don't checkpoint. I for one am on Time-of-Use electric service, and my electric rate increases 10x during peak hours. If I can't suspend and resume from a checkpoint, the task will get aborted when I do a daily TOU shutdown. |
Send message Joined: 15 Jun 08 Posts: 2520 Credit: 251,911,354 RAC: 128,284 |
This is not a bug, hence CERN will never "fix" this. You are comparing BOINC's progress estimate with the logfile entries of a family of scientific apps. Most of them, but not all, print the number of processed events to the logfile. Since the majority of Theory tasks finish within a couple of hours or even faster, the best you can do is to be patient. |
Send message Joined: 12 Jun 18 Posts: 126 Credit: 53,906,164 RAC: 0 |
It is a bug, a thoughtless, inconsiderate bug that could be fixed. Patience would be idiotic and wasteful. You clearly did not understand my comments about wasting expensive electricity. |
Send message Joined: 15 Jun 08 Posts: 2520 Credit: 251,911,354 RAC: 128,284 |
As said: CERN will not solve this. If you still think it is a bug, then describe it clearly and open an issue on GitHub. Besides that, it would be easy to run a one-liner like this: find /your/boinc/working/dir/slots -type f -name "runRivet.log" -mmin +180 |xargs -I {} ls -hal {} This prints all candidates where Theory did not update runRivet.log within the last 180 min (=> might hang). Now inspect just those candidates. With a few more lines it can check the whole server farm from your desktop. |
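The same staleness check can be sketched in Python for anyone who prefers a script to a shell one-liner. This is only an approximation of the `find ... -mmin +180` call above, which rounds file ages to whole minutes. The function name `stale_logs` and its parameters are invented for this example, and the slots path is the same placeholder used in the one-liner.

```python
import time
from pathlib import Path


def stale_logs(slots_dir, max_age_minutes=180, now=None):
    """Return runRivet.log files not modified within max_age_minutes.

    `now` is a Unix timestamp; it defaults to the current time and is
    only a parameter so the function can be tested deterministically.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    return sorted(
        path
        for path in Path(slots_dir).rglob("runRivet.log")
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

Calling `stale_logs("/your/boinc/working/dir/slots")` returns the candidate paths. As with the one-liner, a hit only means the log has been quiet, so each candidate still needs a look before it is aborted.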
Send message Joined: 7 Aug 11 Posts: 95 Credit: 24,473,841 RAC: 29,209 |
I have one Theory unit that's been running for three and a half days. I just left it to do it's thing. Today I got curious and this is in the runRivet.log HERWIGPP=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt Run herwig++ 2.5.1 ... generatorExecString = /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt/bin/Herwig++ read -r /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt/share/Herwig++/HerwigDefaults.rpo /shared/tmp/tmp.IPuslKhFRO/generator.params >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> ThePEG - Toolkit for HEP Event Generation - version 1.7.1 <<<<<<<<<< <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. 
** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. No more warnings of this kind will be reported. There is nothing after this. The work unit name is Theory_2743-2857700-30. This looks like a dead unit to me, but I'm hardly an expert. I've been very careful not to pause or restart the unit in any way, and I haven't been fiddling about with system-installed packages or filesystems lately, so I don't know what might cause this. Other units are completing successfully while this one just sits there. Should I just let it run or kill it? |
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,797,371 RAC: 18,407 |
It's like fog. You can cancel it or wait for the hard stop after 10 days ;-) |
Send message Joined: 7 Aug 11 Posts: 95 Credit: 24,473,841 RAC: 29,209 |
I don't follow. Other Theory tasks I have running are processing events normally and showing them in their respective logs but this one is different. Is this indicative of a failure and I should abort the task or is this just another normal variation I haven't happened to see before? |
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,797,371 RAC: 18,407 |
http://mcplots-dev.cern.ch/production.php?view=revision&rev=2743 Theory has hundreds of working tasks with difficult working parameters. mcplots-dev has to be opened anew from its default homepage because of the revision. |
Send message Joined: 15 Jun 08 Posts: 2520 Credit: 251,911,354 RAC: 128,284 |
This works for most (but not all!) Theory native tasks: 1. Get the "last modified time" of runRivet.log. 2. Check the 1st line of runRivet.log. There you find the starting time and the number of events to be processed (the second-to-last value in the brackets, 100000 in this example). ===> [runRivet] Sun Mar 31 17:37:53 UTC 2024 [boinc pp jets 8000 100 - pythia8 8.212 tune-AU2ct10 100000 34] 3. Locate the last line that looks something like this: 74100 events processed 4. Calculate the estimated remaining time from those values. Ignore the BOINC progress estimate. It can't look into the logs. If there are no "processed" lines at all, or no new lines for many hours, then the task has most likely got stuck. => abort it Pitfalls: - most but not all tasks run 100000 events - certain tasks run through a very long setup phase to configure the environment. => you will see no "processed" lines for many hours, but then they appear rapidly - in rare cases you get a scientific app that does not print a single "processed" line. Those logs look different from most regular ones. |
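The four steps above can be sketched in Python as follows. This is a rough linear estimate, not anything CERN ships: it assumes the header format shown in step 2, "N events processed" lines as in step 3, and a constant event rate since the start. That last assumption is exactly what the long-setup pitfall breaks. The name `estimate_remaining` and the regular expressions are made up for this example.

```python
import re
from datetime import datetime, timezone

# First log line: "===> [runRivet] <date> UTC <year> [boinc ... nevts seed]"
HEADER = re.compile(
    r"\[runRivet\] (\w{3} \w{3} +\d+ \d\d:\d\d:\d\d) UTC (\d{4}) \[boinc ([^\]]*)\]"
)
PROCESSED = re.compile(r"(\d+) events processed")


def estimate_remaining(log_text, last_modified):
    """Return the estimated remaining runtime as a timedelta.

    `last_modified` is the log's modification time as a timezone-aware
    datetime. Returns None when the header or any "processed" line is
    missing, i.e. when the log gives nothing to extrapolate from.
    """
    header = HEADER.search(log_text)
    if header is None:
        return None
    # %a and %b rely on English day/month names (the default C locale).
    started = datetime.strptime(
        f"{header.group(1)} {header.group(2)}", "%a %b %d %H:%M:%S %Y"
    ).replace(tzinfo=timezone.utc)
    total = int(header.group(3).split()[-2])  # second-to-last field = nevts
    counts = PROCESSED.findall(log_text)
    if not counts or int(counts[-1]) == 0:
        return None
    done = int(counts[-1])
    elapsed = last_modified - started
    return elapsed * (total - done) / done
```

A `None` result corresponds to the "no processed lines" case in the list, so it should be read together with the log's age rather than treated as proof of a hang.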
Send message Joined: 7 Aug 11 Posts: 95 Credit: 24,473,841 RAC: 29,209 |
This is the complete log file: ===> [runRivet] Fri Mar 29 15:11:45 UTC 2024 [boinc pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760 100000 30] Setting environment... INFO: uname: Linux runc 6.5.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2 x86_64 x86_64 x86_64 GNU/Linux INFO: /etc/redhat-release: cat: /etc/redhat-release: No such file or directory MCGENERATORS=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators g++ = /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-8a51a/x86_64-centos7/bin/g++ g++ version = 11.2.0 RIVET=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt YODA=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/yoda/1.9.10/x86_64-centos7-gcc11-opt Rivet version = rivet v3.1.10 RIVET_ANALYSIS_PATH=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt/lib/Rivet:/shared/analyses RIVET_DATA_PATH=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt/share/Rivet:/shared/analyses GSL=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/GSL/2.7/x86_64-centos7-gcc11-opt HEPMC=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/HepMC/2.06.11/x86_64-centos7-gcc11-opt FASTJET=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/fastjet/3.4.1/x86_64-centos7-gcc11-opt PYTHON=/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/Python/3.9.12/x86_64-centos7-gcc11-opt Input parameters: mode=boinc beam=pp process=z1j energy=8000 params=- specific=- generator=herwig++ version=2.5.1 tune=LHC-UE-EE-2-2760 nevts=100000 seed=30 Prepare temporary directories and files ... 
workd=/shared tmpd=/shared/tmp/tmp.IPuslKhFRO tmp_params=/shared/tmp/tmp.IPuslKhFRO/generator.params tmp_hepmc=/shared/tmp/tmp.IPuslKhFRO/generator.hepmc tmp_yoda=/shared/tmp/tmp.IPuslKhFRO/generator.yoda tmp_jobs=/shared/tmp/tmp.IPuslKhFRO/jobs.log tmpd_flat=/shared/tmp/tmp.IPuslKhFRO/flat tmpd_dump=/shared/tmp/tmp.IPuslKhFRO/dump tmpd_rivetdb=/shared/tmp/tmp.IPuslKhFRO/rivetdb.map Prepare Rivet parameters ... Total histograms selected: 1 analysesNames=ATLAS_2019_I1744201 Total analyses selected: 1 analysesBaseNames=ATLAS_2019_I1744201 Total base analyses selected: 1 Unpack data histograms... dataFiles = /cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt/share/Rivet/ATLAS_2019_I1744201.yoda.gz output = /shared/tmp/tmp.IPuslKhFRO/flat make: Entering directory `/shared/rivetvm' g++ yoda2flat-split.cc -o yoda2flat-split.exe -Wfatal-errors -Wl,-rpath /cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/yoda/1.9.10/x86_64-centos7-gcc11-opt/lib `/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/yoda/1.9.10/x86_64-centos7-gcc11-opt/bin/yoda-config --cppflags --libs` make: Leaving directory `/shared/rivetvm' Total histograms unpacked=20 / selected=1 complete ./REF_ATLAS_2019_I1744201_d02-x01-y01.dat Building rivetvm ... make: Entering directory `/shared/rivetvm' g++ rivetvm.cc -o rivetvm.exe -DNDEBUG -Wfatal-errors -Wl,-rpath /cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt/lib -Wl,-rpath /cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/HepMC/2.06.11/x86_64-centos7-gcc11-opt/lib `/cvmfs/sft.cern.ch/lcg/releases/LCG_104d_ATLAS_10/MCGenerators/rivet/3.1.10/x86_64-centos7-gcc11-opt/bin/rivet-config --cppflags --ldflags --libs` -lHepMC make: Leaving directory `/shared/rivetvm' Run herwig++ 2.5.1 and Rivet ... 
generatorExecString = ./rungen.sh boinc pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760 100000 30 /shared/tmp/tmp.IPuslKhFRO/generator.hepmc rivetExecString = /shared/rivetvm/rivetvm.exe -a ATLAS_2019_I1744201 -i /shared/tmp/tmp.IPuslKhFRO/generator.hepmc -o /shared/tmp/tmp.IPuslKhFRO/flat -H /shared/tmp/tmp.IPuslKhFRO/generator.yoda -d /shared/tmp/tmp.IPuslKhFRO/dump INFO: (display) T4T_DISPLAY= INFO: (display) datdir=/shared/tmp/tmp.IPuslKhFRO/dump INFO: (display) vars=pp z1j 8000 - herwig++ 2.5.1 LHC-UE-EE-2-2760 INFO: display service switched off ===> [rungen] Fri Mar 29 15:11:56 UTC 2024 [boinc pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760 100000 30 /shared/tmp/tmp.IPuslKhFRO/generator.hepmc] Setting environment for herwig++ 2.5.1 ... tree = hepmc2.06.05 tag = grep: /etc/redhat-release: No such file or directory MCGENERATORS=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05 LCG_PLATFORM=x86_64-slc5-gcc43-opt g++ = /shared/tmp/tmp.eldt4Q5K7G/g++ g++ version = 4.3.6 g++ orig = /cvmfs/sft.cern.ch/lcg/external/gcc/4.3.6/x86_64-slc5/bin/g++ AGILE=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/agile/1.4.0/x86_64-slc5-gcc43-opt HEPMC=/cvmfs/sft.cern.ch/lcg/external/HepMC/2.06.05/x86_64-slc5-gcc43-opt AGILE_GEN_PATH=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05 LHAPDF=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/lhapdf/5.8.9/x86_64-slc5-gcc43-opt grep: /etc/redhat-release: No such file or directory INFO: EL9/CC7 compat: herwig++ - added work-around for missing libraries: -rwxr-xr-x 1 0 0 7504 Mar 29 15:11 empty.so lrwxrwxrwx 1 0 0 8 Mar 29 15:11 libreadline.so.5 -> empty.so lrwxrwxrwx 1 0 0 8 Mar 29 15:11 libtermcap.so.2 -> empty.so /shared Input parameters: mode=boinc beam=pp process=z1j energy=8000 params=- specific=- generator=herwig++ version=2.5.1 tune=LHC-UE-EE-2-2760 nevts=100000 seed=30 outfile=/shared/tmp/tmp.IPuslKhFRO/generator.hepmc Prepare temporary directories and files ... 
workd=/shared tmpd=/shared/tmp/tmp.IPuslKhFRO tmp_params=/shared/tmp/tmp.IPuslKhFRO/generator.params Decoding parameters of generator... pTmin = 0 pTmax = 8000 mHatMin = 0 mHatMax = 8000 processCode=z1j beam1=p+ beam2=p+ beam energy = 4000. INFO: steering file template = configuration/herwig++-z1j.params INFO: cache is not active, CACHE= Prepare herwig++ 2.5.1 parameters ... => /shared/tmp/tmp.IPuslKhFRO/generator.params : # based on example from Herwig++ 2.4.2 distribution: # share/Herwig++/TVT.in # Run options: cd /Herwig/Generators set LHCGenerator:NumberOfEvents 100000 set LHCGenerator:RandomNumberGenerator:Seed 30 set LHCGenerator:DebugLevel 0 set LHCGenerator:PrintEvent 1 set LHCGenerator:MaxErrors 100000 # redirect all log output to stdout set LHCGenerator:UseStdout true # do output to a HepMC file cd /Herwig/Generators insert LHCGenerator:AnalysisHandlers 0 /Herwig/Analysis/HepMCFile set /Herwig/Analysis/HepMCFile:PrintEvent 1000000 set /Herwig/Analysis/HepMCFile:Format GenEvent set /Herwig/Analysis/HepMCFile:Filename /shared/tmp/tmp.IPuslKhFRO/generator.hepmc # set /Herwig/Analysis/HepMCFile:Units GeV_mm # Beam parameters: set LHCGenerator:EventHandler:LuminosityFunction:Energy 8000 set LHCGenerator:EventHandler:BeamA /Herwig/Particles/p+ set LHCGenerator:EventHandler:BeamB /Herwig/Particles/p+ set LHCGenerator:MaxErrors -1 # Process setup # Z+1jet production cd /Herwig/MatrixElements insert SimpleQCD:MatrixElements[0] MEZJet DISABLEREADONLY newdef MEZJet:ZDecay ChargedLeptons ## Set cuts ## Use this for hard leading-jets in a certain pT window set /Herwig/Cuts/JetKtCut:MinKT 0*GeV # minimum jet pT set /Herwig/Cuts/JetKtCut:MaxKT 8000*GeV # maximum jet pT # ## Use this for a certain mHat window #set /Herwig/Cuts/QCDCuts:MHatMin 0*GeV # minimum jet mHat #set /Herwig/Cuts/QCDCuts:MHatMax 8000*GeV # maximum jet mHat # Make particles with c*tau > 10 mm stable: set /Herwig/Decays/DecayHandler:MaxLifeTime 10*mm set /Herwig/Decays/DecayHandler:LifeTimeOption 
Average # tune 'LHC-UE-EE-2-2760' parameters: ------------------- #%tuneFile% # Based on LHC tune example from Herwig++ 2.5.1 distribution # share/Herwig++/LHC-UE-EE-2.in ################################################## # Override default MPI parameters ################################################## # Colour reconnection settings set /Herwig/Hadronization/ColourReconnector:ColourReconnection Yes set /Herwig/Hadronization/ColourReconnector:ReconnectionProbability 0.55 # Colour Disrupt settings set /Herwig/Partons/RemnantDecayer:colourDisrupt 0.15 # inverse hadron radius set /Herwig/UnderlyingEvent/MPIHandler:InvRadius 1.1 ## for \sqrt(s) = 2760 GeV # Min KT parameter set /Herwig/UnderlyingEvent/KtCut:MinKT 3.31 # This should always be 2*MinKT!! set /Herwig/UnderlyingEvent/UECuts:MHatMin 6.62 # MPI model settings set /Herwig/UnderlyingEvent/MPIHandler:softInt Yes set /Herwig/UnderlyingEvent/MPIHandler:twoComp Yes set /Herwig/UnderlyingEvent/MPIHandler:DLmode 3 # --------------------------------------------- set /Herwig/UnderlyingEvent/MPIHandler:IdenticalToUE -1 # Run generator cd /Herwig/Generators run TVT LHCGenerator -------------------------------------- HERWIGPP=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt Run herwig++ 2.5.1 ... 
generatorExecString = /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt/bin/Herwig++ read -r /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.1/x86_64-slc5-gcc43-opt/share/Herwig++/HerwigDefaults.rpo /shared/tmp/tmp.IPuslKhFRO/generator.params >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> ThePEG - Toolkit for HEP Event Generation - version 1.7.1 <<<<<<<<<< <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. 
** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. ** An event exception of type ThePEG::Exception occurred while generating event number 1: Failed to generate the shower after 100 attempts in Evolver::showerHardProcess() The event will be discarded. No more warnings of this kind will be reported. It appears to have never got the first event running for some reason. |
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,797,371 RAC: 18,407 |
[boinc pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760 100000 30] Have you searched mcplots for this task? Is it successful for other volunteers? |
Send message Joined: 7 Aug 11 Posts: 95 Credit: 24,473,841 RAC: 29,209 |
Keyword: pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760 (matched 1 of 202704 rows). Result row: run = pp z1j 8000 - - herwig++ 2.5.1 LHC-UE-EE-2-2760, events = 0, attempts = 1, success = 0, failure = 0, unknown = 1. It appears nobody else has run this. |
Send message Joined: 7 Aug 11 Posts: 95 Credit: 24,473,841 RAC: 29,209 |
OK, poking around, I checked the stderr.txt and found this: 07:33:48 AEDT +11:00 2024-04-01: cranky-0.1.4: [INFO] Pausing container Theory_2743-2857700-30_0. Apparently something DID cause it to pause at some point, and I do not have resume capability (wrong sudo version; I tried installing the latest version, but it crashed every unit that ran from then on, so I rolled it back). Since that makes it likely that I'm the one who broke it, I've aborted the unit. |
Send message Joined: 15 Jun 08 Posts: 2520 Credit: 251,911,354 RAC: 128,284 |
A simple estimation: It started Fri Mar 29 15:11:45 UTC 2024 and hasn't processed even a single event. By now it should have processed >35000 events to finish before the 10 day deadline. So, does it appear to hang? Yes, of course it hangs. => cancel it. |
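The arithmetic behind that figure, as a small Python sketch. It assumes a constant event rate, the 10-day limit mentioned earlier in the thread, and the roughly three and a half days of runtime reported for this task. The helper name `events_needed` is made up for this example.

```python
from datetime import timedelta


def events_needed(elapsed, total_events, limit=timedelta(days=10)):
    """Events that must already be done to stay on pace for the limit.

    Uses integer arithmetic on timedeltas, so the result is rounded down.
    """
    return (elapsed * total_events) // limit


# 3.5 days into a 100000-event task with a 10-day limit:
print(events_needed(timedelta(days=3, hours=12), 100_000))  # 35000
```

A task whose log shows far fewer events than this number is unlikely to finish in time unless it is still in a long setup phase.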
Send message Joined: 15 Jun 08 Posts: 2520 Credit: 251,911,354 RAC: 128,284 |
Right. Your system can't use the modern cgroups v2 method (sudo not recent enough) nor is it fully configured to use pause/resume with the old cgroups v1 based method. The log entry just shows that cranky got a pause signal from BOINC. |
Send message Joined: 7 Aug 11 Posts: 95 Credit: 24,473,841 RAC: 29,209 |
As I said, it's been aborted now. I had only left it alone because I've had others run long but were otherwise working normally. I don't make a habit of digging through workunit logs without cause. |
©2024 CERN