Message boards : Theory Application : Hung Theory task?

---
Joined: 13 Jan 24 · Posts: 42 · Credit: 7,222,623 · RAC: 17,981

I've had a CPU-bound task (https://lhcathome.cern.ch/lhcathome/result.php?resultid=429904502) running for days with no apparent forward progress. Its initial log line:

    ===> [runRivet] Mon Oct 27 09:00:12 AM UTC 2025 [boinc pp z1j 7000 250 - sherpa 1.2.3 default 100000 346]

The tail of the log has read

    Event 37100 ( 9m 41s elapsed / 16m 26s left ) -> ETA: Mon Oct 27 09:31

for more than two days without change. I'll abort the task shortly to free up resources for other things. My question is: should I do anything differently? Is this an expected occurrence?

---
Joined: 15 Jun 08 · Posts: 2715 · Credit: 294,010,569 · RAC: 154,307

Yes, this task got stuck and should be cancelled. Unlike other Theory tasks, Sherpa tasks are known to carry a higher risk of this.

---
Joined: 18 Dec 15 · Posts: 1924 · Credit: 151,306,393 · RAC: 145,286

I have lately had cases where Theory tasks somehow get stuck. When I open the running log, the bottom says "0 events processed", even after a long time. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=430315743

Too bad that I found out only after 10 days, so a lot of CPU time went for nothing. Any idea what the problem is?

---
Joined: 13 Jan 24 · Posts: 42 · Credit: 7,222,623 · RAC: 17,981

At the moment I have two stuck Theory tasks. In BOINC/slots/{5|12}/shared, runRivet.log carries a timestamp from a few minutes after the task started, while the heartbeat file's timestamp is current as of a few minutes ago. One is a sherpa task, the other a herwig++ task. There seems to be a bug in there somewhere that ought to be fixable. If nothing else, the heartbeat code needs to verify actual progress before setting the heartbeat; ten days of wasted CPU per task isn't necessary.

The first line of one runRivet.log:

    ===> [runRivet] Fri Nov 7 11:57:33 PM UTC 2025 [boinc pp winclusive 13000 - - sherpa 2.2.9 default 1000 363]

and its tail:

    +----------------------------------+

The other runRivet.log's first line:

    ===> [runRivet] Sat Nov 8 01:53:31 AM UTC 2025 [boinc pp z1j 8000 - - herwig++ 2.5.2 default 100000 382]

and its tail:

    Run herwig++ 2.5.2 ...

These are associated with:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=237411761
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=237545890
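To sketch the idea (hypothetical code, not the project's actual heartbeat mechanism; the file names are assumptions based on the slot layout above): refresh the heartbeat only when the job log has actually grown since the last check.

```python
import os
import time

LOG = "shared/runRivet.log"      # assumed job log path
HEARTBEAT = "shared/heartbeat"   # assumed heartbeat file
STATE = "shared/.last_log_size"  # remembers the last observed log size
INTERVAL = 600                   # seconds between checks

def log_size() -> int:
    try:
        return os.stat(LOG).st_size
    except FileNotFoundError:
        return -1

while True:
    size = log_size()
    try:
        with open(STATE) as f:
            last = int(f.read().strip() or "-1")
    except FileNotFoundError:
        last = -1

    if size > last:
        # Real forward progress: remember it and refresh the heartbeat.
        with open(STATE, "w") as f:
            f.write(str(size))
        with open(HEARTBEAT, "a"):
            pass
        os.utime(HEARTBEAT, None)  # the 'touch'
    # Otherwise leave the heartbeat stale so a watchdog can flag the task.

    time.sleep(INTERVAL)
```

A log that keeps growing without producing events would still slip through, so a real check would probably also need an upper bound on time per event.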

---
Joined: 15 Jun 08 · Posts: 2715 · Credit: 294,010,569 · RAC: 154,307

There's no reason to speak of waste, since your computer shows an overall failure ratio of less than 1 % in mcplots. Even those long runners without a valid result are of scientific worth, since they may indicate problematic parameter sets.

As for the heartbeat code: it simply 'touches' the heartbeat file periodically via cron and never does any checks. Its goal is to identify low-level issues related to the VM itself, e.g. the kernel got stuck, no network, the disk got lost.
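In other words (a minimal sketch, not the actual cron job; the heartbeat file name is an assumption), the periodic touch amounts to no more than this:

```python
import os

HEARTBEAT = "shared/heartbeat"  # assumed file name

# Unconditional 'touch': create the file if missing, update its mtime.
# This proves the VM's kernel, disk and scheduler are alive; it says
# nothing about whether the scientific app is making progress.
with open(HEARTBEAT, "a"):
    pass
os.utime(HEARTBEAT, None)
```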

---
Joined: 13 Jan 24 · Posts: 42 · Credit: 7,222,623 · RAC: 17,981

The count of failing tasks isn't particularly relevant to the question of resource consumption. Adding up the CPU seconds for the 224 valid Theory tasks currently visible on my account through the web interface gives 1.9 million CPU seconds; the 9 error tasks used 2.25 million CPU seconds, and most of that (1.765 million) came from the two tasks I referenced above. My concern is that the code fails to detect and handle an obvious error condition in a timely fashion.
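A quick back-of-the-envelope check of those figures:

```python
# CPU-time figures quoted above, in CPU seconds.
valid_cpu = 1_900_000   # 224 valid Theory tasks
error_cpu = 2_250_000   # 9 error tasks
stuck_two = 1_765_000   # the two stuck tasks referenced earlier

total = valid_cpu + error_cpu
print(f"error share of CPU time:  {error_cpu / total:.0%}")   # ~54%
print(f"two stuck tasks alone:    {stuck_two / total:.0%}")   # ~43%
print(f"failure rate by count:    {9 / (224 + 9):.1%}")       # ~3.9%
```

Measured in CPU time rather than task count, the errors dominate, which is the point here.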

---
Joined: 15 Jun 08 · Posts: 2715 · Credit: 294,010,569 · RAC: 154,307

There's no 'obvious error' reported back to the project. In cases like this, no log file from the scientific app is sent back, so there is nothing to analyse, and the task is simply marked 'failed', or 'lost' after the due date. Even the log snippets you posted do not clearly explain if or why the tasks got stuck. So how should the project decide what caused a failure? It could be any of the following (the list may be incomplete):

- hardware
- the OS
- VirtualBox
- BOINC
- vboxwrapper
- data from CVMFS
- the scientific app

From the project's perspective there's only the overall task failure rate for the computer itself. As already mentioned, for this computer it is less than 1 %, covering all possible causes.