Message boards : Theory Application : Hung Theory task?

Glohr

Joined: 13 Jan 24
Posts: 42
Credit: 7,222,623
RAC: 17,981
Message 52602 - Posted: 30 Oct 2025, 6:07:37 UTC

I've had a CPU-bound task (https://lhcathome.cern.ch/lhcathome/result.php?resultid=429904502) running for days with no apparent forward progress. Its log starts with this line:
===> [runRivet] Mon Oct 27 09:00:12 AM UTC 2025 [boinc pp z1j 7000 250 - sherpa 1.2.3 default 100000 346]

The tail of the log has been
Event 37100 ( 9m 41s elapsed / 16m 26s left ) -> ETA: Mon Oct 27 09:31
37100 events processed
Event 37200 ( 9m 43s elapsed / 16m 24s left ) -> ETA: Mon Oct 27 09:31
37200 events processed

for more than 2 days without change. I'll abort the task shortly to free up resources for other things.
My question is: should I do anything differently? Is this an expected occurrence?
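
As a rough cross-check of the log tail quoted above, the last progress line can be parsed and its "left" estimate compared with a simple linear extrapolation. This is only a sketch: the regular expression covers just the "Xm Ys" duration form seen in this thread, and treating the 100000 in the runRivet header as the requested event count is an assumption.

```python
import re

# Pattern for the Sherpa progress lines quoted in this thread. Only the
# "Xm Ys" duration form seen above is handled (an assumption).
PROGRESS = re.compile(
    r"Event (\d+) \( (\d+)m (\d+)s elapsed / (\d+)m (\d+)s left \)"
)


def parse_progress(line):
    """Return (events_done, elapsed_s, left_s), or None if the line does not match."""
    m = PROGRESS.search(line)
    if m is None:
        return None
    done, el_min, el_sec, left_min, left_sec = map(int, m.groups())
    return done, el_min * 60 + el_sec, left_min * 60 + left_sec


def estimated_left(done, elapsed_s, total):
    """Linear extrapolation of the remaining seconds from the average rate so far."""
    return elapsed_s * (total - done) / done


line = "Event 37200 ( 9m 43s elapsed / 16m 24s left ) -> ETA: Mon Oct 27 09:31"
done, elapsed_s, left_s = parse_progress(line)
total = 100000  # assumed to be the event count from the runRivet header line

print(f"{done / total:.1%} done, "
      f"extrapolated {estimated_left(done, elapsed_s, total):.0f} s left, "
      f"log says {left_s} s")
```

Under that assumption the task stopped at about 37 % of its events, and the extrapolated remaining time agrees with the log's own "16m 24s left" to within a second, which suggests the generator was running at a steady rate right up to the point where the output stopped.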
ID: 52602
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2715
Credit: 294,010,569
RAC: 154,307
Message 52603 - Posted: 30 Oct 2025, 6:25:17 UTC - in response to Message 52602.  

Yes, this task got stuck and should be cancelled.
Compared with other Theory tasks, Sherpa tasks are known to have a higher risk of this.
ID: 52603
Erich56

Joined: 18 Dec 15
Posts: 1924
Credit: 151,306,393
RAC: 145,286
Message 52661 - Posted: 15 Nov 2025, 15:09:07 UTC

I have had cases lately where Theory tasks somehow get stuck. When I open the running log, the bottom line says "0 events processed", even after a long time.
Example:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=430315743
Too bad I found out only after 10 days, so a lot of CPU time was spent for nothing.
Any idea what's the problem?
ID: 52661
Glohr

Joined: 13 Jan 24
Posts: 42
Credit: 7,222,623
RAC: 17,981
Message 52678 - Posted: 19 Nov 2025, 5:36:32 UTC - in response to Message 52661.  

At the moment I have two stuck Theory tasks. In BOINC/slots/{5|12}/shared, runRivet.log has a timestamp from a few minutes after the task started, but the heartbeat file's timestamp is current (updated a few minutes ago). One is a sherpa task, the other is a herwig++ task. There seems to be a bug in there somewhere that ought to be fixable. If nothing else, the heartbeat code needs to verify actual progress before setting the heartbeat. Ten days of wasted CPU resources for each task isn't necessary.
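
The timestamp comparison described above can be automated on the volunteer side. The following is a minimal sketch, not part of BOINC or the Theory app: the function name, the thresholds, and the idea of passing in the heartbeat file's path (its name is not given here) are all assumptions for illustration.

```python
import os
import time


def looks_stalled(log_path, heartbeat_path,
                  max_log_age_s=6 * 3600,      # hypothetical threshold
                  max_heartbeat_age_s=1800,    # hypothetical threshold
                  now=None):
    """Return True if the heartbeat is fresh but the log has been silent for a long time.

    A fresh heartbeat suggests the VM itself is alive; an old log mtime
    suggests the scientific app has stopped writing output.
    """
    now = time.time() if now is None else now
    log_age = now - os.path.getmtime(log_path)
    heartbeat_age = now - os.path.getmtime(heartbeat_path)
    vm_alive = heartbeat_age <= max_heartbeat_age_s
    return vm_alive and log_age > max_log_age_s
```

A True result is only a hint, not proof of a hang: some generator phases may legitimately run for a long time without writing anything to the log, so the log-age threshold would have to be chosen generously.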

The first line of one runRivet.log:
===> [runRivet] Fri Nov 7 11:57:33 PM UTC 2025 [boinc pp winclusive 13000 - - sherpa 2.2.9 default 1000 363]

and the tail of that one:
+----------------------------------+
|                                  |
|      CCC  OOO  M   M I X   X     |
|     C    O   O MM MM I  X X      |
|     C    O   O M M M I   X       |
|     C    O   O M   M I  X X      |
|      CCC  OOO  M   M I X   X     |
|                                  |
+==================================+
|  Color dressed  Matrix Elements  |
|     http://comix.freacafe.de     |
|   please cite  JHEP12(2008)039   |
+----------------------------------+
Matrix_Element_Handler::BuildProcesses(): Looking for processes .................................................................................................................................................................................... done ( 47 MB, 6s / 6s ).
Matrix_Element_Handler::InitializeProcesses(): Performing tests .................................................................................................................................................................................... done ( 47 MB, 0s / 0s ).
Initialized the Matrix_Element_Handler for the hard processes.
Initialized the Beam_Remnant_Handler.
Hadron_Decay_Map::Read: Initializing HadronDecays.dat. This may take some time.
Initialized the Hadron_Decay_Handler, Decay model = Hadrons
Initialized the Soft_Photon_Handler.
Variations::InitialiseParametersVector(0 variations){
Named variations:
}
Process_Group::CalculateTotalXSec(): Calculate xs for '2_2__j__j__e-__veb' (Comix)
Starting the calculation at 23:59:46. Lean back and enjoy ... .


The other runRivet first line:
===> [runRivet] Sat Nov 8 01:53:31 AM UTC 2025 [boinc pp z1j 8000 - - herwig++ 2.5.2 default 100000 382]

and the tail of that one:
Run herwig++ 2.5.2 ...
generatorExecString = /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.2/x86_64-slc5-gcc43-opt/bin/Herwig++ read -r /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/herwig++/2.5.2/x86_64-slc5-gcc43-opt/share/Herwig++/HerwigDefaults.rpo /scratch/tmp/tmp.hVon2UA8R0/generator.params
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>> ThePEG - Toolkit for HEP Event Generation - version 1.7.2 <<<<<<<<<<
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
** An event exception of type ThePEG::Exception occurred while generating event number 1:
Failed to generate the shower after 100 attempts in Evolver::showerHardProcess()
The event will be discarded.
No more warnings of this kind will be reported.


These are associated with:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=237411761
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=237545890
ID: 52678
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2715
Credit: 294,010,569
RAC: 154,307
Message 52679 - Posted: 19 Nov 2025, 7:17:01 UTC - in response to Message 52678.  

There's no reason to call this a waste, since your computer shows an overall failure ratio of less than 1 % in mcplots.
Even long runners without a valid result have scientific value, since they may indicate problematic parameter sets.

As for the heartbeat code:
It simply 'touches' the heartbeat file periodically via cron and never does any checks.
Its goal is to identify low-level issues related to the VM itself, e.g. the kernel got stuck, no network, or a disk got lost.
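
To make the distinction concrete, here is a small sketch contrasting an unconditional touch with a progress-gated one. It is not the project's code (which, as described above, is a cron job that touches the file); both function names and the log-size criterion are hypothetical.

```python
from pathlib import Path


def touch_heartbeat(heartbeat: Path) -> None:
    """Unconditional heartbeat: refresh the mtime whatever the app is doing.

    This only shows that the VM can still schedule jobs and write to disk.
    """
    heartbeat.touch()


def touch_if_progress(heartbeat: Path, log: Path, last_size: int) -> int:
    """Hypothetical variant: refresh the heartbeat only if the log has grown.

    Returns the current log size, to be passed in as last_size next time.
    """
    size = log.stat().st_size
    if size > last_size:
        heartbeat.touch()
    return size
```

The first function keeps the heartbeat fresh even when the scientific app writes nothing, which matches the behaviour reported in this thread. The second would let the heartbeat go stale in that case, at the cost of mixing an application-level signal into a check meant for VM-level faults.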
ID: 52679
Glohr

Joined: 13 Jan 24
Posts: 42
Credit: 7,222,623
RAC: 17,981
Message 52680 - Posted: 20 Nov 2025, 11:54:06 UTC - in response to Message 52679.  

The count of failing tasks isn't particularly relevant to the question of resource consumption. Adding up the CPU seconds for the 224 valid Theory tasks on my account currently available through the web interface totals 1.9 million CPU seconds. The 9 error tasks used 2.25 million CPU seconds. Most of the error CPU seconds (1.765 million) were the two tasks recently referenced.
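
The difference between counting tasks and counting CPU time can be worked out directly from the figures above. The snippet uses only those numbers; the variable names are just for illustration.

```python
# Figures quoted in the paragraph above
valid_tasks, valid_cpu_s = 224, 1.9e6
error_tasks, error_cpu_s = 9, 2.25e6
two_stuck_cpu_s = 1.765e6

# Share of error tasks by count versus by CPU time
task_share = error_tasks / (valid_tasks + error_tasks)
cpu_share = error_cpu_s / (valid_cpu_s + error_cpu_s)

# How much of the error CPU time the two stuck tasks account for
stuck_share_of_errors = two_stuck_cpu_s / error_cpu_s

print(f"error tasks by count:    {task_share:.1%}")
print(f"error tasks by CPU time: {cpu_share:.1%}")
print(f"two stuck tasks / error CPU time: {stuck_share_of_errors:.1%}")
```

By count the error tasks are about 3.9 % of the listed tasks, but by CPU time they are about 54 %, and the two stuck tasks alone make up about 78 % of the error CPU time. The count here covers only the tasks currently visible through the web interface, so it need not match the mcplots ratio.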

My concern is that the code fails to detect and correctly handle an obvious error condition in a timely fashion.
ID: 52680
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2715
Credit: 294,010,569
RAC: 154,307
Message 52681 - Posted: 20 Nov 2025, 17:08:23 UTC - in response to Message 52680.  

There's no 'obvious error' reported back to the project.
In cases like that there is no log file from the scientific app sent back to the project.
Hence, there is nothing to analyse and the task is either marked as 'failed' or 'lost' after the due date.

Even the log snippets you posted do not clearly explain if/why the tasks got stuck.

So, how should the project decide what caused the failure?
It could be any of the following (the list may be incomplete):
- hardware
- the OS
- VirtualBox
- BOINC
- vboxwrapper
- data from CVMFS
- scientific app

From the project's perspective there's only the overall task failure rate for the computer itself.
As already mentioned, for this computer it is less than 1 %, covering all possible causes.
ID: 52681

©2025 CERN