Message boards : Theory Application : Hung Theory task?

---
Joined: 13 Jan 24 · Posts: 42 · Credit: 7,222,623 · RAC: 17,981

I've had a CPU-bound task (https://lhcathome.cern.ch/lhcathome/result.php?resultid=429904502) running for days with no apparent forward progress. Its initial log line:

    ===> [runRivet] Mon Oct 27 09:00:12 AM UTC 2025 [boinc pp z1j 7000 250 - sherpa 1.2.3 default 100000 346]

The tail of the log has read

    Event 37100 ( 9m 41s elapsed / 16m 26s left ) -> ETA: Mon Oct 27 09:31

for more than two days without change. I'll abort the task shortly to free up resources for other things. My question is: should I do anything differently? Is this an expected occurrence?

---
Joined: 15 Jun 08 · Posts: 2715 · Credit: 294,010,569 · RAC: 154,307

Yes, this task got stuck and should be cancelled. Unlike other Theory tasks, Sherpa tasks are known to carry a higher risk of this.

---
Joined: 18 Dec 15 · Posts: 1924 · Credit: 151,306,393 · RAC: 145,286

I have lately had cases where Theory tasks somehow get stuck. When I open the running log, the bottom says "0 events processed", even after a long time. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=430315743

Too bad that I found out only after 10 days, so a lot of CPU time went for nothing. Any idea what the problem is?

---
Joined: 13 Jan 24 · Posts: 42 · Credit: 7,222,623 · RAC: 17,981

At the moment I have two stuck Theory tasks. In BOINC/slots/{5|12}/shared, runRivet.log carries a timestamp from a few minutes after the task started, while the heartbeat file's timestamp is current as of a few minutes ago. One is a sherpa task, the other a herwig++ task. There seems to be a bug in there somewhere that ought to be fixable. If nothing else, the heartbeat code needs to verify actual progress before setting the heartbeat; ten days of wasted CPU per task isn't necessary.

The first line of one runRivet.log:

    ===> [runRivet] Fri Nov 7 11:57:33 PM UTC 2025 [boinc pp winclusive 13000 - - sherpa 2.2.9 default 1000 363]

and its tail:

    +----------------------------------+

The other runRivet.log's first line:

    ===> [runRivet] Sat Nov 8 01:53:31 AM UTC 2025 [boinc pp z1j 8000 - - herwig++ 2.5.2 default 100000 382]

and its tail:

    Run herwig++ 2.5.2 ...

These are associated with:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=237411761
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=237545890
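To sketch the idea (hypothetical code, not the project's actual heartbeat mechanism; the file names are assumptions based on the slot layout above): refresh the heartbeat only when the job log has actually grown since the last check.

```python
import os
import time

LOG = "shared/runRivet.log"      # assumed job log path
HEARTBEAT = "shared/heartbeat"   # assumed heartbeat file
STATE = "shared/.last_log_size"  # remembers the last observed log size
INTERVAL = 600                   # seconds between checks

def log_size() -> int:
    try:
        return os.stat(LOG).st_size
    except FileNotFoundError:
        return -1

while True:
    size = log_size()
    try:
        with open(STATE) as f:
            last = int(f.read().strip() or "-1")
    except FileNotFoundError:
        last = -1

    if size > last:
        # Real forward progress: remember it and refresh the heartbeat.
        with open(STATE, "w") as f:
            f.write(str(size))
        with open(HEARTBEAT, "a"):
            pass
        os.utime(HEARTBEAT, None)  # the 'touch'
    # Otherwise leave the heartbeat stale so a watchdog can flag the task.

    time.sleep(INTERVAL)
```

A log that keeps growing without producing events would still slip through, so a real check would probably also need an upper bound on time per event.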

---
Joined: 15 Jun 08 · Posts: 2715 · Credit: 294,010,569 · RAC: 154,307

There's no reason to speak of waste, since your computer shows an overall failure ratio of less than 1 % in mcplots. Even those long runners without a valid result are of scientific worth, since they may indicate problematic parameter sets.

As for the heartbeat code: it simply 'touches' the heartbeat file periodically via cron and never does any checks. Its goal is to identify low-level issues related to the VM itself, e.g. the kernel got stuck, no network, the disk got lost.
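In other words (a minimal sketch, not the actual cron job; the heartbeat file name is an assumption), the periodic touch amounts to no more than this:

```python
import os

HEARTBEAT = "shared/heartbeat"  # assumed file name

# Unconditional 'touch': create the file if missing, update its mtime.
# This proves the VM's kernel, disk and scheduler are alive; it says
# nothing about whether the scientific app is making progress.
with open(HEARTBEAT, "a"):
    pass
os.utime(HEARTBEAT, None)
```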

---
Joined: 13 Jan 24 · Posts: 42 · Credit: 7,222,623 · RAC: 17,981

The count of failing tasks isn't particularly relevant to the question of resource consumption. Adding up the CPU seconds for the 224 valid Theory tasks currently visible on my account through the web interface gives 1.9 million CPU seconds; the 9 error tasks used 2.25 million CPU seconds, and most of that (1.765 million) came from the two tasks I referenced above. My concern is that the code fails to detect and handle an obvious error condition in a timely fashion.
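A quick back-of-the-envelope check of those figures:

```python
# CPU-time figures quoted above, in CPU seconds.
valid_cpu = 1_900_000   # 224 valid Theory tasks
error_cpu = 2_250_000   # 9 error tasks
stuck_two = 1_765_000   # the two stuck tasks referenced earlier

total = valid_cpu + error_cpu
print(f"error share of CPU time:  {error_cpu / total:.0%}")   # ~54%
print(f"two stuck tasks alone:    {stuck_two / total:.0%}")   # ~43%
print(f"failure rate by count:    {9 / (224 + 9):.1%}")       # ~3.9%
```

Measured in CPU time rather than task count, the errors dominate, which is the point here.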

---
Joined: 15 Jun 08 · Posts: 2715 · Credit: 294,010,569 · RAC: 154,307

There's no 'obvious error' reported back to the project. In cases like this, no log file from the scientific app is sent back, so there is nothing to analyse, and the task is simply marked 'failed', or 'lost' after the due date. Even the log snippets you posted do not clearly explain if or why the tasks got stuck. So how should the project decide what caused a failure? It could be any of the following (the list may be incomplete):

- hardware
- the OS
- VirtualBox
- BOINC
- vboxwrapper
- data from CVMFS
- the scientific app

From the project's perspective there's only the overall task failure rate for the computer itself. As already mentioned, for this computer it is less than 1 %, covering all possible causes.