Message boards : Number crunching : About the current Theory strategy. A significant percentage of potentially good WUs may be "burned" by bad hosts.

metalius
Joined: 3 Oct 06
Posts: 115
Credit: 9,052,815
RAC: 2,552
Message 53333 - Posted: 31 Mar 2026, 13:53:38 UTC
Last modified: 31 Mar 2026, 13:55:24 UTC

Hello Theory team!

Right now, you are using a 3-try strategy (max_error_tasks=3) while the project uses Virtual Machine (VM) technology. From what I have noticed, many potentially valid Theory WUs are "burned" by "bad" hosts, just because the number of allowed tries is too small, IMHO.

Here is an example – the failure of "Theory_2922-4903626-734":
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=240208680

Host 1: The task failed immediately because of a badly configured Podman (host permission issue).
Host 2: It also failed immediately because of a badly configured Docker (host service issue).
Host 3: Computations started and ran successfully, but the task died trying to wake up after a computer reboot.

In summary, the WU failed because two hosts were badly configured and the third one hit VM instability. The WU itself was probably OK.

A little bit of history to finish.
I started 20 years ago, when the Sixtrack simulation strategy was "Minimum quorum – 3" and "Initial replication – 5". How powerful were CPUs in 2006? Yet I really don't remember anyone in the BOINC volunteer community complaining that LHC@home was a resource-wasting project.

Why not increase the Theory WU "max # of error/total/success tasks" right NOW? To 5, 10 or 20? Is this a technical problem?
ID: 53333
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1551
Credit: 10,067,859
RAC: 801
Message 53334 - Posted: 31 Mar 2026, 14:10:39 UTC - in response to Message 53333.  

LHC@home is not a resource wasting project.
The badly configured client hosts are wasting resources only for themselves.
For LHC@home this is not a real problem. The workunits are just envelopes, and the max # of error/total/success tasks could just as well be 1, 1, 1.
ID: 53334
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2739
Credit: 301,811,487
RAC: 83,089
Message 53335 - Posted: 31 Mar 2026, 14:23:04 UTC - in response to Message 53333.  

What really counts is the success rate reported to mcplots.

The runspec of the given WU is:
pp zinclusive 13000 - - pythia8 8.240 tune-4c 100000 734

Although a complete WU has been lost from the BOINC perspective, that runspec has an mcplots success rate of 75 % and has successfully processed 600000 events.
This is good enough to mark it a success from the science perspective.
Hence, it doesn't need additional processing.
ID: 53335
metalius
Joined: 3 Oct 06
Posts: 115
Credit: 9,052,815
RAC: 2,552
Message 53336 - Posted: 31 Mar 2026, 14:38:56 UTC - in response to Message 53334.  

In reply to Crystal Pellet's message of 31 Mar 2026:
LHC@home is not a resource wasting project.


Excuse me, but where exactly did I write that "LHC@home is a resource-wasting project"?
I was saying the exact opposite: "I don't understand why you are saving/restricting these resources (the retry limits) so drastically, IMHO."

By the way, Host 3 was mine.
When that task died, I didn't feel like I had contributed to science. On the contrary, I felt like I had harmed it.
This is a volunteer's frustration – just try to understand...
ID: 53336
metalius
Joined: 3 Oct 06
Posts: 115
Credit: 9,052,815
RAC: 2,552
Message 53337 - Posted: 31 Mar 2026, 14:44:12 UTC - in response to Message 53335.  

In reply to computezrmle's message of 31 Mar 2026:
What really counts is the success rate reported to mcplots.


Thank you!

This is the exact piece of information that was missing.
Knowing that the intermediate progress (the 75% / 600,000 events) was successfully reported to mcplots and used for science completely changes the perspective.
As a volunteer, seeing a "Computation Error" in BOINC makes you feel like you wasted electricity and let the project down.
If the science is safe and the data was collected despite the VM crash, then my frustration was indeed misplaced.

I appreciate you taking the time to check the actual runspec and clarify this.
Keep up the good work!
ID: 53337
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2739
Credit: 301,811,487
RAC: 83,089
Message 53338 - Posted: 31 Mar 2026, 15:00:40 UTC - in response to Message 53337.  

I'm not sure if you misunderstood the runspec success rate.

It means:
- that runspec has been used to create 8 BOINC WUs
- 1 WU (yours) failed completely (3 failed tasks in total) for whatever reason; mcplots ignores them for science
- 1 WU has been marked "lost", which usually means not returned before an mcplots due date (NOT a BOINC due date!)
- 6 WUs (from other hosts) returned a BOINC success and were also marked as success by mcplots

Only the last ones are used for science.
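
The arithmetic behind the 75 % and 600000 figures can be sketched roughly as follows. This is only an illustration of the counts listed above, not mcplots code: the function name is hypothetical, and it assumes that every successful WU contributes the full 100000 events requested by the runspec.

```python
def runspec_summary(succeeded, failed, lost, events_per_wu):
    """Return (success rate in percent, events usable for science).

    Hypothetical helper: only WUs counted as 'succeeded' contribute events.
    """
    total = succeeded + failed + lost
    rate = 100.0 * succeeded / total
    return rate, succeeded * events_per_wu


# Counts from the post: 8 WUs in total, 6 succeeded, 1 failed, 1 lost.
rate, events = runspec_summary(succeeded=6, failed=1, lost=1, events_per_wu=100_000)
print(f"success rate: {rate:.0f} %, events: {events}")
```

With these counts the failed WU lowers the rate but adds nothing to the event total, which is why a single "burned" WU does not by itself endanger the runspec.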
ID: 53338
metalius
Joined: 3 Oct 06
Posts: 115
Credit: 9,052,815
RAC: 2,552
Message 53339 - Posted: 31 Mar 2026, 15:43:15 UTC - in response to Message 53338.  

In reply to computezrmle's message of 31 Mar 2026:
I'm not sure if you misunderstood the runspec success rate.

You are right - I imagined that your tasks use a trickle-up tactic, as used in projects with long-running tasks, where intermediate results are saved and maybe even go to the scientific DB.
I finally understand - you do it differently.

But... I am really grateful to you.
Your explanation takes away the biggest headache. This is a surprise for me! :)

The nastiest case is (or rather - was) a task like this:
its suffix is "_2", meaning the last WU try;
progress is 98+ percent, but the further it goes, the slower it moves;
stderr.txt shows signs of a "zombie"...

Now I know what to do - hit it on the head without any hesitation... :)
ID: 53339
Toggleton

Joined: 4 Mar 17
Posts: 44
Credit: 12,909,202
RAC: 7,318
Message 53342 - Posted: 31 Mar 2026, 18:18:09 UTC - in response to Message 53339.  
Last modified: 31 Mar 2026, 18:25:40 UTC

progress is 98+ percent, but the further it goes, the slower it moves;
stderr.txt shows signs of a "zombie"...

The progress bar in BOINC Manager is not useful for most LHC tasks. Looking at the tasks you have finished so far, most of them ran for under 10 hours. But Theory tasks can run far longer than that, so you need to check whether they are still writing to the log.

I think Windows VirtualBox tasks work with the [Show graphics] button in BOINC Manager, where you can then look at the logs in the browser.

I don't think stderr is really useful. On Linux I always check boinc/slot/x/shared/runRivet.log, where it first counts up from "Integrate 1 of 760:" and then, in a second step, up to "99000 events processed" (the number of events in the second step can differ from the usual 100 000). The http://localhost:123456/logs/running.log page opened by the [Show Graphics] button should show the same.

===> [runRivet] Tue Mar 31 07:45:45 UTC 2026 [boinc pp z1j 13000 150 - herwig7 7.2.1 nlo-pw 35000 699]
So my task here should process only 35 000 events once the "Integrate 1 of 760" step has finished.

Such Herwig tasks can take multiple days (up to 10 days in rare cases). First check whether they are really zombies or just long runners that are still counting up in the logs. The Herwig caretaking forum thread is here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6251
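
For anyone who prefers a script over reading the log by eye, the check described above could be sketched like this. It is a minimal sketch, assuming the log really contains a "[boinc ...]" runspec line and "N events processed" lines shaped like the excerpts quoted in this post; the helper names and the two sample progress lines are made up for illustration.

```python
import os
import re
import time

# Patterns assumed from the log excerpts quoted above.
RUNSPEC_RE = re.compile(r"\[boinc ([^\]]+)\]")
EVENTS_RE = re.compile(r"(\d+) events processed")


def requested_events(log_text):
    """Events requested by the runspec (second-to-last field), or None."""
    match = RUNSPEC_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1).split()[-2])


def processed_events(log_text):
    """Highest 'N events processed' count seen so far (0 if none yet)."""
    counts = [int(n) for n in EVENTS_RE.findall(log_text)]
    return max(counts, default=0)


def seconds_since_last_write(path):
    """Age of the log file's last modification, in seconds."""
    return time.time() - os.path.getmtime(path)


# Made-up sample: the runspec line is from the post, the counts are not.
sample = (
    "===> [runRivet] Tue Mar 31 07:45:45 UTC 2026 "
    "[boinc pp z1j 13000 150 - herwig7 7.2.1 nlo-pw 35000 699]\n"
    "12000 events processed\n"
    "13000 events processed\n"
)
print(processed_events(sample), "of", requested_events(sample))
```

A count of 0 is normal while the task is still in the "Integrate" step. A task only looks suspicious when the count stops moving and `seconds_since_last_write` on the real runRivet.log keeps growing; what counts as "too long" is left open here.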
ID: 53342
metalius
Joined: 3 Oct 06
Posts: 115
Credit: 9,052,815
RAC: 2,552
Message 53343 - Posted: 31 Mar 2026, 18:53:47 UTC - in response to Message 53342.  

In reply to Toggleton's message of 31 Mar 2026:
The progress bar in BOINC Manager is not useful for most LHC tasks...

Thank you for the "stethoscope". :)
I know about those BM progress bar hallucinations, but I will definitely try your recommendations. I am making notes. :)
Thank you again!
ID: 53343
metalius
Joined: 3 Oct 06
Posts: 115
Credit: 9,052,815
RAC: 2,552
Message 53361 - Posted: 2 Apr 2026, 11:43:56 UTC - in response to Message 53342.  
Last modified: 2 Apr 2026, 11:50:52 UTC

In reply to Toggleton's message of 31 Mar 2026:
I think Windows VirtualBox tasks work with the [Show graphics] button in BOINC Manager, where you can then look at the logs in the browser.

Hello!
The "stethoscope" works perfectly.
The best way, IMHO, is using "Show graphics" because you can watch the logs almost in real-time. By watching them, I become "smarter" not by the day, but by the hour. I thank the Theory team members who made this tool.
The "stethoscope" works for all tasks except POWHEG.
Unlike the friendly Sherpa, which always writes "Lean back and enjoy...", POWHEG is arrogant and silent, like David Anderson: "Don't you see that I am calculating? It is not for you – an earthworm – to understand this super-high mathematics that I am running here. And it is not your business how I am doing this and that." :)
ID: 53361
metalius
Joined: 3 Oct 06
Posts: 115
Credit: 9,052,815
RAC: 2,552
Message 53362 - Posted: 2 Apr 2026, 14:01:58 UTC

Hello again!

I started this thread, so I will keep writing here until I am bored or an admin hits me with a brick... :)
I am not sure yet, but... It is about Pythia 8.

I needed to reboot my computer. I was sure that I did it safely: I suspended all Theory tasks (all active tasks were Pythia 8). I waited until all VBox processes were closed and gone from the system. Then I rebooted.
After the reboot, I started to wake them up one by one to avoid disk I/O or other possible problems.
The BOINC manager showed that everything was great – the progress bars continued to grow.
Victory? Unfortunately, no.
I checked the logs and... I saw that all Pythia 8 tasks heroically started from ZERO.

As I said, I am still checking this. But if Pythia 8 has no hard checkpoints, this is a real "hemorrhoid" for the volunteer. If all types of tasks do the same, then [censored]... :)
ID: 53362
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1551
Credit: 10,067,859
RAC: 801
Message 53364 - Posted: 2 Apr 2026, 15:31:30 UTC - in response to Message 53362.  
Last modified: 2 Apr 2026, 15:44:56 UTC

In reply to metalius's message of 2 Apr 2026:
.
.
I needed to reboot my computer. I was sure that I did it safely: I suspended all Theory tasks (all active tasks were Pythia 8). I waited until all VBox processes were closed and gone from the system. Then I rebooted.
How safely?
One should suspend the running tasks one after another.
To be sure the sequence is:
1. In BOINC Manager, under computing preferences, untick "Leave non-GPU tasks in memory while suspended".
This forces the VM task to save its state to disk instead of keeping the suspended task in memory.
2. Suspend all not yet started tasks.
3. Suspend 1 running task.
4. Open VirtualBox Manager - Tab Machines (depending on VirtualBox version) and wait until the VM of the suspended task is saved to disk.
When the VM gets the stopped state and not the saved state, the task will start from scratch or will even give a computation error after a resume.
Repeat 3 and 4 for each running task.
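
The wait in step 4 can also be checked from a script instead of the VirtualBox Manager window. This is a minimal sketch, assuming `VBoxManage showvminfo <name> --machinereadable` prints a `VMState="..."` line and that the states of interest appear there as "saved" and "poweroff"; verify both on your VirtualBox version. The function names and the sample output are illustrative, and the VM name is whatever `VBoxManage list vms` shows for the task.

```python
import subprocess


def parse_vm_state(showvminfo_output):
    """Extract the VMState value from machine-readable showvminfo output."""
    for line in showvminfo_output.splitlines():
        # 'VMStateChangeTime=' does not match because of the '=' in the prefix.
        if line.startswith("VMState="):
            return line.split("=", 1)[1].strip().strip('"')
    return None


def vm_state(vm_name):
    """Ask VBoxManage for one VM's state (needs VBoxManage on PATH)."""
    result = subprocess.run(
        ["VBoxManage", "showvminfo", vm_name, "--machinereadable"],
        capture_output=True, text=True, check=True,
    )
    return parse_vm_state(result.stdout)


def safe_to_reboot(state):
    """Only a VM whose state was saved to disk is expected to resume."""
    return state == "saved"


# Made-up sample output, shaped like the machine-readable format.
sample_output = (
    'name="boinc_example"\n'
    'VMState="saved"\n'
    'VMStateChangeTime="2026-04-02T15:00:00.000000000"\n'
)
print(parse_vm_state(sample_output), safe_to_reboot(parse_vm_state(sample_output)))
```

Calling `vm_state(name)` after each suspend and continuing only once it returns "saved" mirrors repeating steps 3 and 4 by hand.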
ID: 53364
metalius
Joined: 3 Oct 06
Posts: 115
Credit: 9,052,815
RAC: 2,552
Message 53365 - Posted: 2 Apr 2026, 16:46:24 UTC - in response to Message 53364.  

In reply to Crystal Pellet's message of 2 Apr 2026:
4. Open VirtualBox Manager - Tab Machines (depending on VirtualBox version) and wait until the VM of the suspended task is saved to disk.
When the VM gets the stopped state and not the saved state, the task will start from scratch or will even give a computation error after a resume.

Thank you very much. I will try Step 4 – I did not know about this "feature".
But I have a bad feeling: the "Saved" status might not show up, and I will get "Stopped" instead. That is likely what happened during my reboot, and I will be powerless to do anything...

OK... At least 10% less "hemorrhoid" now. That comes from knowing: "Don't stress, don't go crazy. It will not be different..." :)
For a bit of fun: I see crazy BOINC Manager hallucinations. BM shows 99% progress and looks stuck, but the real progress in the logs is less than 50%. BM has no idea that the task has started over a second time. Before I knew about the logs, I killed maybe 5 tasks like this, thinking they were "zombies" in an infinite loop. :)

Thanks to the good spirit who taught me how to use logs!
ID: 53365

©2026 CERN