Thread 'About the current Theory strategy. A significant percentage of potentially good WUs may be "burned" by bad hosts.'

Author	Message
metalius Send message Joined: 3 Oct 06 Posts: 122 Credit: 9,378,126 RAC: 8,193	Message 53333 - Posted: 31 Mar 2026, 13:53:38 UTC Last modified: 31 Mar 2026, 13:55:24 UTC Hello Theory team! Right now, you are using a 3-try strategy (max_error_tasks=3) while the project uses Virtual Machine (VM) technology. From what I have noticed, many potentially valid Theory WUs are "burned" by "bad" hosts, just because the number of allowed tries is too small, IMHO. Here is an example – the failure of "Theory_2922-4903626-734": https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=240208680 Host 1: The task failed immediately because of a badly configured Podman (host permission issue). Host 2: It also failed immediately because of a badly configured Docker (host service issue). Host 3: Computations started and ran successfully, but the task died trying to wake up after a computer reboot. In summary, the WU failed because: two cases were badly configured host systems, and the third was VM instability. The WU itself was probably OK. A little bit of history for the end. I started 20 years ago when the Sixtrack simulation strategy was "Minimum quorum – 3" and "Initial replication – 5". How powerful were CPUs in 2006??? But I really don't remember anyone in the BOINC volunteer community making a noise that LHC@home was a resource-wasting project. Why not increase the Theory WU "max # of error/total/success tasks" right NOW? To 5, 10 or 20? Is this a technical problem? ID: 53333 · Reply Quote
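A minimal sketch of the arithmetic behind the "too few tries" concern, under a loudly stated assumption: every try fails independently with the same probability. The failure rate of 0.2 is a made-up illustration, not a measured figure for LHC@home, and `wu_loss_probability` is a hypothetical helper name.

```python
def wu_loss_probability(host_failure_rate, max_error_tasks):
    """Chance that every allowed try fails, assuming independent tries."""
    return host_failure_rate ** max_error_tasks

# Hypothetical failure rate: 1 in 5 tries ends in an error.
ASSUMED_FAILURE_RATE = 0.2

for tries in (3, 5, 10):
    loss = wu_loss_probability(ASSUMED_FAILURE_RATE, tries)
    print(f"max_error_tasks={tries}: WU lost with probability {loss:.6f}")
```

Under this simple model each extra allowed try multiplies the loss probability by the failure rate. Real failures are not independent (a badly configured host can fail many tasks in a row), so the numbers only illustrate the trend.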

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,127,458 RAC: 1,498	Message 53334 - Posted: 31 Mar 2026, 14:10:39 UTC - in response to Message 53333. LHC@home is not a resource-wasting project. The badly configured client hosts are wasting resources for themselves. For LHC@home this is not a real problem. The workunits are just envelopes, and the max # of error/total/success tasks could also be 1, 1, 1. ID: 53334 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2761 Credit: 307,186,427 RAC: 134,844	Message 53335 - Posted: 31 Mar 2026, 14:23:04 UTC - in response to Message 53333. What really counts is the success rate reported to mcplots. The runspec of the given WU is: pp zinclusive 13000 - - pythia8 8.240 tune-4c 100000 734 Although a complete WU has been lost from the BOINC perspective, that runspec has an mcplots success rate of 75 % and has successfully processed 600000 events. This is good enough to mark it a success from the science perspective. Hence, it doesn't need additional processing. ID: 53335 · Reply Quote

metalius Send message Joined: 3 Oct 06 Posts: 122 Credit: 9,378,126 RAC: 8,193	Message 53336 - Posted: 31 Mar 2026, 14:38:56 UTC - in response to Message 53334. In reply to Crystal Pellet's message of 31 Mar 2026: LHC@home is not a resource wasting project. Excuse me, but where exactly did I write that "LHC@home is a resource-wasting project"? I was saying the exact opposite: "I don't understand why you are saving/restricting these resources (the retry limits) so drastically, IMHO." By the way, Host 3 was mine. When that task died, I didn't feel like I had contributed to science. On the contrary, I felt like I had harmed it. This is a volunteer's frustration – just try to understand... ID: 53336 · Reply Quote

metalius Send message Joined: 3 Oct 06 Posts: 122 Credit: 9,378,126 RAC: 8,193	Message 53337 - Posted: 31 Mar 2026, 14:44:12 UTC - in response to Message 53335. In reply to computezrmle's message of 31 Mar 2026: What really counts is the success rate reported to mcplots. Thank you! This is the exact piece of information that was missing. Knowing that the intermediate progress (the 75% / 600,000 events) was successfully reported to mcplots and used for science completely changes the perspective. As a volunteer, seeing a "Computation Error" in BOINC makes you feel like you wasted electricity and let the project down. If the science is safe and the data was collected despite the VM crash, then my frustration was indeed misplaced. I appreciate you taking the time to check the actual runspec and clarify this. Keep up the good work! ID: 53337 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2761 Credit: 307,186,427 RAC: 134,844	Message 53338 - Posted: 31 Mar 2026, 15:00:40 UTC - in response to Message 53337. I'm not sure if you misunderstood the runspec success rate. It means:
- that runspec has been used to create 8 BOINC WUs
- 1 WU (yours) failed completely (3 failed tasks in total) for whatever reason; mcplots ignores them for science
- 1 WU has been marked "lost", which usually means not returned before an mcplots due date (NOT a BOINC due date!)
- 6 WUs (from other hosts) returned a BOINC success and were also marked as success by mcplots
Only the last ones are used for science. ID: 53338 · Reply Quote
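The breakdown above can be checked with a few lines of arithmetic. This is only a sketch: the outcome labels and the `summarize` helper are hypothetical names, and it assumes each successful WU contributes the full 100000 events given in the runspec.

```python
from collections import Counter

# Outcomes of the 8 BOINC WUs created from the runspec, as listed above.
wu_outcomes = ["success"] * 6 + ["failed"] + ["lost"]

# Event count per WU, taken from the runspec (assumed to apply to every WU).
EVENTS_PER_WU = 100_000

def summarize(outcomes, events_per_wu):
    """Return (success rate, events from successful WUs)."""
    counts = Counter(outcomes)
    success_rate = counts["success"] / len(outcomes)
    events = counts["success"] * events_per_wu
    return success_rate, events

rate, events = summarize(wu_outcomes, EVENTS_PER_WU)
print(f"success rate: {rate:.0%}, events used for science: {events}")
```

With 6 of 8 WUs successful this reproduces the 75 % success rate and the 600000 events quoted earlier in the thread. The failed and lost WUs contribute nothing.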

metalius Send message Joined: 3 Oct 06 Posts: 122 Credit: 9,378,126 RAC: 8,193	Message 53339 - Posted: 31 Mar 2026, 15:43:15 UTC - in response to Message 53338. In reply to computezrmle's message of 31 Mar 2026: I'm not sure if you misunderstood the runspec success rate. You are right - I imagined that your tasks use a trickle-up tactic, used in projects with long-run tasks, where intermediate results are saved and maybe even go to the scientific DB. I understand finally - you do it differently. But... I am really grateful to you. Your explanation takes away the biggest headache. This is a surprise for me! :) The nastiest case is (or rather - was) a task like this: its suffix is "_2", meaning the last WU try; progress is 98+ percent, but the further it goes, the slower it moves; stderr.txt shows signs of a "zombie"... Now I know what to do - hit it on the head without any hesitation... :) ID: 53339 · Reply Quote

Toggleton Send message Joined: 4 Mar 17 Posts: 46 Credit: 13,036,976 RAC: 644	Message 53342 - Posted: 31 Mar 2026, 18:18:09 UTC - in response to Message 53339. Last modified: 31 Mar 2026, 18:25:40 UTC progress is 98+ percent, but the further it goes, the slower it moves; stderr.txt shows signs of a "zombie"... The progress bar in BOINC Manager is not useful for most LHC tasks. Looking at the tasks you have finished so far, most of them ran for under 10 hours. But Theory tasks can run much longer than that, so you need to check whether they are still writing to the log. I think Windows VirtualBox tasks work with the [Show graphics] button in BOINC Manager, which lets you look at the logs in the browser. I don't think stderr is really useful. On Linux I always check boinc/slot/x/shared/runRivet.log, where it first counts up from "Integrate 1 of 760:" and then, in a second step, reports lines like "99000 events processed" (the number of events in the second step can differ from the usual 100 000). The http://localhost:123456/logs/running.log page from the [Show graphics] button should show the same: ===> [runRivet] Tue Mar 31 07:45:45 UTC 2026 [boinc pp z1j 13000 150 - herwig7 7.2.1 nlo-pw 35000 699] So my task here should process only 35 000 events after the "Integrate 1 of 760" step has finished. Such Herwig tasks can take multiple days (up to 10 days in rare cases). Check first whether they are really zombies or just long runners that are still counting up in the logs. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6251 Here is the Herwig caretaking forum thread. ID: 53342 · Reply Quote
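The log check described above could be scripted roughly as follows. This is a sketch, not a project tool: `parse_runspec` and `seconds_since_last_write` are hypothetical helpers, and the field layout (last five fields are generator, version, tune, events, seed) is an assumption inferred from the two runspec lines quoted in this thread.

```python
import os
import re
import time

def parse_runspec(line):
    """Extract generator, event count and seed from a '[boinc ...]' log header."""
    match = re.search(r"\[boinc ([^\]]+)\]", line)
    if match is None:
        return None
    fields = match.group(1).split()
    if len(fields) < 5:
        return None
    # Assumed layout: ... generator version tune events seed
    generator, version, tune, events, seed = fields[-5:]
    return {"generator": generator, "version": version, "tune": tune,
            "events": int(events), "seed": int(seed)}

def seconds_since_last_write(log_path, now=None):
    """How long ago the log file was last modified, in seconds."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)

header = ("===> [runRivet] Tue Mar 31 07:45:45 UTC 2026 "
          "[boinc pp z1j 13000 150 - herwig7 7.2.1 nlo-pw 35000 699]")
spec = parse_runspec(header)
print(spec["generator"], spec["events"])
```

To use it on a running task, pass the path of that slot's runRivet.log to `seconds_since_last_write`. A log that has not changed for a long time is only a hint that the task may be stuck, since some phases can stay quiet for a while.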

metalius Send message Joined: 3 Oct 06 Posts: 122 Credit: 9,378,126 RAC: 8,193	Message 53343 - Posted: 31 Mar 2026, 18:53:47 UTC - in response to Message 53342. In reply to Toggleton's message of 31 Mar 2026: The progress bar in boincManager is not useful for most LHC tasks... Thank You for the "stethoscope". :) I know about those BM progress bar hallucinations, but I will definitely try Your recommendations. I am making notes. :) Thank You again! ID: 53343 · Reply Quote

metalius Send message Joined: 3 Oct 06 Posts: 122 Credit: 9,378,126 RAC: 8,193	Message 53361 - Posted: 2 Apr 2026, 11:43:56 UTC - in response to Message 53342. Last modified: 2 Apr 2026, 11:50:52 UTC In reply to Toggleton's message of 31 Mar 2026: I think windows virtualbox task work with the [show graphics] button in the boinc manager where you can then look at the logs in the browser. Hello! The "stethoscope" works perfectly. The best way, IMHO, is using "Show graphics" because you can watch the logs almost in real-time. By watching them, I become "smarter" not by the day, but by the hour. I thank the Theory team members who made this tool. The "stethoscope" works for all tasks except POWHEG. Unlike the friendly Sherpa, which always writes "Lean back and enjoy...", POWHEG is arrogant and silent, like David Anderson: "Don't you see that I am calculating? It is not for you – an earthworm – to understand this super-high mathematics that I am running here. And it is not your business how I am doing this and that." :) ID: 53361 · Reply Quote

metalius Send message Joined: 3 Oct 06 Posts: 122 Credit: 9,378,126 RAC: 8,193	Message 53362 - Posted: 2 Apr 2026, 14:01:58 UTC Hello again! I started this thread, so I will keep writing here until I am bored or an admin hits me with a brick... :) I am not sure yet, but... It is about Pythia 8. I needed to reboot my computer. I was sure that I did it safely: I suspended all Theory tasks (all active tasks were Pythia 8). I waited until all VBox processes were closed and gone from the system. Then I rebooted. After the reboot, I started to wake them up one by one to avoid disk I/O or other possible problems. The BOINC manager showed that everything was great – the progress bars continued to grow. Victory? Unfortunately, no. I checked the logs and... I saw that all Pythia 8 tasks heroically started from ZERO. As I said, I am still checking this. But if Pythia 8 has no hard checkpoints, this is a real "hemorrhoid" for the volunteer. If all types of tasks do the same, then [censored]... :) ID: 53362 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,127,458 RAC: 1,498	Message 53364 - Posted: 2 Apr 2026, 15:31:30 UTC - in response to Message 53362. Last modified: 2 Apr 2026, 15:44:56 UTC In reply to metalius's message of 2 Apr 2026: ... I needed to reboot my computer. I was sure that I did it safely: I suspended all Theory tasks (all active tasks were Pythia 8). I waited until all VBox processes were closed and gone from the system. Then I rebooted. How safely? One should suspend the running tasks one after the other. To be sure, the sequence is:
1. In BOINC Manager, under computing preferences, untick "Leave non-GPU tasks in memory while suspended". This forces the VM task to save its state to disk instead of keeping the suspended task in memory.
2. Suspend all tasks that have not started yet.
3. Suspend 1 running task.
4. Open VirtualBox Manager, tab Machines (depending on the VirtualBox version), and wait until the VM of the suspended task is saved to disk. When the VM ends up in the stopped state and not the saved state, the task will start from scratch or even give a computation error after a resume.
Repeat 3 and 4 for each running task. ID: 53364 · Reply Quote
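Step 4 can also be checked from a script instead of the VirtualBox Manager window. The sketch below assumes `VBoxManage` is on the PATH and that `VBoxManage showvminfo <vm> --machinereadable` prints a `VMState="..."` line with values such as `saved`, `running` or `poweroff`. The helper names and the sample output are hypothetical. Depending on how BOINC is installed, its VMs may only be visible to the account that runs the BOINC client.

```python
import subprocess

def parse_vm_state(machinereadable_output):
    """Return the VMState value from 'showvminfo --machinereadable' output."""
    for line in machinereadable_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "VMState":
            return value.strip().strip('"')
    return None

def vm_state(vm_name):
    """Ask VirtualBox for the state of one VM (needs VBoxManage on PATH)."""
    result = subprocess.run(
        ["VBoxManage", "showvminfo", vm_name, "--machinereadable"],
        capture_output=True, text=True, check=True,
    )
    return parse_vm_state(result.stdout)

def safe_to_reboot(state):
    # Per the steps above, only a saved VM resumes where it left off.
    return state == "saved"

# Hypothetical sample output, used here instead of calling VBoxManage.
sample = 'name="example-vm"\nVMState="saved"\nVMStateChangeTime="2026-04-02T15:00:00"\n'
state = parse_vm_state(sample)
print(state, safe_to_reboot(state))
```

Keeping the parsing separate from the `subprocess` call means the logic can be tried out without VirtualBox installed. On a real host, `vm_state` would be called once per suspended task before moving on to the next one.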

metalius Send message Joined: 3 Oct 06 Posts: 122 Credit: 9,378,126 RAC: 8,193	Message 53365 - Posted: 2 Apr 2026, 16:46:24 UTC - in response to Message 53364. In reply to Crystal Pellet's message of 2 Apr 2026: 4. Open VirtualBox Manager -Tab Machines (depending on VirtualBox version) and wait until the VM of the suspended task is saved to disk. When the VM gets the stopped state and not the saved state, the task will start from scratch or even will give a computation error after a resume. Thank you very much. I will try Step 4 – I did not know about this "feature". But I have a bad feeling: the status might show "Stopped" instead of "Saved". This is likely what happened during my reboot, and I will be powerless to do anything... OK... At least 10% less "hemorrhoid" now. That comes from knowing: "Don't stress, don't go crazy. It will not be different..." :) Just for a bit of fun... I see crazy BOINC Manager hallucinations. BM shows 99% progress and is stuck, but the reality in the logs is less than 50%. BM has no idea the task started for the second time. Before I knew about the logs, I killed maybe 5 tasks like this. I thought they were "zombies" in an infinite loop. :) Thanks to the good spirit who taught me how to use the logs! ID: 53365 · Reply Quote

metalius Send message Joined: 3 Oct 06 Posts: 122 Credit: 9,378,126 RAC: 8,193	Message 53509 - Posted: 30 Apr 2026, 20:28:29 UTC https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=240585354 I have found a serious demotivator – "Invalid" marks for results. This always happens in this exact situation: The initial host misses the deadline. The system sends the replicated task to a rescuer host to save it. The late initial host finally finishes the task and uploads it before the rescuer does. Look at the example in the link: the initial host missed the deadline by about a week, but it managed to upload the result just 2 hours before the rescuer finished. Because the late host was slightly faster at the very end, the second result was marked as "Invalid". It is a very demotivating situation. After 6 days of continuous work, and after finishing the task in time, a probably correct result was just thrown into the trash bin. Why are possibly correct results thrown away like this? Here are more examples, as proof that the scenario is always the same: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=240628537 https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=240105688 ID: 53509 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2761 Credit: 307,186,427 RAC: 134,844	Message 53520 - Posted: 5 May 2026, 5:06:53 UTC - in response to Message 53509. Thanks for reporting. This should not happen any more for new Theory WUs since "max # of success tasks" is now set to 3 (was 1). Nonetheless, since a few thousand WUs with the previous setting are already in progress it will take a while (roughly up to a month) to get those done. ID: 53520 · Reply Quote

metalius Send message Joined: 3 Oct 06 Posts: 122 Credit: 9,378,126 RAC: 8,193	Message 53538 - Posted: 6 May 2026, 21:54:35 UTC - in response to Message 53520. In reply to computezrmle's message of 5 May 2026: Thanks for reporting... Thank You for reacting and for changing the setting. :) ID: 53538 · Reply Quote