Message boards : Theory Application : Some Theory tasks on VirtualBox hang Probing /cvmfs/alice.cern.ch...
| Author | Message |
|---|---|
|
Joined: 13 Jan 24 Posts: 48 Credit: 9,502,684 RAC: 17,990 |
Some Theory tasks hang with the last thing on the screen being "Probing /cvmfs/alice.cern.ch... ". Most tasks continue on and eventually exit normally, but some just sit there, never getting the "OK" and so on. The problem tasks don't accumulate any Guest CPU time in VirtualBox after the initial phase; the VM continues to accumulate a few seconds of CPU time per hour, apparently for housekeeping.

Today, all the running.log files in problem tasks visible through the Web application end with

INFO: index summary: size / path

and contain a line

/cvmfs/sft.cern.ch/lcg/releases/LCG_88b/MCGenerators/pythia8/306/x86_64-centos7-gcc62-opt/pythia8env-genser.sh: line 49: python: command not found

All the successful tasks that I checked were in a different tree and were using CPU time while processing events until they finished and exited.

In the past I've let a couple of the problem tasks go until they timed out after 10 days and exited with "Error while computing" status. That seems rather pointless, so I've been aborting any that I notice going nowhere rather than leaving the dog in the manger blocking other work.

Here are a couple of the problem tasks:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=421021497
https://lhcathome.cern.ch/lhcathome/result.php?resultid=421076672

Is there any way to avoid these, or at least kill them off quickly and automatically?
|
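Tasks like these could in principle be spotted from the host. Purely as a sketch (the function name, the slot-directory layout, and the assumption that running.log is readable from the host via the shared directory are all mine, not confirmed by this thread), a scan for the "python: command not found" marker might look like:

```shell
#!/bin/sh
# Hypothetical sketch: scan BOINC slot directories for the error line
# that marked the stalled Theory tasks in this thread. Assumes each
# slot exposes a running.log on the host, which depends on your setup.
find_stalled_slots() {
    base="$1"   # e.g. /var/lib/boinc-client/slots (path is an assumption)
    for log in "$base"/*/running.log; do
        [ -f "$log" ] || continue
        if grep -q "python: command not found" "$log"; then
            dirname "$log"   # print the slot directory of a suspect task
        fi
    done
}
```

A cron job could feed the reported slots into a manual review, or into `boinccmd --task ... abort` if you trust the marker enough; the marker string itself is taken verbatim from the log excerpt above.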
|
Joined: 13 Jan 24 Posts: 48 Credit: 9,502,684 RAC: 17,990 |
All of the problem tasks have a running.log that begins with the same line with the same old timestamp:

===> [runRivet] Wed Apr 2 05:28:25 PM UTC 2025 [boinc pp jets 13000 280 - pythia8 8.306 eetherm 100000 42]

I've killed off a couple more today.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=421143601
https://lhcathome.cern.ch/lhcathome/result.php?resultid=421141211
|
|
Joined: 14 Jan 10 Posts: 1556 Credit: 10,100,748 RAC: 1,717 |
All of the problem tasks have a running.log that begins with the same line with the same old timestamp: ===> [runRivet] Wed Apr 2 05:28:25 PM UTC 2025 [boinc pp jets 13000 280 - pythia8 8.306 eetherm 100000 42] I've killed off a couple more today.

That is very weird. It looks like you're not using the default vdi, when different tasks come with the same job description (a remnant of an old log?). You could consider resetting the LHC project in your BOINC Manager. |
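For anyone running a headless client, the same reset can be done with `boinccmd`, the standard BOINC command-line tool. A minimal sketch (shown as a dry run so nothing is touched by accident; the helper function name is mine, the project URL is the real LHC@home one):

```shell
#!/bin/sh
# Sketch: build the boinccmd invocation that resets a project, instead
# of using the BOINC Manager GUI. Echoed rather than executed, so you
# can inspect it first. Note: a reset re-downloads the project's files,
# including the large vdi image.
reset_project_cmd() {
    url="${1:-https://lhcathome.cern.ch/lhcathome/}"
    echo boinccmd --project "$url" reset
}
```

Remove the `echo` (or pipe the output to `sh`) only once all in-progress tasks have finished, since a reset aborts them.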
|
Joined: 17 Oct 06 Posts: 99 Credit: 65,519,322 RAC: 13,507 |
This happened to me as well but a project reset appears to have fixed it. |
|
Joined: 3 Oct 06 Posts: 116 Credit: 9,194,472 RAC: 5,992 |
Pythia 8... So far, I think I caught and smashed 2 zombie tasks with a brick. There were two symptoms:

1. The log completely stopped moving.
2. In the task's VM activity, I saw that the VMM is using 70% CPU and the Guest only 5% (on a normally working machine, those numbers are 5% and 95%).

Both logs of these zombies (as I believe them to be) were identical, stuck exactly between these lines:

AlmaLinux release 9.6 (Sage Margay)
...
envscript=/cvmfs/sft.cern.ch/lcg/releases/LCG_96/MCGenerators/pythia8/301/x86_64-centos7-gcc8-opt/pythia8env-genser.sh

For now, this is reptiloid language to me: I know absolutely nothing about Alma Linux 9 and "7 cents". And I never drank beer with Mr. Genser... :)
|
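Symptom 1 above (the log completely stops moving) is the easier of the two to check mechanically. A minimal sketch, assuming only that a stalled task stops touching its log file (the function name and the threshold are my inventions, not anything from the thread):

```shell
#!/bin/sh
# Sketch: treat a log file as stalled if it has not been modified for
# longer than a threshold in seconds. Uses GNU stat with a BSD stat
# fallback for the modification time.
log_is_stalled() {
    log="$1"; threshold="$2"
    [ -f "$log" ] || return 1
    now=$(date +%s)
    mtime=$(stat -c %Y "$log" 2>/dev/null || stat -f %m "$log")
    [ $((now - mtime)) -gt "$threshold" ]
}
```

A healthy Theory task writes progress lines regularly, so a threshold of an hour or two should avoid false positives while still catching a VM that has been flat for a day; the right number is a judgment call for your machine.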
|
Joined: 15 Jun 08 Posts: 2753 Credit: 304,085,653 RAC: 113,053 |
In reply to metalius's message of 4 Apr 2026:

Pythia 8...

This is running inside the VMs. Since you can't influence it, it doesn't make sense to dig deeper now.

Instead:

You are running VirtualBox 6.1.34/7.0.6, both of which have been out of maintenance for at least a year. Please consider upgrading to the most recent version, currently 7.2.26. Before you run the upgrade, finish all work in progress and ensure no VM is running or in a saved state.

In addition:

Your computers report only 8/16 GB RAM. Ensure you do not overcommit memory. Whenever your computers suspend/resume a VM, this is an extremely heavy operation. When you notice lots of this in stderr.txt, consider reducing the number of tasks/projects you run concurrently. |
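The overcommit warning above amounts to simple arithmetic: the VMs' combined memory plus some headroom for the host must fit in physical RAM. A back-of-the-envelope sketch (the function name, the 2 GB default headroom, and the per-VM figure in the example are illustrative assumptions; check your tasks' actual VM sizes):

```shell
#!/bin/sh
# Sketch: rough check whether n concurrent VMs of vm_mb each fit in
# host RAM while leaving headroom_mb for the host OS and other work.
fits_in_ram() {
    host_mb="$1"; n_vms="$2"; vm_mb="$3"; headroom_mb="${4:-2048}"
    [ $((n_vms * vm_mb + headroom_mb)) -le "$host_mb" ]
}
```

For example, four VMs at roughly 2.5 GB each plus 2 GB headroom need about 12 GB, which fits in 16 GB but clearly overcommits an 8 GB machine, which matches the advice to run fewer concurrent tasks there.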
|
Joined: 3 Oct 06 Posts: 116 Credit: 9,194,472 RAC: 5,992 |
In reply to computezrmle's message of 4 Apr 2026:

Please consider to upgrade to the most recent version, currently 7.2.26.

Believe it or not, VBox 6 looks more stable to me NOW than 7. 6.1.34 never went crazy on my work PC, which is used for real everyday jobs. At the same time, 7.0.6 loses its mind about once a week and starts mass-generating "Computation Error". And this happens on a PC that does absolutely nothing except processing for BOINC. IMHO, the safest way to not smash my keyboard is to install a combined package. I have BOINC 8.0.2 + VBox in my archive. I will try it, thank you for this idea!
|
|
Joined: 3 Oct 06 Posts: 116 Credit: 9,194,472 RAC: 5,992 |
MadGraph5

These types of tasks are rare, but I see the same scenario on one of my PCs.

1. For about half an hour, there is some activity in the VM: the Guest shows some CPU load, it downloads around 200 MB, and writes about 100 MB to the disk.
2. After that, Guest activity drops to zero, all performance graphs are flat, and the task log file becomes unreachable.

Here is an excerpt from the log:

===> [runRivet] Wed Apr 8 07:03:39 AM UTC 2026 [boinc pp zinclusive 13000 -,-,200 - madgraph5amc 2.7.2.atlas3...]
...
ValueError: unsupported hash type md5
AttributeError : 'module' object has no attribute 'md5'
...
ERROR: missing LHE output file: /scratch/tmp/tmp.UjcDJkl2wh/MG5RUN/Events/run_01/unweighted_events.lhe
...
[2]+ 2535 Running ( $rivetExecString; exit $? ) &
ERROR: fail to run madgraph5amc 2.7.2.atlas3 or Rivet (error exit code)

My preliminary, but not yet firm, conclusion: the simulation never started because the angry Python strangled everything, maybe even itself. :)

My action: brick therapy for that VM.

P.S. For those who are experienced or have real knowledge: please comment, is this "brick therapy" a correct solution?
|
©2026 CERN