Message boards : ATLAS application : Computation Errors

TheClockworkPirate

Joined: 1 Oct 24
Posts: 4
Credit: 153,653
RAC: 0
Message 50682 - Posted: 2 Oct 2024, 8:40:23 UTC

Hi folks,

I'm new to the project and hit a few snags starting out (CVMFS not set up properly, etc.) that caused a few tasks to fail with a computation error. I managed to work out the kinks with a combination of the forums and the messages in the stderr files.

That said, I'm still getting computation errors on tasks like https://lhcathome.cern.ch/lhcathome/result.php?resultid=414674356 and I can't work out whether I've missed something or whether there is an issue with the task / WU.

Can anyone help shed some light?

Thanks!
ID: 50682
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2626
Credit: 266,225,201
RAC: 126,229
Message 50696 - Posted: 4 Oct 2024, 7:50:34 UTC - in response to Message 50682.  

You may ...
1. Make your computers visible to other volunteers to allow them to see the complete picture.
2. Post your BOINC client service unit file (plus the override.conf if you use one).
ID: 50696
TheClockworkPirate

Joined: 1 Oct 24
Posts: 4
Credit: 153,653
RAC: 0
Message 50698 - Posted: 4 Oct 2024, 8:18:42 UTC - in response to Message 50696.  

Sure, the computer should now be visible to everyone: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10860373

Here are the files:
boinc-client.service:
[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
Wants=vboxdrv.service
After=vboxdrv.service network-online.target

[Service]
Type=simple
ProtectHome=true
ProtectSystem=full
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true
#PrivateTmp=true  #Block X11 idle detection

[Install]
WantedBy=multi-user.target


boinc-client.service.d/wayland-syslog-spam.conf:
[Service]
LogFilterPatterns=~no authorization protocol specified



global_prefs_override.xml:
<global_preferences>
    <run_if_user_active>1</run_if_user_active>
    <run_gpu_if_user_active>1</run_gpu_if_user_active>
</global_preferences>


cc_config.xml:
<cc_config>
    <options>
        <report_results_immediately>1</report_results_immediately>
        <use_all_gpus>1</use_all_gpus>
    </options>
</cc_config>
ID: 50698
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2626
Credit: 266,225,201
RAC: 126,229
Message 50699 - Posted: 4 Oct 2024, 8:42:47 UTC - in response to Message 50698.  

Try this

Create an override file containing either this:
[Service]
ProtectHome=false
ProtectControlGroups=false


or this:
[Service]
ProtectHome=false
ProtectSystem=strict
ProtectControlGroups=false
ReadWritePaths=-/tmp



Then
- stop BOINC
- run "sudo systemctl daemon-reload"
- restart BOINC
- run 1 native task (more if that one succeeds)
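
The steps above can be sketched as a small script. This is only an illustration, assuming the drop-in lives at the standard systemd location /etc/systemd/system/boinc-client.service.d/override.conf (the file that "sudo systemctl edit boinc-client.service" creates). The helper names render_override and apply_steps are made up for this example, and the script only prints the file content and the commands instead of running anything as root.

```python
from pathlib import Path

# Assumption: standard systemd drop-in location for the unit named in this thread.
UNIT = "boinc-client.service"
DROPIN = Path("/etc/systemd/system") / f"{UNIT}.d" / "override.conf"

# First variant suggested above.
VARIANT_A = {"ProtectHome": "false", "ProtectControlGroups": "false"}
# Second variant suggested above.
VARIANT_B = {
    "ProtectHome": "false",
    "ProtectSystem": "strict",
    "ProtectControlGroups": "false",
    "ReadWritePaths": "-/tmp",
}


def render_override(settings):
    """Return the text of a [Service] drop-in for the given key/value pairs."""
    lines = ["[Service]"] + [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


def apply_steps(unit=UNIT):
    """List the shell commands for the stop / reload / start sequence."""
    return [
        f"sudo systemctl stop {unit}",
        "sudo systemctl daemon-reload",
        f"sudo systemctl start {unit}",
    ]


if __name__ == "__main__":
    print(f"# content for {DROPIN} (writing it needs root):")
    print(render_override(VARIANT_A))
    for command in apply_steps():
        print(command)
```

Swap VARIANT_A for VARIANT_B to print the second override instead.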
ID: 50699
TheClockworkPirate

Joined: 1 Oct 24
Posts: 4
Credit: 153,653
RAC: 0
Message 50749 - Posted: 9 Oct 2024, 6:50:04 UTC - in response to Message 50699.  

Sorry about the delay in reporting back. Thanks for the suggestions. Unfortunately, neither of the changes / add-ons to the systemd service file made any difference, and ATLAS native tasks still failed with a computation error after ~10 minutes.

For good measure, I double-checked the CVMFS setup and that looks fine.

% cvmfs_config probe
Probing /cvmfs/atlas.cern.ch... OK
Probing /cvmfs/atlas-condb.cern.ch... OK
Probing /cvmfs/grid.cern.ch... OK
Probing /cvmfs/cernvm-prod.cern.ch... OK
Probing /cvmfs/sft.cern.ch... OK
Probing /cvmfs/alice.cern.ch... OK


% cvmfs_config stat

Running /usr/bin/cvmfs_config stat cvmfs-config.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.11.5.0 253329 8 33436 35 3 1 2783123 6144000 0 130560 0 198 100.000 1 1 http://s1cern-cvmfs.openhtc.io/cvmfs/cvmfs-config.cern.ch DIRECT 1

Running /usr/bin/cvmfs_config stat atlas.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.11.5.0 253439 8 38924 138133 3 16 2783123 6144000 0 130560 0 9380 100.000 1 0 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch DIRECT 1

Running /usr/bin/cvmfs_config stat atlas-condb.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.11.5.0 253581 8 33484 13795 3 1 2783123 6144000 0 130560 0 2 100.000 1 1 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas-condb.cern.ch DIRECT 1

Running /usr/bin/cvmfs_config stat sft.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.11.5.0 254069 8 33532 30674 3 1 2783123 6144000 0 130560 0 1 100.000 1 1 http://s1cern-cvmfs.openhtc.io/cvmfs/sft.cern.ch DIRECT 1

Running /usr/bin/cvmfs_config stat grid.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.11.5.0 258058 6 33420 25433 1 1 2783123 6144000 0 130560 0 0 0.000 7847 1646 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 1

Running /usr/bin/cvmfs_config stat cernvm-prod.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.11.5.0 258200 6 33728 272 1 1 2783123 6144000 0 130560 0 0 100.000 1 1 http://s1cern-cvmfs.openhtc.io/cvmfs/cernvm-prod.cern.ch DIRECT 1

Running /usr/bin/cvmfs_config stat alice.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.11.5.0 258418 6 46104 19995 1 1 2783123 6144000 0 523776 0 0 0.000 6156 824 http://s1cern-cvmfs.openhtc.io/cvmfs/alice.cern.ch DIRECT 1
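
The stat output above is hard to read as flat text. Here is a minimal sketch, assuming the three-line layout shown above (a "Running ... stat <repo>:" line, a header line, a value line). It pairs each column name with its value so that NOIOERR and ONLINE can be checked per repository. The function names are hypothetical, and the sample is cut down to two repositories from this post.

```python
# Two of the stat blocks from this post, copied verbatim.
SAMPLE = """
Running /usr/bin/cvmfs_config stat atlas.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.11.5.0 253439 8 38924 138133 3 16 2783123 6144000 0 130560 0 9380 100.000 1 0 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch DIRECT 1

Running /usr/bin/cvmfs_config stat grid.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.11.5.0 258058 6 33420 25433 1 1 2783123 6144000 0 130560 0 0 0.000 7847 1646 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 1
"""


def parse_cvmfs_stat(text):
    """Map each repository name to a dict of column name -> value (all strings)."""
    repos = {}
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for i, line in enumerate(lines):
        if line.startswith("Running ") and line.endswith(":") and i + 2 < len(lines):
            repo = line.rstrip(":").split()[-1]
            header = lines[i + 1].split()
            values = lines[i + 2].split()
            repos[repo] = dict(zip(header, values))
    return repos


def problem_repos(repos):
    """Return repositories that report I/O errors or are not online."""
    return [
        name
        for name, row in repos.items()
        if row.get("NOIOERR") != "0" or row.get("ONLINE") != "1"
    ]


if __name__ == "__main__":
    parsed = parse_cvmfs_stat(SAMPLE)
    for name, row in parsed.items():
        print(name, "revision", row["REVISION"], "io errors", row["NOIOERR"], "online", row["ONLINE"])
    print("repositories needing attention:", problem_repos(parsed))
```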


I did some further digging and I think I've found something in the journal:
Oct 09 07:16:54 mummu boinc[244621]: mv: cannot stat 'slots/3/shared/HITS.pool.root.1': No such file or directory


Looking back, lines like this coincide with ATLAS tasks finishing with a computation error. I'm guessing I initially missed them because the line doesn't contain the [LHC@home] project string.

Any suggestions as to why this HITS.pool.root.1 file can't be found? What is this file for?
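
To find every occurrence rather than spotting them by eye, the journal text can be filtered for that message. This is a rough sketch, assuming the journal has been exported to text first (for example with "journalctl -u boinc-client --no-pager") and that the lines look like the one quoted above. find_missing_hits is a hypothetical helper name.

```python
import re

# The journal line quoted in this post, plus one unrelated line for contrast.
SAMPLE = """
Oct 09 07:16:54 mummu boinc[244621]: mv: cannot stat 'slots/3/shared/HITS.pool.root.1': No such file or directory
Oct 09 07:17:01 mummu boinc[244621]: some unrelated message
"""

# Timestamp, host, boinc[pid], then the mv error with the slot number captured.
PATTERN = re.compile(
    r"^(?P<when>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ boinc\[\d+\]: "
    r"mv: cannot stat 'slots/(?P<slot>\d+)/shared/HITS\.pool\.root\.1'"
)


def find_missing_hits(journal_text):
    """Return (timestamp, slot number) for each 'cannot stat HITS.pool.root.1' line."""
    hits = []
    for line in journal_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            hits.append((match.group("when"), int(match.group("slot"))))
    return hits


if __name__ == "__main__":
    for when, slot in find_missing_hits(SAMPLE):
        print(f"{when}: result file missing in slot {slot}")
```

The timestamps can then be compared with the end times of failed tasks on the host's task list.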
ID: 50749
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2626
Credit: 266,225,201
RAC: 126,229
Message 50750 - Posted: 9 Oct 2024, 8:04:37 UTC - in response to Message 50749.  

Your CVMFS works fine.
Nonetheless, you may purge its local cache by running "cvmfs_config wipecache".

HITS.pool.root.1 is the temporary name of the result file.
Since your tasks fail early, they never create that file, which is why mv cannot stat it.

Just to be sure:
- Check whether your boinc user is a member of the apptainer group.
- Run "cat /proc/sys/user/max_user_namespaces" to check whether (and how many) user namespaces are enabled.
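
Both checks can also be scripted. This is a minimal sketch, assuming a Linux host and the user and group names used in this thread ("boinc" and "apptainer"). The function names are hypothetical. A value of 0 in max_user_namespaces means user namespaces are disabled.

```python
import grp
import pwd
from pathlib import Path

NS_FILE = Path("/proc/sys/user/max_user_namespaces")


def parse_max_user_namespaces(text):
    """Convert the content of max_user_namespaces to an integer."""
    return int(text.strip())


def user_in_group(user, group):
    """Return True/False for membership, or None if the user or group does not exist."""
    try:
        group_entry = grp.getgrnam(group)
        user_entry = pwd.getpwnam(user)
    except KeyError:
        return None
    # A user belongs to a group via the member list or via the primary group ID.
    return user in group_entry.gr_mem or user_entry.pw_gid == group_entry.gr_gid


if __name__ == "__main__":
    if NS_FILE.exists():
        limit = parse_max_user_namespaces(NS_FILE.read_text())
        print("user namespaces:", "enabled" if limit > 0 else "disabled", f"(limit {limit})")
    else:
        print(f"{NS_FILE} not found")
    membership = user_in_group("boinc", "apptainer")
    if membership is None:
        print("user 'boinc' or group 'apptainer' does not exist on this host")
    else:
        print("boinc in apptainer group:", membership)
```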


The command that runs for far too short a time and returns error 64 is this one (no idea why so far):
/usr/bin/time -o /var/lib/boinc/slots/3/S9hMDmSYbI6nsSi4ap6QjLDmwznN0nGgGQJmDC4LDmmdvKDm9ycesn.diag -a -f 'WallTime=%es\nKernelTime=%Ss\nUserTime=%Us\nCPUUsage=%P\nMaxResidentMemory=%MkB\nAverageResidentMemory=%tkB\nAverageTotalMemory=%KkB\nAverageUnsharedMemory=%DkB\nAverageUnsharedStack=%pkB\nAverageSharedMemory=%XkB\nPageSize=%ZB\nMajorPageFaults=%F\nMinorPageFaults=%R\nSwaps=%W\nForcedSwitches=%c\nWaitSwitches=%w\nInputs=%I\nOutputs=%O\nSocketReceived=%r\nSocketSent=%s\nSignals=%k\n' ./runpilot2-wrapper.sh -q BOINC_MCORE -j managed --pilot-user ATLAS --harvester-submit-mode PUSH -w generic --job-type managed --resource-type SCORE_HIMEM --pilotversion 3.8.2.8 -z -t --piloturl local --mute --container
ID: 50750
TheClockworkPirate

Joined: 1 Oct 24
Posts: 4
Credit: 153,653
RAC: 0
Message 50771 - Posted: 10 Oct 2024, 8:17:27 UTC - in response to Message 50750.  

Namespaces are enabled:
% echo "$(</proc/sys/user/max_user_namespaces)"
513573


The boinc user isn't a member of the apptainer group...
% groups boinc
vboxusers vboxsf boinc


but there isn't one in /etc/group:
% grep ap /etc/group
brlapi:x:972:brltty


but this doesn't seem to impact the boinc user running apptainer:
% sudo --user=boinc apptainer version
1.3.4
% sudo --user=boinc apptainer buildcfg
PACKAGE_NAME=apptainer
PACKAGE_VERSION=1.3.4
BUILDDIR=/build/apptainer/src/apptainer/builddir
PREFIX=/usr
EXECPREFIX=/usr
BINDIR=/usr/bin
SBINDIR=/usr/sbin
LIBEXECDIR=/usr/lib
DATAROOTDIR=/usr/share
DATADIR=/usr/share
SYSCONFDIR=/etc
SHAREDSTATEDIR=/usr/com
LOCALSTATEDIR=/var/lib
RUNSTATEDIR=/var/lib/run
INCLUDEDIR=/usr/include
DOCDIR=/usr/share/doc/apptainer
INFODIR=/usr/share/info
LIBDIR=/usr/lib
LOCALEDIR=/usr/share/locale
MANDIR=/usr/share/man
APPTAINER_CONFDIR=/etc/apptainer
SESSIONDIR=/var/lib/apptainer/mnt/session
PLUGIN_ROOTDIR=/usr/lib/apptainer/plugin
APPTAINER_CONF_FILE=/etc/apptainer/apptainer.conf
APPTAINER_SUID_INSTALL=0
ID: 50771
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2626
Credit: 266,225,201
RAC: 126,229
Message 50772 - Posted: 10 Oct 2024, 13:46:36 UTC - in response to Message 50771.  

I'm running out of ideas.
Please do the following tests.

Start an ATLAS task and let it run.

After 1-2 minutes, run the following commands as a normal user, in this order:
ls /cvmfs/cms-ib.cern.ch/
sudo systemctl --no-pager status autofs.service
cvmfs_config showconfig -s cvmfs-config
cvmfs_config showconfig -s cms-ib

As soon as the ATLAS task has finished (failed?), repeat the 2nd command (sudo systemctl ...).

Post all command outputs.
ID: 50772
maeax

Joined: 2 May 07
Posts: 2262
Credit: 175,581,097
RAC: 652
Message 51460 - Posted: 28 Jan 2025, 11:51:22 UTC
Last modified: 28 Jan 2025, 12:17:28 UTC

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
couldn't start app: Task file vboxwrapper_26206_windows_x86_64.exe: file missing</message>
]]>
This Threadripper is running Atlas:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10797673
ID: 51460
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1444
Credit: 9,704,984
RAC: 918
Message 51461 - Posted: 28 Jan 2025, 12:32:17 UTC - in response to Message 51460.  
Last modified: 28 Jan 2025, 14:13:37 UTC

You probably deleted vboxwrapper_26206_windows_x86_64.exe yourself, because v26207 and v26208 are also present in the projects directory.
v26206 is still used by ATLAS. The other two vboxwrappers were/are used by Theory and CMS.
ID: 51461
maeax

Joined: 2 May 07
Posts: 2262
Credit: 175,581,097
RAC: 652
Message 51462 - Posted: 28 Jan 2025, 13:21:19 UTC - in response to Message 51461.  

Thank you, Crystal.
I'll look at this PC tomorrow.
Today was Patch Day on Windows for me.
ID: 51462
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1444
Credit: 9,704,984
RAC: 918
Message 51463 - Posted: 28 Jan 2025, 14:19:28 UTC - in response to Message 51462.  
Last modified: 28 Jan 2025, 14:23:13 UTC

I was wrong again. I must have looked at your results.

From client_state.xml
<app_version>
<app_name>ATLAS</app_name>
<version_num>301</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>4.000000</avg_ncpus>
<flops>9888112149.654190</flops>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<api_version>7.21.0</api_version>
<cmdline>--memory_size_mb 4800 --nthreads 4</cmdline>
<file_ref>
<file_name>vboxwrapper_26206_windows_x86_64.exe</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>ATLAS_vbox_3.01_job.xml</file_name>
<open_name>vbox_job.xml</open_name>
</file_ref>
<file_ref>
<file_name>ATLAS_vbox_3.01_image.vdi</file_name>
</file_ref>
<dont_throttle/>
<is_wrapper/>
<needs_network/>
</app_version>

If you don't want to reset the project, you could download the right exe: http://lhcathome-upload.cern.ch/lhcathome/download//vboxwrapper_26206_windows_x86_64.exe
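
To see which files an app_version entry expects and whether they are actually on disk, the XML above can be checked against the project directory. This is a rough sketch. The project directory path is whatever your BOINC installation uses and is passed in by the caller, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# The app_version block quoted above from client_state.xml.
APP_VERSION_XML = """<app_version>
<app_name>ATLAS</app_name>
<version_num>301</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>4.000000</avg_ncpus>
<flops>9888112149.654190</flops>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<api_version>7.21.0</api_version>
<cmdline>--memory_size_mb 4800 --nthreads 4</cmdline>
<file_ref>
<file_name>vboxwrapper_26206_windows_x86_64.exe</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>ATLAS_vbox_3.01_job.xml</file_name>
<open_name>vbox_job.xml</open_name>
</file_ref>
<file_ref>
<file_name>ATLAS_vbox_3.01_image.vdi</file_name>
</file_ref>
<dont_throttle/>
<is_wrapper/>
<needs_network/>
</app_version>"""


def referenced_files(app_version_xml):
    """Return the file names listed in the <file_ref> elements."""
    root = ET.fromstring(app_version_xml)
    return [ref.findtext("file_name") for ref in root.findall("file_ref")]


def missing_files(app_version_xml, project_dir):
    """Return the referenced files that do not exist in project_dir."""
    project_dir = Path(project_dir)
    return [
        name
        for name in referenced_files(app_version_xml)
        if not (project_dir / name).is_file()
    ]


if __name__ == "__main__":
    for name in referenced_files(APP_VERSION_XML):
        print("expects:", name)
```

Calling missing_files(APP_VERSION_XML, "<your project directory>") lists what would have to be downloaded again.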
ID: 51463
maeax

Joined: 2 May 07
Posts: 2262
Credit: 175,581,097
RAC: 652
Message 51464 - Posted: 28 Jan 2025, 14:28:14 UTC - in response to Message 51463.  
Last modified: 28 Jan 2025, 14:31:26 UTC

I have reset the project and
see a timestamp difference of 20 minutes for vboxwrapper 206 between these two Threadrippers.
I will reboot the second PC, but it has a lot of WCG and E@H work at the moment.
Thanks for the link to the 206 wrapper.

One ATLAS task from this morning is finishing at the moment:
2025-01-28 15:25:28 (14024): Guest Log: Looking for outputfile HITS.43078907._001276.pool.root.1
2025-01-28 15:25:28 (14024): Guest Log: HITS file was successfully produced
ID: 51464
Erich56

Joined: 18 Dec 15
Posts: 1862
Credit: 130,069,106
RAC: 112,080
Message 51706 - Posted: 17 Mar 2025, 15:11:46 UTC

Today, all ATLAS tasks received by several of my hosts errored out after about 5 minutes, with stderr showing:

"pilotErrorDiag": "Failed to execute payload:/bin/bash: Sim_tf.py: command not found"

For complete info, see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=420216659
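
For anyone checking many results for the same failure, the diagnostic can be pulled out of the stderr text automatically. This is a small sketch, assuming the stderr contains the "pilotErrorDiag" key in the form quoted above. The function names are hypothetical.

```python
import re

# The stderr fragment quoted in this post.
SAMPLE = '"pilotErrorDiag": "Failed to execute payload:/bin/bash: Sim_tf.py: command not found"'

DIAG_PATTERN = re.compile(r'"pilotErrorDiag":\s*"((?:[^"\\]|\\.)*)"')
MISSING_PATTERN = re.compile(r"(\S+): command not found")


def extract_pilot_error(stderr_text):
    """Return the pilotErrorDiag message, or None if the key is absent."""
    match = DIAG_PATTERN.search(stderr_text)
    return match.group(1) if match else None


def missing_command(diag):
    """Return the command name from a 'command not found' diagnostic, or None."""
    match = MISSING_PATTERN.search(diag or "")
    return match.group(1) if match else None


if __name__ == "__main__":
    diag = extract_pilot_error(SAMPLE)
    print("diagnostic:", diag)
    print("missing command:", missing_command(diag))
```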
ID: 51706
Erich56

Joined: 18 Dec 15
Posts: 1862
Credit: 130,069,106
RAC: 112,080
Message 51749 - Posted: 24 Mar 2025, 8:43:56 UTC - in response to Message 51706.  

Same problem with some of the tasks my hosts downloaded today.
No idea why these faulty tasks are still being sent out to us.
ID: 51749
