Message boards : ATLAS application : Computation Errors
Author | Message |
---|---|
Joined: 1 Oct 24 Posts: 4 Credit: 580,422 RAC: 4,561 |
Hi folks, I'm new to the project and hit a few snags getting started (CVMFS not set up properly, etc.) that caused a few tasks to fail with a computation error. I managed to work out the kinks with a combination of the forums and the messages in the stderr files. That said, I'm still getting computation errors on tasks like https://lhcathome.cern.ch/lhcathome/result.php?resultid=414674356 and I can't work out whether I've missed something or whether there is an issue with the task / WU. Can anyone help shed some light? Thanks! |
Joined: 15 Jun 08 Posts: 2683 Credit: 286,887,455 RAC: 54,539 |
You may ...
1. Make your computers visible to other volunteers so they can see the complete picture.
2. Post your boinc client service unit file (plus the override.conf if you use one). |
Joined: 1 Oct 24 Posts: 4 Credit: 580,422 RAC: 4,561 |
Sure, the computer should now be visible to everyone: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10860373

Here are the files:

boinc-client.service:

    [Unit]
    Description=Berkeley Open Infrastructure Network Computing Client
    Documentation=man:boinc(1)
    Wants=vboxdrv.service
    After=vboxdrv.service network-online.target

    [Service]
    Type=simple
    ProtectHome=true
    ProtectSystem=full
    ProtectControlGroups=true
    ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
    Nice=10
    User=boinc
    WorkingDirectory=/var/lib/boinc
    ExecStart=/usr/bin/boinc
    ExecStop=/usr/bin/boinccmd --quit
    ExecReload=/usr/bin/boinccmd --read_cc_config
    ExecStopPost=/bin/rm -f lockfile
    IOSchedulingClass=idle
    # The following options prevent setuid root as they imply NoNewPrivileges=true
    # Since Atlas requires setuid root, they break Atlas
    # In order to improve security, if you're not using Atlas,
    # Add these options to the [Service] section of an override file using
    # sudo systemctl edit boinc-client.service
    #NoNewPrivileges=true
    #ProtectKernelModules=true
    #ProtectKernelTunables=true
    #RestrictRealtime=true
    #RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
    #RestrictNamespaces=true
    #PrivateUsers=true
    #CapabilityBoundingSet=
    #MemoryDenyWriteExecute=true
    #PrivateTmp=true #Block X11 idle detection

    [Install]
    WantedBy=multi-user.target

boinc-client.service.d/wayland-syslog-spam.conf:

    [Service]
    LogFilterPatterns=~no authorization protocol specified

global_prefs_override.xml:

    <global_preferences>
      <run_if_user_active>1</run_if_user_active>
      <run_gpu_if_user_active>1</run_gpu_if_user_active>
    </global_preferences>

cc_config.xml:

    <cc_config>
      <options>
        <report_results_immediately>1</report_results_immediately>
        <use_all_gpus>1</use_all_gpus>
      </options>
    </cc_config> |
Joined: 15 Jun 08 Posts: 2683 Credit: 286,887,455 RAC: 54,539 |
Try this. Create an override file containing either this:

    [Service]
    ProtectHome=false
    ProtectControlGroups=false

or this:

    [Service]
    ProtectHome=false
    ProtectSystem=strict
    ProtectControlGroups=false
    ReadWritePaths=-/tmp

Then
- stop BOINC
- run "sudo systemctl daemon-reload"
- restart BOINC
- run 1 native task (more if that one succeeds) |
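For anyone who prefers to generate the two override variants rather than type them, here is a minimal Python sketch. It is not part of the advice above: `render_override` and `dropin_path` are hypothetical helper names, and the path assumes the usual systemd drop-in layout that `systemctl edit` uses (/etc/systemd/system/<unit>.d/override.conf). The sketch only builds the text and the path; writing the file and running daemon-reload still need root.

```python
from pathlib import PurePosixPath


def render_override(options: dict[str, str]) -> str:
    """Build the text of a systemd drop-in with a single [Service] section."""
    lines = ["[Service]"]
    for key, value in options.items():
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def dropin_path(unit: str, name: str = "override.conf") -> PurePosixPath:
    """Return the drop-in path conventionally used by `systemctl edit <unit>`."""
    return PurePosixPath("/etc/systemd/system") / f"{unit}.d" / name


# First variant from the post above.
MINIMAL = {"ProtectHome": "false", "ProtectControlGroups": "false"}

# Second variant from the post above.
STRICT = {
    "ProtectHome": "false",
    "ProtectSystem": "strict",
    "ProtectControlGroups": "false",
    "ReadWritePaths": "-/tmp",
}

if __name__ == "__main__":
    print(dropin_path("boinc-client.service"))
    print(render_override(STRICT), end="")
```

Only one of the two variants should be active at a time, so it seems simplest to keep a single override.conf and swap its contents between tests.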
Joined: 1 Oct 24 Posts: 4 Credit: 580,422 RAC: 4,561 |
Sorry about the delay in reporting back. Thanks for the suggestions; unfortunately neither of the changes / add-ons to the systemd service file made any difference, and ATLAS native tasks still failed with a computation error after ~10 minutes. For good measure, I double-checked the CVMFS setup and that looks fine.

    % cvmfs_config probe
    Probing /cvmfs/atlas.cern.ch... OK
    Probing /cvmfs/atlas-condb.cern.ch... OK
    Probing /cvmfs/grid.cern.ch... OK
    Probing /cvmfs/cernvm-prod.cern.ch... OK
    Probing /cvmfs/sft.cern.ch... OK
    Probing /cvmfs/alice.cern.ch... OK

    % cvmfs_config stat

    Running /usr/bin/cvmfs_config stat cvmfs-config.cern.ch:
    VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
    2.11.5.0 253329 8 33436 35 3 1 2783123 6144000 0 130560 0 198 100.000 1 1 http://s1cern-cvmfs.openhtc.io/cvmfs/cvmfs-config.cern.ch DIRECT 1

    Running /usr/bin/cvmfs_config stat atlas.cern.ch:
    VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
    2.11.5.0 253439 8 38924 138133 3 16 2783123 6144000 0 130560 0 9380 100.000 1 0 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch DIRECT 1

    Running /usr/bin/cvmfs_config stat atlas-condb.cern.ch:
    VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
    2.11.5.0 253581 8 33484 13795 3 1 2783123 6144000 0 130560 0 2 100.000 1 1 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas-condb.cern.ch DIRECT 1

    Running /usr/bin/cvmfs_config stat sft.cern.ch:
    VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
    2.11.5.0 254069 8 33532 30674 3 1 2783123 6144000 0 130560 0 1 100.000 1 1 http://s1cern-cvmfs.openhtc.io/cvmfs/sft.cern.ch DIRECT 1

    Running /usr/bin/cvmfs_config stat grid.cern.ch:
    VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
    2.11.5.0 258058 6 33420 25433 1 1 2783123 6144000 0 130560 0 0 0.000 7847 1646 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 1

    Running /usr/bin/cvmfs_config stat cernvm-prod.cern.ch:
    VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
    2.11.5.0 258200 6 33728 272 1 1 2783123 6144000 0 130560 0 0 100.000 1 1 http://s1cern-cvmfs.openhtc.io/cvmfs/cernvm-prod.cern.ch DIRECT 1

    Running /usr/bin/cvmfs_config stat alice.cern.ch:
    VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
    2.11.5.0 258418 6 46104 19995 1 1 2783123 6144000 0 523776 0 0 0.000 6156 824 http://s1cern-cvmfs.openhtc.io/cvmfs/alice.cern.ch DIRECT 1

I did some further digging and I think I've found something in the journal:

    Oct 09 07:16:54 mummu boinc[244621]: mv: cannot stat 'slots/3/shared/HITS.pool.root.1': No such file or directory

Looking back, lines like this coincide with ATLAS tasks finishing with a computation error. I'm guessing I initially missed it because the line doesn't contain the [LHC@home] project string. Any suggestions as to why this HITS.pool.root.1 file can't be found? What is this file for? |
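Since the journal line above is easy to miss, here is a small Python sketch that picks such lines out of journal text. It is only an illustration: the pattern is modelled on the single line quoted above, `find_missing_hits` is a hypothetical helper name, and other journal output formats may need a different timestamp pattern.

```python
import re

# Modelled on the one journal line quoted above; the slot number varies per task.
MISSING_HITS = re.compile(
    r"^(?P<stamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) boinc\[(?P<pid>\d+)\]: "
    r"mv: cannot stat 'slots/(?P<slot>\d+)/shared/HITS\.pool\.root\.1'"
)


def find_missing_hits(journal_text: str) -> list[dict[str, str]]:
    """Return one dict per journal line that reports a missing HITS file."""
    hits = []
    for line in journal_text.splitlines():
        match = MISSING_HITS.match(line.strip())
        if match:
            hits.append(match.groupdict())
    return hits


SAMPLE = (
    "Oct 09 07:16:54 mummu boinc[244621]: mv: cannot stat "
    "'slots/3/shared/HITS.pool.root.1': No such file or directory\n"
)

if __name__ == "__main__":
    for entry in find_missing_hits(SAMPLE):
        print(entry["stamp"], "slot", entry["slot"])
```

The input could come from something like `journalctl -u boinc-client --no-pager` saved to a file; the timestamps can then be compared with the times at which the failed tasks were reported.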
Joined: 15 Jun 08 Posts: 2683 Credit: 286,887,455 RAC: 54,539 |
Your CVMFS works fine. Nonetheless you may purge its local cache by running "cvmfs_config wipecache". HITS.pool.root.1 is the temporary name of the result file. Since your tasks fail early they don't create that file, which causes stat to fail.

Just to be sure:
- check if your boinc user is a member of the apptainer group.
- run "cat /proc/sys/user/max_user_namespaces" to check if (and how many) user namespaces are enabled.

The command that has a far too short runtime and returns error 64 is this (so far no idea why):

    /usr/bin/time -o /var/lib/boinc/slots/3/S9hMDmSYbI6nsSi4ap6QjLDmwznN0nGgGQJmDC4LDmmdvKDm9ycesn.diag -a -f 'WallTime=%es\nKernelTime=%Ss\nUserTime=%Us\nCPUUsage=%P\nMaxResidentMemory=%MkB\nAverageResidentMemory=%tkB\nAverageTotalMemory=%KkB\nAverageUnsharedMemory=%DkB\nAverageUnsharedStack=%pkB\nAverageSharedMemory=%XkB\nPageSize=%ZB\nMajorPageFaults=%F\nMinorPageFaults=%R\nSwaps=%W\nForcedSwitches=%c\nWaitSwitches=%w\nInputs=%I\nOutputs=%O\nSocketReceived=%r\nSocketSent=%s\nSignals=%k\n' ./runpilot2-wrapper.sh -q BOINC_MCORE -j managed --pilot-user ATLAS --harvester-submit-mode PUSH -w generic --job-type managed --resource-type SCORE_HIMEM --pilotversion 3.8.2.8 -z -t --piloturl local --mute --container |
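The two checks above (group membership and the user namespace limit) can also be sketched in Python. This is a rough, Linux-only illustration: `max_user_namespaces` and `supplementary_groups` are hypothetical helper names, and the user name `boinc` and group name `apptainer` are simply taken from this thread.

```python
import grp
from pathlib import Path


def max_user_namespaces(path: str = "/proc/sys/user/max_user_namespaces") -> int:
    """Read the kernel limit on user namespaces; 0 means they are disabled."""
    return int(Path(path).read_text().strip())


def supplementary_groups(user: str) -> list[str]:
    """List groups that name `user` as a member in the group database.

    The user's primary group is not included, because it is recorded in the
    passwd database rather than in the group's member list.
    """
    return sorted(g.gr_name for g in grp.getgrall() if user in g.gr_mem)


if __name__ == "__main__":
    try:
        limit = max_user_namespaces()
    except FileNotFoundError:
        limit = 0  # file absent: treat user namespaces as unavailable
    state = "enabled" if limit > 0 else "disabled"
    print(f"user namespaces: {state} (max {limit})")
    print("boinc in apptainer group:", "apptainer" in supplementary_groups("boinc"))
```

A missing apptainer group is not necessarily a problem; some installations never create one, so the result is only a hint.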
Joined: 1 Oct 24 Posts: 4 Credit: 580,422 RAC: 4,561 |
Namespaces are enabled:

    % echo "$(</proc/sys/user/max_user_namespaces)"
    513573

The boinc user isn't a member of the apptainer group...

    % groups boinc
    vboxusers vboxsf boinc

but there isn't one in /etc/group:

    % grep ap /etc/group
    brlapi:x:972:brltty

but this doesn't seem to impact the boinc user running apptainer:

    % sudo --user=boinc apptainer version
    1.3.4
    % sudo --user=boinc apptainer buildcfg
    PACKAGE_NAME=apptainer
    PACKAGE_VERSION=1.3.4
    BUILDDIR=/build/apptainer/src/apptainer/builddir
    PREFIX=/usr
    EXECPREFIX=/usr
    BINDIR=/usr/bin
    SBINDIR=/usr/sbin
    LIBEXECDIR=/usr/lib
    DATAROOTDIR=/usr/share
    DATADIR=/usr/share
    SYSCONFDIR=/etc
    SHAREDSTATEDIR=/usr/com
    LOCALSTATEDIR=/var/lib
    RUNSTATEDIR=/var/lib/run
    INCLUDEDIR=/usr/include
    DOCDIR=/usr/share/doc/apptainer
    INFODIR=/usr/share/info
    LIBDIR=/usr/lib
    LOCALEDIR=/usr/share/locale
    MANDIR=/usr/share/man
    APPTAINER_CONFDIR=/etc/apptainer
    SESSIONDIR=/var/lib/apptainer/mnt/session
    PLUGIN_ROOTDIR=/usr/lib/apptainer/plugin
    APPTAINER_CONF_FILE=/etc/apptainer/apptainer.conf
    APPTAINER_SUID_INSTALL=0 |
Joined: 15 Jun 08 Posts: 2683 Credit: 286,887,455 RAC: 54,539 |
I'm running out of ideas. Please do the following tests. Start an ATLAS task and let it run. After 1-2 min run the following commands as normal user in this order:

    ls /cvmfs/cms-ib.cern.ch/
    sudo systemctl --no-pager status autofs.service
    cvmfs_config showconfig -s cvmfs-config
    cvmfs_config showconfig -s cms-ib

As soon as the ATLAS task has finished (failed?) repeat the 2nd command (sudo systemctl ...). Post all command outputs. |
Joined: 2 May 07 Posts: 2277 Credit: 178,709,076 RAC: 100,489 |
    <core_client_version>8.0.2</core_client_version>
    <![CDATA[
    <message>
    couldn't start app: Task file vboxwrapper_26206_windows_x86_64.exe: file missing</message>
    ]]>

This Threadripper is running Atlas: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10797673 |
Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,193 RAC: 2,531 |
You probably deleted vboxwrapper_26206_windows_x86_64.exe yourself, because v26207 and v26208 are also present in the projects directory. V26206 is still used by ATLAS. The other 2 vboxwrappers were/are used by Theory and CMS. |
Joined: 2 May 07 Posts: 2277 Credit: 178,709,076 RAC: 100,489 |
Thank you Crystal, I'll look into this PC tomorrow. Today was Windows Patch Day for me. |
Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,193 RAC: 2,531 |
Thank you Crystal, I was wrong again. I must have looked at your results. From client_state.xml:

    <app_version>
        <app_name>ATLAS</app_name>
        <version_num>301</version_num>
        <platform>windows_x86_64</platform>
        <avg_ncpus>4.000000</avg_ncpus>
        <flops>9888112149.654190</flops>
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <api_version>7.21.0</api_version>
        <cmdline>--memory_size_mb 4800 --nthreads 4</cmdline>
        <file_ref>
            <file_name>vboxwrapper_26206_windows_x86_64.exe</file_name>
            <main_program/>
        </file_ref>
        <file_ref>
            <file_name>ATLAS_vbox_3.01_job.xml</file_name>
            <open_name>vbox_job.xml</open_name>
        </file_ref>
        <file_ref>
            <file_name>ATLAS_vbox_3.01_image.vdi</file_name>
        </file_ref>
        <dont_throttle/>
        <is_wrapper/>
        <needs_network/>
    </app_version>

If you don't want to reset the project, you could download the right exe: http://lhcathome-upload.cern.ch/lhcathome/download//vboxwrapper_26206_windows_x86_64.exe |
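To see which files an `<app_version>` entry like the one above expects, and which of them are missing from the project directory, something along these lines might help. It is a sketch only: `referenced_files` and `missing_files` are hypothetical helper names, the sample is a shortened copy of the entry above, and it assumes the fragment parses as well-formed XML, which a full client_state.xml is not guaranteed to do.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def referenced_files(app_version_xml: str) -> list[str]:
    """Return every <file_name> listed under a <file_ref> of an <app_version>."""
    root = ET.fromstring(app_version_xml.strip())
    return [ref.findtext("file_name", "").strip() for ref in root.iter("file_ref")]


def missing_files(names: list[str], project_dir: Path) -> list[str]:
    """Return the names that do not exist as regular files in project_dir."""
    return [name for name in names if not (project_dir / name).is_file()]


# Shortened copy of the <app_version> entry quoted above.
SAMPLE = """
<app_version>
    <app_name>ATLAS</app_name>
    <file_ref>
        <file_name>vboxwrapper_26206_windows_x86_64.exe</file_name>
        <main_program/>
    </file_ref>
    <file_ref>
        <file_name>ATLAS_vbox_3.01_job.xml</file_name>
        <open_name>vbox_job.xml</open_name>
    </file_ref>
    <file_ref>
        <file_name>ATLAS_vbox_3.01_image.vdi</file_name>
    </file_ref>
</app_version>
"""

if __name__ == "__main__":
    for name in referenced_files(SAMPLE):
        print(name)
```

Passing the result to `missing_files` together with the path of the LHC@home project directory (which differs per installation) would list files such as a deleted vboxwrapper.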
Joined: 2 May 07 Posts: 2277 Credit: 178,709,076 RAC: 100,489 |
I reset the project and see a timestamp difference of 20 minutes for vboxwrapper 206 on these two Threadrippers. I will reboot the second PC, but it has a lot of WCG and E@H work at the moment. Thanks for the link to the 206 wrapper. One ATLAS task from this morning is finishing now:

    2025-01-28 15:25:28 (14024): Guest Log: Looking for outputfile HITS.43078907._001276.pool.root.1
    2025-01-28 15:25:28 (14024): Guest Log: HITS file was successfully produced |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,948,283 RAC: 82,341 |
Today, all ATLAS tasks received by several of my hosts errored out after about 5 minutes, with stderr showing "pilotErrorDiag": "Failed to execute payload:/bin/bash: Sim_tf.py: command not found". For complete info see: https://lhcathome.cern.ch/lhcathome/result.php?resultid=420216659 |
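When many tasks fail the same way, pulling the "pilotErrorDiag" value out of each stderr text makes them easier to compare. A minimal Python sketch follows; `pilot_error_diags` is a hypothetical helper name and the pattern is based only on the fragment quoted above, so other stderr layouts may not match.

```python
import re

# Captures the quoted value that follows a "pilotErrorDiag" key,
# allowing backslash-escaped characters inside the value.
PILOT_DIAG = re.compile(r'"pilotErrorDiag":\s*"((?:[^"\\]|\\.)*)"')


def pilot_error_diags(stderr_text: str) -> list[str]:
    """Return all non-empty pilotErrorDiag values found in the text."""
    return [value for value in PILOT_DIAG.findall(stderr_text) if value]


SAMPLE = (
    '"pilotErrorDiag": "Failed to execute payload:/bin/bash: '
    'Sim_tf.py: command not found"'
)

if __name__ == "__main__":
    for diag in pilot_error_diags(SAMPLE):
        print(diag)
```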
Joined: 18 Dec 15 Posts: 1908 Credit: 144,948,283 RAC: 82,341 |
Same problem with some of the tasks my hosts downloaded today. No idea why these faulty tasks are still being sent out to us. |
©2025 CERN