All new tasks failing after about 5 minutes
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1908 Credit: 144,952,270 RAC: 81,308 |
Within the past 3 hours, all newly downloaded tasks have failed after about 5 minutes. See here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=332028487 https://lhcathome.cern.ch/lhcathome/result.php?resultid=332028957 https://lhcathome.cern.ch/lhcathome/result.php?resultid=332036708 Any idea what's going on? |
Joined: 28 Sep 04 Posts: 780 Credit: 60,006,451 RAC: 47,123 |
I have one of those too. Yours and mine seem to fail with:
2021-11-08 11:48:55 (19628): Guest Log: 2021-11-08 09:48:54,869 [wrapper] local tarball pilot2.tar.gz exists OK
2021-11-08 11:48:55 (19628): Guest Log: gzip: stdin: unexpected end of file
2021-11-08 11:48:55 (19628): Guest Log: tar: Skipping to next header
2021-11-08 11:48:55 (19628): Guest Log: tar: Child returned status 1
2021-11-08 11:48:55 (19628): Guest Log: tar: Error is not recoverable: exiting now
2021-11-08 11:48:55 (19628): Guest Log: 2021-11-08 09:48:54,894 [wrapper] ERROR: pilot extraction failed for pilot2.tar.gz
2021-11-08 11:48:55 (19628): Guest Log: 2021-11-08 09:48:54,895 [wrapper] ERROR: pilot extraction failed for pilot2.tar.gz
2021-11-08 11:48:55 (19628): Guest Log: 2021-11-08 09:48:54,896 [wrapper] FATAL: failed to get pilot code
2021-11-08 11:48:55 (19628): Guest Log: 2021-11-08 09:48:54,897 [wrapper] FATAL: failed to get pilot code |
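The `gzip: stdin: unexpected end of file` line is what gzip reports when a compressed stream stops early, so the `pilot2.tar.gz` delivered with these tasks was probably truncated. As a rough sketch only (this is not part of the ATLAS wrapper, and the function name `tarball_is_complete` is made up for illustration), a tarball can be checked in Python by reading every member to the end:

```python
import tarfile
import zlib


def tarball_is_complete(path):
    """Return True if the gzip-compressed tar archive at `path` can be read to the end.

    Hypothetical helper: it reads every regular file in the archive so that a
    truncated or corrupted gzip stream raises an error instead of going unnoticed.
    """
    try:
        with tarfile.open(path, mode="r:gz") as archive:
            for member in archive:
                if member.isfile():
                    extracted = archive.extractfile(member)
                    # Read in 1 MiB chunks until the member is exhausted.
                    while extracted.read(1024 * 1024):
                        pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        # tarfile/gzip raise one of these when the stream ends early or is damaged.
        return False
```

Calling `tarball_is_complete("pilot2.tar.gz")` on a copy of the file from a failed task's slot directory would show whether the archive itself is damaged; where exactly that file lives on a given host is not shown in the logs above.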
Joined: 18 Dec 15 Posts: 1908 Credit: 144,952,270 RAC: 81,308 |
Thanks, Harri, for the information. So at least I know that the problem does not seem to be on my system. |
Joined: 2 May 07 Posts: 2277 Credit: 178,709,076 RAC: 100,489 |
Native ATLAS also has some tasks with errors: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=175055033 |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,952,270 RAC: 81,308 |
"Native ATLAS also has some tasks with errors:" Okay, seems like a faulty batch :-( |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,952,270 RAC: 81,308 |
the next few ones seem to be okay so far |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,952,270 RAC: 81,308 |
well, there were some more during this evening :-( |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,952,270 RAC: 81,308 |
these faulty tasks are still coming in, like here about 1 hour ago: https://lhcathome.cern.ch/lhcathome/result.php?resultid=332106332 |
Joined: 2 May 07 Posts: 2277 Credit: 178,709,076 RAC: 100,489 |
I also had one this morning: https://lhcathome.cern.ch/lhcathome/result.php?resultid=332081571
2021-11-09 10:10:31 (3516): Guest Log: 2021-11-09 09:10:30,186 [wrapper] Using piloturl: local
2021-11-09 10:10:31 (3516): Guest Log: 2021-11-09 09:10:30,190 [wrapper] piloturl=local so download not needed
2021-11-09 10:10:31 (3516): Guest Log: 2021-11-09 09:10:30,192 [wrapper] local tarball pilot2.tar.gz exists OK
2021-11-09 10:10:31 (3516): Guest Log: gzip: stdin: unexpected end of file
2021-11-09 10:10:31 (3516): Guest Log: tar: Skipping to next header
2021-11-09 10:10:31 (3516): Guest Log: tar: Child returned status 1 |
Joined: 23 Sep 11 Posts: 2 Credit: 14,365,711 RAC: 0 |
My ATLAS tasks are also failing after 10 minutes ± 40 seconds. CMS vbox tasks complete on the same host.
System setup: Windows 10 (recent install), current BOINC version with VirtualBox installed, 12 CPUs, 32 GB memory, twin 1050 Ti cards.
One example of a failed task is below:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=332861223
Looking at the log, it fails almost immediately, just 4 seconds after starting, so I'm not sure where those 10 minutes come from.
[2021-11-14 15:19:18] Running command: /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /var/lib/boinc-client/slots/16 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 sh start_atlas.sh
[2021-11-14 15:19:18] Job failed
[2021-11-14 15:19:18] FATAL: container creation failed: hook function for tag prelayer returns error: failed to create /var/lib/alternatives directory: mkdir /var/lib/alternatives: read-only file system
[2021-11-14 15:19:18] ./runtime_log.err
[2021-11-14 15:19:18] ./runtime_log
Any ideas where I should look to fix this?
C
Full log below.
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
15:19:14 (64254): wrapper (7.7.26015): starting
15:19:14 (64254): wrapper: running run_atlas (--nthreads 8)
[2021-11-14 15:19:14] Arguments: --nthreads 8
[2021-11-14 15:19:14] Threads: 8
[2021-11-14 15:19:14] Checking for CVMFS
[2021-11-14 15:19:15] Probing /cvmfs/atlas.cern.ch... OK
[2021-11-14 15:19:15] Probing /cvmfs/atlas-condb.cern.ch... OK
[2021-11-14 15:19:15] Running cvmfs_config stat atlas.cern.ch
[2021-11-14 15:19:15] VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
[2021-11-14 15:19:15] 2.8.2.0 64387 0 24772 95926 4 1 2997157 4096000 0 130560 0 0 100.000 0 0 http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/atlas.cern.ch DIRECT 1
[2021-11-14 15:19:15] CVMFS is ok
[2021-11-14 15:19:15] Efficiency of ATLAS tasks can be improved by the following measure(s):
[2021-11-14 15:19:15] The CVMFS client on this computer should be configured to use Cloudflare's openhtc.io.
[2021-11-14 15:19:15] Small home clusters do not require a local http proxy but it is suggested if
[2021-11-14 15:19:15] more than 10 cores throughout the same LAN segment are regularly running ATLAS like tasks.
[2021-11-14 15:19:15] Further information can be found at the LHC@home message board.
[2021-11-14 15:19:15] Using singularity image /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7
[2021-11-14 15:19:15] Checking for singularity binary...
[2021-11-14 15:19:15] Singularity is not installed, using version from CVMFS
[2021-11-14 15:19:15] Checking singularity works with /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 hostname
[2021-11-14 15:19:16] thor
[2021-11-14 15:19:16] Singularity works
[2021-11-14 15:19:18] Set ATHENA_PROC_NUMBER=8
[2021-11-14 15:19:18] Starting ATLAS job with PandaID=5254591102
[2021-11-14 15:19:18] Running command: /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /var/lib/boinc-client/slots/16 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 sh start_atlas.sh
[2021-11-14 15:19:18] Job failed
[2021-11-14 15:19:18] FATAL: container creation failed: hook function for tag prelayer returns error: failed to create /var/lib/alternatives directory: mkdir /var/lib/alternatives: read-only file system
[2021-11-14 15:19:18] ./runtime_log.err
[2021-11-14 15:19:18] ./runtime_log
15:29:18 (64254): run_atlas exited; CPU time 0.330064
15:29:18 (64254): app exit status: 0x1
15:29:18 (64254): called boinc_finish(195)
</stderr_txt>
]]> |
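On the question of where the 10 minutes come from: the timestamps in the stderr output above show the job failing 4 seconds after the wrapper starts, while `run_atlas` only exits at 15:29:18. The log does not say why the script waits that long. The arithmetic, as a small sketch (the helper name `seconds_between` is only for illustration):

```python
from datetime import datetime


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Return the number of seconds from `start` to `end` (same day, same clock)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Timestamps copied from the stderr output above.
wrapper_start = "15:19:14"  # wrapper (7.7.26015): starting
job_failed = "15:19:18"     # Job failed
wrapper_exit = "15:29:18"   # run_atlas exited

print(seconds_between(wrapper_start, job_failed))  # 4
print(seconds_between(job_failed, wrapper_exit))   # 600
```

So the run time reported for the task is almost entirely the gap between the failure and the exit of `run_atlas`, not time spent computing (the CPU time in the log is 0.330064 s).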
Joined: 2 May 07 Posts: 2277 Credit: 178,709,076 RAC: 100,489 |
Your Windows host doesn't have AMD-V enabled (SVM in the BIOS), and the VirtualBox Extension Pack is not installed:
2021-11-08 17:44:06 (12184): Required extension pack not installed, remote desktop not enabled.
2021-11-08 17:44:06 (12184): Enabling shared directory for VM.
2021-11-08 17:44:06 (12184): Starting VM using VBoxManage interface. (boinc_37cb8c7ab419cb26, slot#17)
2021-11-08 17:44:07 (12184): Error in start VM for VM: -2147467259
Command: VBoxManage -q startvm "boinc_37cb8c7ab419cb26" --type headless
Output: Waiting for VM "boinc_37cb8c7ab419cb26" to power on...
VBoxManage.exe: error: Not in a hypervisor partition (HVP=0) (VERR_NEM_NOT_AVAILABLE).
VBoxManage.exe: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)
For Linux there is an installation guide that explains how to install it correctly:
CMS tasks normally run for between 11 and 13 hours. Your tasks are faulty. |
Joined: 23 Sep 11 Posts: 2 Credit: 14,365,711 RAC: 0 |
Thanks for pointing that out. I have 3 boxes and missed that it was misconfigured on that system; it is now corrected. I have two Linux hosts and one Windows 10 box, and ATLAS is failing on all of them. I've double-checked that the VirtualBox Extension Pack is installed on all systems and that it matches the version of VirtualBox. I will check overnight to see how that impacts the tasks. C |
Joined: 15 Jun 08 Posts: 2684 Credit: 286,925,932 RAC: 57,322 |
Regarding this computer: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10700470
Although your CMS tasks get credit points, they don't deliver scientific results. This is mainly caused by a couple of restarts that happen before the job setup has completely finished. The reason for the restarts is not included in the logs, so they are most likely forced by BOINC, possibly because too many tasks are running concurrently and the BOINC client hits some limits. The VMs usually stop when they request additional RAM to run the scientific scripts.
Some typical logfile lines with extra comments from task https://lhcathome.cern.ch/lhcathome/result.php?resultid=332886550
2021-11-15 04:28:49 (240): Detected: vboxwrapper 26202 ### first try
2021-11-15 04:28:49 (240): Detected: BOINC client v7.16.20
. . .
### The VM enters the main application script ...
2021-11-15 04:29:54 (240): Guest Log: [INFO] CMS application starting. Check log files.
### ... and gets a stop signal just a few seconds later.
2021-11-15 04:30:01 (240): Stopping VM.
2021-11-15 04:30:04 (240): Successfully stopped VM.
. . .
### second try
2021-11-15 05:35:24 (14832): Detected: vboxwrapper 26202
. . .
2021-11-15 05:46:44 (14832): Stopping VM.
2021-11-15 05:46:46 (14832): Successfully stopped VM.
. . .
### third try
2021-11-15 05:50:13 (15748): Detected: vboxwrapper 26202
. . .
2021-11-15 06:46:44 (15748): Powering off VM.
2021-11-15 06:46:45 (15748): Successfully stopped VM.
2021-11-15 06:46:45 (15748): Deregistering VM. (boinc_61e8f3b959bdc3f8, slot#13)
2021-11-15 06:46:45 (15748): Removing network bandwidth throttle group from VM.
2021-11-15 06:46:45 (15748): Removing VM from VirtualBox.
06:46:51 (15748): called boinc_finish(0)
The last try finished in less than 1 h, much too short to set up the VM and run a complete subtask, even for the fastest CPU you can get today.
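The restarts can be spotted in any task's stderr output by counting the `Detected: vboxwrapper` lines, since each one carries a new process ID. A minimal sketch, assuming the line format shown in the excerpt above (the names `START_PATTERN` and `find_wrapper_starts` are hypothetical, not part of BOINC or vboxwrapper):

```python
import re

# Matches the line written each time vboxwrapper starts, in the format of the
# excerpt above: "<date> <time> (<pid>): Detected: vboxwrapper <version>".
START_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): Detected: vboxwrapper"
)


def find_wrapper_starts(log_text):
    """Return a list of (timestamp, pid) tuples, one per vboxwrapper start."""
    starts = []
    for line in log_text.splitlines():
        match = START_PATTERN.match(line.strip())
        if match:
            starts.append((match.group(1), int(match.group(2))))
    return starts


sample = """\
2021-11-15 04:28:49 (240): Detected: vboxwrapper 26202
2021-11-15 04:30:01 (240): Stopping VM.
2021-11-15 05:35:24 (14832): Detected: vboxwrapper 26202
2021-11-15 05:46:44 (14832): Stopping VM.
2021-11-15 05:50:13 (15748): Detected: vboxwrapper 26202
"""
print(len(find_wrapper_starts(sample)))  # 3
```

More than one start in a single task means the VM was stopped and resumed at least once; whether that hurt the job still has to be judged from the surrounding lines.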
Regarding this computer: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10674787
The CVMFS client on this computer works, but it is configured to use the original stratum-one servers (here: cernvmfs.gridpp.rl.ac.uk). CERN asks LHC@home users not to do this and to configure openhtc.io instead. See: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5594
[2021-11-15 04:57:43] 2.8.2.0 4860 0 24676 95961 3 1 3688245 4096001 0 130560 0 0 100.000 0 0 http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/atlas.cern.ch DIRECT 1
[2021-11-15 04:57:43] CVMFS is ok
Although the logs report "singularity works", this is only true for very basic commands. Especially on non-CentOS computers, a local Singularity installation works more reliably. This could avoid errors like this:
[2021-11-15 04:57:44] Running command: /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /var/lib/boinc-client/slots/14 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 sh start_atlas.sh
[2021-11-15 04:57:44] Job failed
[2021-11-15 04:57:44] FATAL: container creation failed: hook function for tag prelayer returns error: failed to create /var/lib/alternatives directory: mkdir /var/lib/alternatives: read-only file system |
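Which server a CVMFS client talks to can be read from the HOST column of the `cvmfs_config stat` output in the task log. A small sketch, assuming the column header printed in the full log earlier in this thread and with the `[timestamp]` prefix removed from both lines (the function names are made up for illustration):

```python
def parse_cvmfs_stat(header_line, value_line):
    """Pair the column names of `cvmfs_config stat` with the values on the next line."""
    return dict(zip(header_line.split(), value_line.split()))


def uses_openhtc(stat):
    """Return True if the HOST column points at an openhtc.io server."""
    return "openhtc.io" in stat.get("HOST", "")


header = (
    "VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) "
    "CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) "
    "HOST PROXY ONLINE"
)
values = (
    "2.8.2.0 4860 0 24676 95961 3 1 3688245 4096001 0 130560 0 0 100.000 0 0 "
    "http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/atlas.cern.ch DIRECT 1"
)

stat = parse_cvmfs_stat(header, values)
print(stat["HOST"])        # http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/atlas.cern.ch
print(stat["PROXY"])       # DIRECT
print(uses_openhtc(stat))  # False
```

For this host the check comes out `False`, which matches the hint the task log itself prints about configuring openhtc.io. How to change the CVMFS configuration is covered in the forum thread linked above.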
©2025 CERN