Message boards : ATLAS application : All new tasks failing after about 5 minutes

Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,329
RAC: 101,962
Message 45636 - Posted: 8 Nov 2021, 14:35:49 UTC

ID: 45636 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,148,997
RAC: 15,990
Message 45637 - Posted: 8 Nov 2021, 14:42:51 UTC

I have one of those too. Yours and mine seem to fail with:
2021-11-08 11:48:55 (19628): Guest Log: 2021-11-08 09:48:54,869 [wrapper] local tarball pilot2.tar.gz exists OK
2021-11-08 11:48:55 (19628): Guest Log: gzip: stdin: unexpected end of file
2021-11-08 11:48:55 (19628): Guest Log: tar: Skipping to next header
2021-11-08 11:48:55 (19628): Guest Log: tar: Child returned status 1
2021-11-08 11:48:55 (19628): Guest Log: tar: Error is not recoverable: exiting now
2021-11-08 11:48:55 (19628): Guest Log: 2021-11-08 09:48:54,894 [wrapper] ERROR: pilot extraction failed for pilot2.tar.gz
2021-11-08 11:48:55 (19628): Guest Log: 2021-11-08 09:48:54,895 [wrapper] ERROR: pilot extraction failed for pilot2.tar.gz
2021-11-08 11:48:55 (19628): Guest Log: 2021-11-08 09:48:54,896 [wrapper] FATAL: failed to get pilot code
2021-11-08 11:48:55 (19628): Guest Log: 2021-11-08 09:48:54,897 [wrapper] FATAL: failed to get pilot code
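The "gzip: stdin: unexpected end of file" line suggests the pilot2.tar.gz archive is truncated. As a rough sketch of how one could confirm that on a copy of the file, the following reads the whole gzip stream and reports whether it ends cleanly. The function name and the chunk size are my own choices, and the tarball lives inside the VM, so this only applies if you can get at a copy of it.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip stream at `path` decompresses to its end marker."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard the data; a truncated stream raises EOFError.
            while fh.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, "complete" if gzip_is_complete(name) else "truncated or corrupt")
```

A truncated download would show up as "truncated or corrupt" here, which would match the tar errors in the log.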

ID: 45637 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,329
RAC: 101,962
Message 45638 - Posted: 8 Nov 2021, 14:50:36 UTC - in response to Message 45637.  

Thanks, Harri, for the information. So at least I know the problem does not seem to be on my system.
ID: 45638 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,084,098
RAC: 105,021
Message 45639 - Posted: 8 Nov 2021, 14:51:29 UTC

Native ATLAS also has some tasks with errors:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=175055033
ID: 45639 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,329
RAC: 101,962
Message 45641 - Posted: 8 Nov 2021, 14:55:58 UTC - in response to Message 45639.  

Native ATLAS also has some tasks with errors:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=175055033

okay - seems like a faulty batch :-(
ID: 45641 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,329
RAC: 101,962
Message 45643 - Posted: 8 Nov 2021, 15:46:21 UTC

The next few seem to be okay so far.
ID: 45643 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,329
RAC: 101,962
Message 45650 - Posted: 8 Nov 2021, 21:19:32 UTC

Well, there were some more this evening :-(
ID: 45650 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,338,329
RAC: 101,962
Message 45656 - Posted: 9 Nov 2021, 21:18:03 UTC

These faulty tasks are still coming in, like this one from about 1 hour ago:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=332106332
ID: 45656 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,084,098
RAC: 105,021
Message 45657 - Posted: 9 Nov 2021, 21:48:50 UTC

I also had one this morning: https://lhcathome.cern.ch/lhcathome/result.php?resultid=332081571
2021-11-09 10:10:31 (3516): Guest Log: 2021-11-09 09:10:30,186 [wrapper] Using piloturl: local
2021-11-09 10:10:31 (3516): Guest Log: 2021-11-09 09:10:30,190 [wrapper] piloturl=local so download not needed
2021-11-09 10:10:31 (3516): Guest Log: 2021-11-09 09:10:30,192 [wrapper] local tarball pilot2.tar.gz exists OK
2021-11-09 10:10:31 (3516): Guest Log: gzip: stdin: unexpected end of file
2021-11-09 10:10:31 (3516): Guest Log: tar: Skipping to next header
2021-11-09 10:10:31 (3516): Guest Log: tar: Child returned status 1
ID: 45657 · Report as offensive     Reply Quote
Cody

Joined: 23 Sep 11
Posts: 2
Credit: 14,365,711
RAC: 0
Message 45696 - Posted: 14 Nov 2021, 19:26:13 UTC

My Atlas tasks are also failing after 10 minutes +- 40 seconds. CMS vbox tasks complete on the same host.

System setup: recent Windows 10 install, current BOINC version with VirtualBox installed.
12 CPU, 32 GB memory, twin 1050 Ti cards.

One example of a failed task is below

https://lhcathome.cern.ch/lhcathome/result.php?resultid=332861223

Looking at the log, it appears to fail almost immediately, just 4 seconds after starting, so I'm not sure where that 10 minutes is calculated from.


[2021-11-14 15:19:18] Running command: /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /var/lib/boinc-client/slots/16 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 sh start_atlas.sh
[2021-11-14 15:19:18] Job failed
[2021-11-14 15:19:18] FATAL: container creation failed: hook function for tag prelayer returns error: failed to create /var/lib/alternatives directory: mkdir /var/lib/alternatives: read-only file system
[2021-11-14 15:19:18] ./runtime_log.err
[2021-11-14 15:19:18] ./runtime_log

Any ideas where I should look to fix this?

C


Full log below.

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
15:19:14 (64254): wrapper (7.7.26015): starting
15:19:14 (64254): wrapper: running run_atlas (--nthreads 8)
[2021-11-14 15:19:14] Arguments: --nthreads 8
[2021-11-14 15:19:14] Threads: 8
[2021-11-14 15:19:14] Checking for CVMFS
[2021-11-14 15:19:15] Probing /cvmfs/atlas.cern.ch... OK
[2021-11-14 15:19:15] Probing /cvmfs/atlas-condb.cern.ch... OK
[2021-11-14 15:19:15] Running cvmfs_config stat atlas.cern.ch
[2021-11-14 15:19:15] VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
[2021-11-14 15:19:15] 2.8.2.0 64387 0 24772 95926 4 1 2997157 4096000 0 130560 0 0 100.000 0 0 http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/atlas.cern.ch DIRECT 1
[2021-11-14 15:19:15] CVMFS is ok
[2021-11-14 15:19:15] Efficiency of ATLAS tasks can be improved by the following measure(s):
[2021-11-14 15:19:15] The CVMFS client on this computer should be configured to use Cloudflare's openhtc.io.
[2021-11-14 15:19:15] Small home clusters do not require a local http proxy but it is suggested if
[2021-11-14 15:19:15] more than 10 cores throughout the same LAN segment are regularly running ATLAS like tasks.
[2021-11-14 15:19:15] Further information can be found at the LHC@home message board.
[2021-11-14 15:19:15] Using singularity image /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7
[2021-11-14 15:19:15] Checking for singularity binary...
[2021-11-14 15:19:15] Singularity is not installed, using version from CVMFS
[2021-11-14 15:19:15] Checking singularity works with /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 hostname
[2021-11-14 15:19:16] thor
[2021-11-14 15:19:16] Singularity works
[2021-11-14 15:19:18] Set ATHENA_PROC_NUMBER=8
[2021-11-14 15:19:18] Starting ATLAS job with PandaID=5254591102
[2021-11-14 15:19:18] Running command: /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /var/lib/boinc-client/slots/16 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 sh start_atlas.sh
[2021-11-14 15:19:18] Job failed
[2021-11-14 15:19:18] FATAL: container creation failed: hook function for tag prelayer returns error: failed to create /var/lib/alternatives directory: mkdir /var/lib/alternatives: read-only file system
[2021-11-14 15:19:18] ./runtime_log.err
[2021-11-14 15:19:18] ./runtime_log
15:29:18 (64254): run_atlas exited; CPU time 0.330064
15:29:18 (64254): app exit status: 0x1
15:29:18 (64254): called boinc_finish(195)

</stderr_txt>
]]>
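On the question of where the 10 minutes comes from: the log shows "Job failed" at 15:19:18 and "run_atlas exited" at 15:29:18, with only 0.33 s of CPU time, so the runtime looks like a wait before the wrapper exits rather than actual computation. A small sketch for measuring that gap from a saved stderr log follows. The helper names are mine, and it assumes both markers appear in the log and that the gap is under 24 hours.

```python
import re
from datetime import datetime, timedelta

CLOCK_RE = re.compile(r"\d{2}:\d{2}:\d{2}")


def first_time(lines, marker):
    """Return the HH:MM:SS clock time of the first line containing `marker`."""
    for line in lines:
        if marker in line:
            match = CLOCK_RE.search(line)
            if match:
                return datetime.strptime(match.group(0), "%H:%M:%S")
    raise ValueError(f"marker not found: {marker!r}")


def gap_between(lines, first_marker, second_marker):
    """Time elapsed between two marker lines, allowing for a midnight rollover."""
    delta = first_time(lines, second_marker) - first_time(lines, first_marker)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta


log = [
    "[2021-11-14 15:19:18] Job failed",
    "15:29:18 (64254): run_atlas exited; CPU time 0.330064",
]
print(gap_between(log, "Job failed", "run_atlas exited"))
```

For the two lines taken from the log above, the gap is exactly ten minutes.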
ID: 45696 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,084,098
RAC: 105,021
Message 45697 - Posted: 14 Nov 2021, 20:01:49 UTC - in response to Message 45696.  
Last modified: 14 Nov 2021, 20:13:12 UTC

Your Windows host doesn't have AMD-V enabled (SVM in BIOS), and the VirtualBox Extension Pack is not installed:
2021-11-08 17:44:06 (12184): Required extension pack not installed, remote desktop not enabled.
2021-11-08 17:44:06 (12184): Enabling shared directory for VM.
2021-11-08 17:44:06 (12184): Starting VM using VBoxManage interface. (boinc_37cb8c7ab419cb26, slot#17)
2021-11-08 17:44:07 (12184): Error in start VM for VM: -2147467259
Command:
VBoxManage -q startvm "boinc_37cb8c7ab419cb26" --type headless
Output:
Waiting for VM "boinc_37cb8c7ab419cb26" to power on...
VBoxManage.exe: error: Not in a hypervisor partition (HVP=0) (VERR_NEM_NOT_AVAILABLE).
VBoxManage.exe: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)
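If you have several hosts to check, a quick scan of the task stderr for these two messages can save some clicking. This is only a sketch: the marker strings are copied from the log above, while the dictionary, the function name, and the hint wording are my own.

```python
# Marker strings come from the vboxwrapper log above; hints are informal notes.
KNOWN_MARKERS = {
    "VERR_SVM_DISABLED": "AMD-V is disabled: enable SVM in the BIOS/UEFI setup",
    "Required extension pack not installed": (
        "install the VirtualBox Extension Pack matching the VirtualBox version"
    ),
}


def diagnose(log_text):
    """Return a hint for every known marker that occurs in the log text."""
    return [hint for marker, hint in KNOWN_MARKERS.items() if marker in log_text]


sample = (
    "2021-11-08 17:44:06 (12184): Required extension pack not installed, "
    "remote desktop not enabled.\n"
    "VBoxManage.exe: error: AMD-V is disabled in the BIOS (or by the host OS) "
    "(VERR_SVM_DISABLED)\n"
)
for hint in diagnose(sample):
    print("-", hint)
```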

For Linux, there is an installation guide describing how to install it correctly.
CMS tasks normally run between 11 and 13 hours. Your tasks are faulty.
ID: 45697 · Report as offensive     Reply Quote
Cody

Joined: 23 Sep 11
Posts: 2
Credit: 14,365,711
RAC: 0
Message 45700 - Posted: 15 Nov 2021, 0:56:22 UTC - in response to Message 45697.  

Thanks for pointing that out. I have 3 boxes and missed that it was misconfigured on that system; it is now corrected.

I have two Linux hosts and one Windows 10 box, and ATLAS is failing on all of them. I've double-checked that the VirtualBox Extension Pack is installed on all systems and that it matches the version of VirtualBox. I will check overnight to see how that impacts the tasks.

C
ID: 45700 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,893,032
RAC: 138,165
Message 45703 - Posted: 15 Nov 2021, 7:43:27 UTC - in response to Message 45700.  

Regarding this computer:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10700470

Although your CMS tasks get credit points, they don't deliver scientific results.
This is mainly caused by a couple of restarts while the job setup is not completely finished.
The reason for the restarts is not included in the logs, so they are most likely forced by BOINC. This could be because too many tasks are running concurrently and the BOINC client hits some limits.
The VMs usually stop when they request additional RAM to run the scientific scripts.

Some typical logfile lines with extra comments from task https://lhcathome.cern.ch/lhcathome/result.php?resultid=332886550
2021-11-15 04:28:49 (240): Detected: vboxwrapper 26202
### first try
2021-11-15 04:28:49 (240): Detected: BOINC client v7.16.20
.
.
.
### The VM enters the main application script ...
2021-11-15 04:29:54 (240): Guest Log: [INFO] CMS application starting. Check log files.
### ... and gets a stop signal just a few seconds later.
2021-11-15 04:30:01 (240): Stopping VM.
2021-11-15 04:30:04 (240): Successfully stopped VM.
.
.
.
### second try
2021-11-15 05:35:24 (14832): Detected: vboxwrapper 26202
.
.
.
2021-11-15 05:46:44 (14832): Stopping VM.
2021-11-15 05:46:46 (14832): Successfully stopped VM.
.
.
.
### third try
2021-11-15 05:50:13 (15748): Detected: vboxwrapper 26202
.
.
.
2021-11-15 06:46:44 (15748): Powering off VM.
2021-11-15 06:46:45 (15748): Successfully stopped VM.
2021-11-15 06:46:45 (15748): Deregistering VM. (boinc_61e8f3b959bdc3f8, slot#13)
2021-11-15 06:46:45 (15748): Removing network bandwidth throttle group from VM.
2021-11-15 06:46:45 (15748): Removing VM from VirtualBox.
06:46:51 (15748): called boinc_finish(0)

The last try finished in less than 1 h, which is much too short to set up the VM and run a complete subtask, even for the fastest CPU you can get today.
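To spot such restarts without reading the whole log, one can count how many different wrapper processes wrote to it. The sketch below relies on an assumption taken from the excerpt above: each new try shows up with a new process ID in parentheses (240, 14832, 15748). The function name is made up for this example.

```python
import re

PID_RE = re.compile(r"\((\d+)\): ")


def wrapper_pids(lines):
    """Return the distinct process IDs in order of first appearance."""
    seen = []
    for line in lines:
        match = PID_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


log = [
    "2021-11-15 04:28:49 (240): Detected: vboxwrapper 26202",
    "2021-11-15 04:30:01 (240): Stopping VM.",
    "2021-11-15 05:35:24 (14832): Detected: vboxwrapper 26202",
    "2021-11-15 05:46:44 (14832): Stopping VM.",
    "2021-11-15 05:50:13 (15748): Detected: vboxwrapper 26202",
    "06:46:51 (15748): called boinc_finish(0)",
]
pids = wrapper_pids(log)
print(f"{len(pids)} tries, PIDs: {', '.join(pids)}")
```

More than one PID in a CMS task log means the VM was stopped and started again at least once.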




Regarding this computer:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10674787
The CVMFS client on this computer works but it is configured to use the original stratum-one-servers (here: cernvmfs.gridpp.rl.ac.uk).
CERN requests LHC@home users not to do this but to configure openhtc.io instead.
See:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5594
[2021-11-15 04:57:43] 2.8.2.0 4860 0 24676 95961 3 1 3688245 4096001 0 130560 0 0 100.000 0 0 http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/atlas.cern.ch DIRECT 1
[2021-11-15 04:57:43] CVMFS is ok
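The stat line is easier to read when paired with the column header that `cvmfs_config stat` prints just before it in the task log. Here is a minimal sketch that does the pairing and checks the HOST column for openhtc.io. The function names are hypothetical, and the header is the one shown in the native ATLAS task log earlier in this thread.

```python
import re

STAMP_RE = re.compile(r"^\[[^\]]*\]\s*")


def parse_cvmfs_stat(header_line, value_line):
    """Pair the column names with the values of one `cvmfs_config stat` row."""
    keys = STAMP_RE.sub("", header_line).split()
    values = STAMP_RE.sub("", value_line).split()
    if len(keys) != len(values):
        raise ValueError("header and value line have different column counts")
    return dict(zip(keys, values))


def uses_openhtc(stat):
    """True if the HOST column points at an openhtc.io server."""
    return "openhtc.io" in stat["HOST"]


header = (
    "VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) "
    "CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) "
    "HOST PROXY ONLINE"
)
row = (
    "[2021-11-15 04:57:43] 2.8.2.0 4860 0 24676 95961 3 1 3688245 4096001 0 "
    "130560 0 0 100.000 0 0 "
    "http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/atlas.cern.ch DIRECT 1"
)
stat = parse_cvmfs_stat(header, row)
print(stat["HOST"], stat["PROXY"], "openhtc.io in use:", uses_openhtc(stat))
```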


Although the logs report "singularity works", this is only true for very basic commands.
Especially on non-CentOS computers, a local Singularity installation works more reliably.
This could avoid errors like this:
[2021-11-15 04:57:44] Running command: /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /var/lib/boinc-client/slots/14 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 sh start_atlas.sh
[2021-11-15 04:57:44] Job failed
[2021-11-15 04:57:44] FATAL:   container creation failed: hook function for tag prelayer returns error: failed to create /var/lib/alternatives directory: mkdir /var/lib/alternatives: read-only file system
ID: 45703 · Report as offensive     Reply Quote
