Message boards : ATLAS application : ATLAS vbox version 2.00

Harri Liljeroos
Joined: 28 Sep 04
Posts: 728
Credit: 49,066,038
RAC: 27,302
Message 40207 - Posted: 19 Oct 2019, 11:41:40 UTC

LHC has stopped sending me ATLAS tasks. The last one was sent on 17 Oct. The host is running an older version of VBox (5.1.30), but that shouldn't be a problem: it ran a dozen or so version 2.00 tasks successfully before it stopped receiving them. I get Theory, CMS and sixtracktest tasks without a problem, but ATLAS requests are rejected with the response 'No ATLAS tasks available' in spite of continuous requests, while the server status page shows plenty available and over 10000 in progress.

This is the host in question: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10509390
ID: 40207
maeax
Joined: 2 May 07
Posts: 2243
Credit: 173,902,375
RAC: 2,013
Message 40209 - Posted: 19 Oct 2019, 14:46:25 UTC

Harri,
The ATLAS VM currently has more than one problem getting work to us.
One test in -dev was to use only 6.0.x for the new CentOS image.
Now it is 5.2.32. You can see this in the log of an older finished task.
The main problem at the moment is that a change made on Thursday has been rolled back to the previous version.
David is away this weekend, so we have to wait until Monday for this to be cleared up.
Native ATLAS is running at the moment.
ID: 40209
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,913,105
RAC: 40,847
Message 40210 - Posted: 19 Oct 2019, 15:38:27 UTC

The recent ATLAS vdi includes VBoxGuestAdditions 5.2.32 which should work even with more recent guest additions on the host - although it is recommended to keep guest and host in sync.

What makes me wonder:
The timestamps of /opt/VBoxGuestAdditions-5.2.32 and everything below it inside the vdi are 2019-09-12, while David introduced v2.0 on 2019-10-09, including a new Linux kernel.
I suspect that the vdi's guest additions need to be recompiled to fit into the new kernel.
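One way to check this suspicion from inside the VM is to compare the running kernel with the kernel a module was built for. This is only a minimal sketch: the module name `vboxguest` is an assumption (it is the usual name, but it is not confirmed in this thread), and the helper name `built_for_running_kernel` is made up for illustration.

```shell
#!/bin/bash
# Sketch only: the module name "vboxguest" is an assumption, not taken from the thread.

# Succeeds if the first field of a module's vermagic string equals the given kernel release.
# $1 = kernel release (as printed by `uname -r`)
# $2 = vermagic string (as printed by `modinfo -F vermagic <module>`)
built_for_running_kernel() {
    local running=$1 vermagic=$2
    [ "${vermagic%% *}" = "$running" ]
}

# Intended use inside the guest; prints nothing if modinfo is missing
# or cannot find the module for the running kernel.
if vermagic=$(modinfo -F vermagic vboxguest 2>/dev/null); then
    if built_for_running_kernel "$(uname -r)" "$vermagic"; then
        echo "vboxguest was built for the running kernel $(uname -r)"
    else
        echo "vboxguest was built for ${vermagic%% *}, running kernel is $(uname -r)"
    fi
fi
```

If `modinfo` finds no `vboxguest` module at all for the running kernel, that by itself may be a hint that the guest additions were not rebuilt.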
ID: 40210
maeax
Joined: 2 May 07
Posts: 2243
Credit: 173,902,375
RAC: 2,013
Message 40219 - Posted: 20 Oct 2019, 8:11:55 UTC

If we need a new .vdi for this problem: CentOS 7 has a kernel upgrade from
3.10.0-957.27.2.el7.x86_64 to 3.10.0-1062.1.2.el7.x86_64.
ID: 40219
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 40229 - Posted: 21 Oct 2019, 9:58:21 UTC - in response to Message 40219.  

The vdi is at kernel version 3.10.0-957.27.2.el7.x86_64 and this is the kernel with which the vbox additions were compiled.

I think Jonathan's stuck task was a victim of the "top" change that was enabled on Thursday and reverted on Friday, rather than a virtualbox version problem. I'm trying to figure out what the problem was with that change.

The other problem at the moment is that the server is not giving out many tasks. I had to click a few times on my client and finally got a single task, but this one is working as normal.
ID: 40229
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 455
Credit: 201,266,309
RAC: 27,995
Message 40231 - Posted: 21 Oct 2019, 12:15:09 UTC - in response to Message 40194.  

> Please upgrade to 6.0.x. (with ExtPack)

Maybe I missed it, but I never saw a word from the project team that VirtualBox version 6.x is OK to use with ATLAS. Until then I will stay with 5.x.

David,

which version of VirtualBox do you want us to use for V2?

Latest 6.x or 5.x?


Supporting BOINC, a great concept !
ID: 40231
[VENETO] boboviz
Joined: 7 May 08
Posts: 217
Credit: 1,575,053
RAC: 200
Message 40233 - Posted: 21 Oct 2019, 15:19:28 UTC - in response to Message 40147.  

OK, I've installed the latest version of VirtualBox and the latest extensions.

And now it seems to be OK:

2019-10-21 17:05:28 (5416): Guest Log: HITS file was successfully produced
ID: 40233
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 40240 - Posted: 22 Oct 2019, 10:56:09 UTC - in response to Message 40231.  

> Please upgrade to 6.0.x. (with ExtPack)
>
> Maybe I missed it, but I never saw a word from the project team that VirtualBox version 6.x is OK to use with ATLAS. Until then I will stay with 5.x.
>
> David,
>
> which version of VirtualBox do you want us to use for V2?
>
> Latest 6.x or 5.x?


The answer is use whatever works :)

I myself have tested with 5.2.32 and 6.0.12 successfully on Linux, but I don't have the means to test all combinations of versions and operating systems. If you find a version that works for you then it's fine to stick with it.

The image itself was created with 5.2.32 because problems were reported during the testing on LHC-dev when an image from 6.0 was used.
ID: 40240
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 847
Credit: 691,812,938
RAC: 115,827
Message 40244 - Posted: 22 Oct 2019, 18:13:41 UTC

Most of my computers are on 5.1.38; when I run them at 1 core per WU, this version is more reliable than the newer ones.

I loaded 6.0.14 to see what happens. 6.0.12 is reliable running Theory and CMS as long as the CPU use is about 75%.
ID: 40244
Filipe
Joined: 9 Aug 05
Posts: 36
Credit: 7,698,293
RAC: 0
Message 40274 - Posted: 25 Oct 2019, 10:20:34 UTC

Is there a run time limit for the ATLAS tasks?

I've got 3 of them that have been running for 7 days.
ID: 40274
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1418
Credit: 9,470,586
RAC: 3,147
Message 40279 - Posted: 25 Oct 2019, 11:19:30 UTC - in response to Message 40274.  

7 days continuous running is too long for a multi-core ATLAS task.
Consider aborting the tasks and trying new ones with the improved console monitoring.
ID: 40279
Mumak
Joined: 14 Feb 14
Posts: 5
Credit: 17,818,305
RAC: 0
Message 40285 - Posted: 25 Oct 2019, 14:14:38 UTC

I've got 2 Atlas (1T) tasks, each of them running for over 7 days. Console shows:

Total number of events to be processed: 200
Total number of events already finished: 54
Time left: 0d 14h 38m
---
Last finished...
worker 1: Event nr. 54 took 397.5 s. New average 348.4 +- 9.882

top shows athena.py @ 99% CPU
ID: 40285
Mumak
Joined: 14 Feb 14
Posts: 5
Credit: 17,818,305
RAC: 0
Message 40286 - Posted: 25 Oct 2019, 14:31:08 UTC - in response to Message 40285.  

15 minutes later the number of finished events had increased to 57.
I wonder what the machine has been doing during the past days...
ID: 40286
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,913,105
RAC: 40,847
Message 40287 - Posted: 25 Oct 2019, 14:43:53 UTC - in response to Message 40285.  
Last modified: 25 Oct 2019, 14:48:53 UTC

> I've got 2 Atlas (1T) tasks, each of them running for over 7 days. Console shows:
>
> Total number of events to be processed: 200
> Total number of events already finished: 54
> Time left: 0d 14h 38m
> ---
> Last finished...
> worker 1: Event nr. 54 took 397.5 s. New average 348.4 +- 9.882
>
> top shows athena.py @ 99% CPU

The "New average" shows "+- 9.882".
It is very unusual for this value to be that large.
It indicates that there must have been at least one long runner (a very, very long runner).
The value is logged directly by the scientific app.

All other values look normal, especially "athena.py @ 99% CPU", but there's still a bit of work to do.

I would let it run until it finishes or it hits the BOINC due date.


<edit>
Sorry, my fault.
It's a "." instead of a ",".
A typical German misinterpretation.
But then you are right to ask what the machine has been doing.
</edit>
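
As a rough cross-check of the console values quoted above, the average event time can be multiplied out. This is only a sketch: the helper name `hours_for` is made up for illustration, and it assumes the "New average" applies to every event.

```shell
#!/bin/bash
# Values copied from the console output quoted above.
events_total=200
events_done=54
avg_seconds=348.4

# Prints (number of events * seconds per event) in hours with one decimal.
# awk is used because bash only has integer arithmetic; LC_ALL=C keeps "." as the decimal point.
hours_for() {
    LC_ALL=C awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n * s / 3600 }'
}

echo "Finished events account for about $(hours_for "$events_done" "$avg_seconds") h"
echo "Remaining events need about $(hours_for "$((events_total - events_done))" "$avg_seconds") h"
```

With these numbers the finished events account for about 5.2 hours and the remaining ones need about 14.1 hours. The second figure is in the same range as the console's "Time left: 0d 14h 38m", while the first is far below the reported 7 days of run time.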
ID: 40287
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 40346 - Posted: 31 Oct 2019, 16:17:29 UTC - in response to Message 40287.  
Last modified: 31 Oct 2019, 16:18:10 UTC

I manually aborted one task that was doing nothing: https://lhcathome.cern.ch/lhcathome/result.php?resultid=250466685
Why was this task not aborted by the client (e.g. exit status: 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT)?
2019-10-31 14:38:31 (2419): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-31 14:38:31 (2419): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2019-10-31 14:38:31 (2419): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!


2 other tasks still running. ;)
2019-10-31 03:26:08 (30814): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-31 03:26:08 (30814): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2019-10-31 03:26:11 (30814): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2019-10-31 17:06:19 (1960): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-31 17:06:19 (1960): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2019-10-31 17:06:23 (1960): Guest Log: Probing /cvmfs/grid.cern.ch... OK


New task running.
2019-10-31 13:57:56 (12291): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... OK
2019-10-31 13:57:59 (12291): Guest Log: Probing /cvmfs/grid.cern.ch... OK
The third probing attempt is missing. :)
ID: 40346
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 40352 - Posted: 4 Nov 2019, 1:55:47 UTC - in response to Message 40346.  

When the VM is booted, the following happens:

- A small script is automatically run
- This script checks CVMFS by running the probe command
- If the probe fails, the probe logs the messages like you saw but then continues (I changed this not to fail the job because the probe failure can be temporary)
- The script then copies another script (the "bootstrap script") from CVMFS and runs it. The bootstrap script takes care of setting everything up for the task then starting the real work

It is done like this so that we can make changes simply by putting a new version of the bootstrap script on CVMFS instead of having to create a new VM image and app version each time.

I think what is happening in your case is that there is a problem with CVMFS which causes the copy of the bootstrap script to hang forever. I can put a timeout around this to avoid blocking the task forever, but I'll need to make changes in the VM and make a new app version. I will look into it next week since I am away at a conference this week.
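
The idea of a timeout around the copy could look roughly like the sketch below. It is not the project's actual script: the function name `fetch_bootstrap`, the paths and the 300 second limit are placeholders chosen for illustration. It relies on the coreutils `timeout` command, which exits with status 124 when the limit is reached.

```shell
#!/bin/bash
# Sketch only: function name, paths and the time limit are placeholders,
# not the project's real values.

# Copies the bootstrap script but gives up after a time limit instead of waiting forever.
# $1 = source path on CVMFS, $2 = destination path, $3 = limit in seconds
fetch_bootstrap() {
    local src=$1 dst=$2 limit=$3
    timeout "$limit" cp "$src" "$dst"
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "Copy of $src timed out after ${limit}s" >&2
    elif [ "$rc" -ne 0 ]; then
        echo "Copy of $src failed with exit code $rc" >&2
    fi
    return "$rc"
}

# Example call (hypothetical paths):
# fetch_bootstrap /cvmfs/atlas.cern.ch/path/to/bootstrap.sh /tmp/bootstrap.sh 300 && sh /tmp/bootstrap.sh
```

A non-zero return code lets the caller fail the task cleanly. Note that `timeout` only sends a signal; a process blocked deep inside a hung filesystem mount may not react to it, so this would need testing against a real CVMFS failure.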
ID: 40352
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 40354 - Posted: 4 Nov 2019, 9:42:15 UTC - in response to Message 40352.  

> - If the probe fails, the probe logs the messages like you saw but then continues (I changed this not to fail the job because the probe failure can be temporary)
So, does the script check it again, or does something else try to download a job regardless? What's the interval?

> I think what is happening in your case is that there is a problem with CVMFS which causes the copy of the bootstrap script to hang forever. I can put a timeout around this to avoid blocking the task forever, but I'll need to make changes in the VM and make a new app version. I will look into it next week since I am away at a conference this week.
Yeah, thank you. Otherwise I can write a bash script that parses stderr.txt and automatically aborts the affected task when three "Probing /cvmfs/*... Failed!" lines are logged (and those three lines must be consecutive).
ID: 40354
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 40356 - Posted: 4 Nov 2019, 20:03:05 UTC - in response to Message 40354.  
Last modified: 4 Nov 2019, 20:30:33 UTC

> Yeah, thank you. Otherwise I can write a bash script that parses stderr.txt and automatically aborts the affected task when three "Probing /cvmfs/*... Failed!" lines are logged (and those three lines must be consecutive).
OK, it should work.

#!/bin/bash
# Suspends ATLAS tasks whose stderr.txt shows 3 consecutive failed CVMFS probes.

boinc_path="/home/luis/Applicazioni/boinc"
lhc_project_url="https://lhcathome.cern.ch/lhcathome/"
atlas_app_name="ATLAS"
boinccmd="./boinccmd"

# Returns 0 (success) if the slot directory $1 holds an ATLAS task.
function isAtlasTask()
{
	local init_data="$1/init_data.xml"
	[ -e "$init_data" ] || return 1
	local app_name
	app_name=$(sed -n 's|.*<app_name>\([^<]*\)</app_name>.*|\1|p' "$init_data" | head -n 1)
	[[ "$app_name" == "$atlas_app_name" ]]
}

# Loop over the slot directories that actually exist; slot numbers may have gaps.
for slot in "$boinc_path"/slots/*/; do
	slot=${slot%/}
	isAtlasTask "$slot" || continue
	stderr="$slot/stderr.txt"
	[ -r "$stderr" ] || continue
	c=0    # length of the current run of consecutive failed probes
	max=0  # longest run found in this file
	while IFS= read -r line; do
		if [[ "$line" == *"Probing /cvmfs/"*"... Failed!"* ]]; then
			c=$((c+1))
			if [ $c -gt $max ]; then max=$c; fi
		else
			c=0  # any other line ends the run
		fi
	done < "$stderr"
	echo "$max consecutive probing fails found in $stderr"
	if [ $max -ge 3 ]; then
		task_name=$(sed -n 's|.*<result_name>\([^<]*\)</result_name>.*|\1|p' "$slot/boinc_task_state.xml")
		cd "$boinc_path" && "$boinccmd" --task "$lhc_project_url" "$task_name" suspend #abort
		echo "$task_name suspended!" #aborted!"
	fi
done

When the script finds 3 consecutive probing fails, it suspends the affected ATLAS task.
Edit the boinc_path variable to point to your BOINC path.
Edit the boinccmd variable depending on whether you have a standalone or service BOINC client.
Delete "suspend" and uncomment "abort" if you want more destructive behaviour.

Call this script every 15 minutes by a command line like this:
watch -n 900 /your_script_path/AtlasProbingFailedCheck.sh
ID: 40356
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,913,105
RAC: 40,847
Message 40357 - Posted: 4 Nov 2019, 21:40:41 UTC - in response to Message 40356.  

Nice little script but it should be investigated why CVMFS fails.

CVMFS configuration usually lists a couple of servers to be used either
- as main server followed by a couple of spare servers if the main server fails or
- as a list of servers to be tested by the CVMFS geolocation API. In this case the nearest server will be used.


Typical config entries

/etc/cvmfs/default.local
CVMFS_REPOSITORIES="atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch,cernvm-prod.cern.ch"


/etc/cvmfs/domain.d/cern.ch.local
CVMFS_SERVER_URL="http://s1cern-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1ral-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1bnl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1fnal-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1unl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1asgc-cvmfs.openhtc.io:8080/cvmfs/@fqrn@;http://s1ihep-cvmfs.openhtc.io/cvmfs/@fqrn@"

# set to 'yes' activates the geo API, set to 'no' deactivates it
CVMFS_USE_GEOAPI=yes


It should be checked (by the project team) if CVMFS_SERVER_URL lists at least 4 servers. Then it's very unlikely that all of them fail at the same moment.

Client side issues could be:
- wrong firewall settings, e.g. closed ports or filtered destinations
- slow DNS resolving
- high load on the router (not the same as high bandwidth usage!) that causes timeouts
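
To narrow down the client-side causes listed above, the configured server list can be expanded and each server tested from the affected machine. The following is a minimal sketch, not a tool from the project: the function names `expand_server_urls` and `check_server` are made up, the 10 second limit is arbitrary, and it assumes each server exposes the repository manifest `.cvmfspublished` under the expanded URL.

```shell
#!/bin/bash
# Server list copied from the CVMFS_SERVER_URL entry quoted above.
CVMFS_SERVER_URL="http://s1cern-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1ral-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1bnl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1fnal-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1unl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1asgc-cvmfs.openhtc.io:8080/cvmfs/@fqrn@;http://s1ihep-cvmfs.openhtc.io/cvmfs/@fqrn@"

# Prints one URL per line with @fqrn@ replaced by the repository name.
# $1 = semicolon-separated server list, $2 = repository name
expand_server_urls() {
    local list=$1 repo=$2 url
    local -a urls
    IFS=';' read -ra urls <<< "$list"
    for url in "${urls[@]}"; do
        echo "${url//@fqrn@/$repo}"
    done
}

# Prints the HTTP status code returned for the repository manifest,
# or 000 if the server gives no answer within the time limit.
# $1 = expanded server URL, $2 = time limit in seconds
check_server() {
    curl -s -o /dev/null -m "$2" -w '%{http_code}\n' "$1/.cvmfspublished"
}

count=$(expand_server_urls "$CVMFS_SERVER_URL" atlas.cern.ch | grep -c '')
echo "$count servers configured"

# To test reachability from the client (needs network access):
# expand_server_urls "$CVMFS_SERVER_URL" atlas.cern.ch | while read -r u; do echo "$u $(check_server "$u" 10)"; done
```

If every server answers quickly here but the probes inside the VM still fail, the firewall, DNS or router explanations become less likely and the problem is more probably inside the VM.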
ID: 40357
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1418
Credit: 9,470,586
RAC: 3,147
Message 40361 - Posted: 6 Nov 2019, 10:49:18 UTC - in response to Message 40146.  

> > Your logfile shows that the task started/paused several times and couldn't successfully write its snapshot
> I crunch with my notebook, so I turn it off sometimes.
> But why no snapshot if we use VirtualBox?
I suspended a task for over 20 hours (saved to disk) and after the resume it returned a result with success.

2019-11-04 14:43:55 (10352): Stopping VM.
2019-11-05 10:51:32 (10892): Detected: vboxwrapper 26197

https://lhcathome.cern.ch/lhcathome/result.php?resultid=250890961

Taking a snapshot must not take longer than 60 seconds; otherwise the VM state will be 'aborted' (not saved) and after a resume the task starts from scratch (if you're lucky).
ID: 40361


©2024 CERN