ATLAS vbox version 2.00

Author	Message
Harri Liljeroos Joined: 28 Sep 04 Posts: 763 Credit: 56,499,436 RAC: 28,972	Message 40207 - Posted: 19 Oct 2019, 11:41:40 UTC
LHC has stopped sending me ATLAS tasks. The last one was sent on 17 Oct. The host is running an older version of VBox (5.1.30), but that shouldn't be a problem: it ran a dozen or so version 2.00 tasks successfully before it stopped receiving them. I get Theory, CMS and sixtracktest tasks without a problem, but ATLAS requests are rejected with the response 'No ATLAS tasks available' despite continuous requests, even though the server status page shows plenty available and over 10000 in progress. This is the host in question: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10509390
ID: 40207

maeax Joined: 2 May 07 Posts: 2267 Credit: 175,671,719 RAC: 148	Message 40209 - Posted: 19 Oct 2019, 14:46:25 UTC
Harri, the ATLAS VM has more than one problem getting work to us. One test on -dev was to use only 6.0.x for the new CentOS image. Now it is 5.2.32. You can see this in the log of an older finished task. The main problem at the moment is that a change made on Thursday was reverted to the previous version. David is away this weekend, so we have to wait until Monday for this to be cleared up. Native ATLAS is running at the moment.
ID: 40209

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2636 Credit: 274,344,010 RAC: 105,071	Message 40210 - Posted: 19 Oct 2019, 15:38:27 UTC
The recent ATLAS vdi includes VBoxGuestAdditions 5.2.32, which should work even with more recent guest additions on the host, although it is recommended to keep guest and host in sync. What makes me wonder: the timestamps of /opt/VBoxGuestAdditions-5.2.32 and everything below it inside the vdi are 2019-09-12, while David introduced v2.0 on 2019-10-09, including a new Linux kernel. I suspect that the vdi's guest additions need to be recompiled to fit the new kernel.
ID: 40210

maeax Joined: 2 May 07 Posts: 2267 Credit: 175,671,719 RAC: 148	Message 40219 - Posted: 20 Oct 2019, 8:11:55 UTC
If we need a new .vdi for this problem: CentOS 7 has a kernel upgrade from 3.10.0-957.27.2.el7.x86_64 to 3.10.0-1062.1.2.el7.x86_64.
ID: 40219

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 40229 - Posted: 21 Oct 2019, 9:58:21 UTC - in response to Message 40219.
The vdi is at kernel version 3.10.0-957.27.2.el7.x86_64, and this is the kernel with which the vbox additions were compiled. I think Jonathan's stuck task was a victim of the "top" change that was enabled on Thursday and reverted on Friday, rather than a VirtualBox version problem. I'm trying to figure out what the problem with that change was. The other problem at the moment is that the server is not giving out many tasks: I had to click a few times on my client and finally got a single task. But this one is working as normal.
ID: 40229

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 455 Credit: 213,665,777 RAC: 18,409	Message 40231 - Posted: 21 Oct 2019, 12:15:09 UTC - in response to Message 40194.
"Please upgrade to 6.0.x. (with ExtPack)"
Maybe I missed it, but I never saw a word from the project team that VirtualBox version 6.x is okay to use with ATLAS. Until I do, I will stay with 5.x.
David, which version of VirtualBox do you want us to use for V2? The latest 6.x or 5.x?
Supporting BOINC, a great concept!
ID: 40231

[VENETO] boboviz Joined: 7 May 08 Posts: 236 Credit: 1,575,053 RAC: 0	Message 40233 - Posted: 21 Oct 2019, 15:19:28 UTC - in response to Message 40147.
OK, I've installed the latest version of VirtualBox and the latest extensions, and now it seems to be OK:
2019-10-21 17:05:28 (5416): Guest Log: HITS file was successfully produced
ID: 40233

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 40240 - Posted: 22 Oct 2019, 10:56:09 UTC - in response to Message 40231.
"Please upgrade to 6.0.x. (with ExtPack)"
"Maybe I missed it, but I never saw a word from the project team that VirtualBox version 6.x is okay to use with ATLAS. Until I do, I will stay with 5.x. David, which version of VirtualBox do you want us to use for V2? The latest 6.x or 5.x?"
The answer is: use whatever works :) I myself have tested with 5.2.32 and 6.0.12 successfully on Linux, but I don't have the means to test all combinations of versions and operating systems. If you find a version that works for you, then it's fine to stick with it. The image itself was created with 5.2.32 because problems were reported during testing on LHC-dev when an image from 6.0 was used.
ID: 40240

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 869 Credit: 725,346,358 RAC: 168,028	Message 40244 - Posted: 22 Oct 2019, 18:13:41 UTC
Most of my computers are on 5.1.38; if I run my computer at 1 core/WU, this is more reliable than the newer versions. I loaded 6.0.14 to see what happens. 6.0.12 is reliable running Theory and CMS as long as CPU use is about 75%.
ID: 40244

Filipe Joined: 9 Aug 05 Posts: 36 Credit: 7,698,293 RAC: 0	Message 40274 - Posted: 25 Oct 2019, 10:20:34 UTC
Is there a run time limit for the ATLAS tasks? I've got 3 of them that have been running for 7 days.
ID: 40274

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1449 Credit: 9,730,720 RAC: 372	Message 40279 - Posted: 25 Oct 2019, 11:19:30 UTC - in response to Message 40274.
7 days of continuous running is too long for a multi-core ATLAS task. Consider aborting the tasks and trying new ones with the improved Console monitoring.
ID: 40279

Mumak Joined: 14 Feb 14 Posts: 5 Credit: 17,818,305 RAC: 0	Message 40285 - Posted: 25 Oct 2019, 14:14:38 UTC
I've got 2 Atlas (1T) tasks, each of them running for over 7 days. Console shows:
Total number of events to be processed: 200
Total number of events already finished: 54
Time left: 0d 14h 38m
--- Last finished...
worker 1: Event nr. 54 took 397.5 s. New average 348.4 +- 9.882
top shows athena.py @ 99% CPU
ID: 40285

Mumak Joined: 14 Feb 14 Posts: 5 Credit: 17,818,305 RAC: 0	Message 40286 - Posted: 25 Oct 2019, 14:31:08 UTC - in response to Message 40285.
15 minutes later the number of events finished had increased to 57. I wonder what the machine has been doing during the past days...
ID: 40286

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2636 Credit: 274,344,010 RAC: 105,071	Message 40287 - Posted: 25 Oct 2019, 14:43:53 UTC - in response to Message 40285. Last modified: 25 Oct 2019, 14:48:53 UTC
"I've got 2 Atlas (1T) tasks, each of them running for over 7 days. Console shows:
Total number of events to be processed: 200
Total number of events already finished: 54
Time left: 0d 14h 38m
--- Last finished...
worker 1: Event nr. 54 took 397.5 s. New average 348.4 +- 9.882
top shows athena.py @ 99% CPU"
The "New average" shows "+- 9.882". It is very unusual for this value to be that large. It indicates that there must have been at least 1 longrunner (a veryverylongrunner). The value is logged directly by the scientific app.
All other values look normal, especially "athena.py @ 99% CPU", but there's still a bit of work to do. I would let it run until it finishes or hits the BOINC due date.
<edit> Sorry, my fault. It's a "." instead of a ",". Typical German misinterpretation. But then you are right to ask what the machine has been doing. </edit>
ID: 40287
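As a rough cross-check of the console figures quoted above, the sketch below multiplies the event counts by the reported average time per event. It assumes a single worker and a constant time per event, and it truncates the 348.4 s average to 348 s so that bash integer arithmetic can be used; the function names are made up for this example.

```bash
#!/bin/bash
# Values taken from the console output quoted in the thread.
events_total=200
events_done=54
avg_seconds=348   # "New average 348.4", truncated for integer arithmetic

# estimate_seconds EVENTS AVG: seconds needed for EVENTS at AVG seconds each
estimate_seconds() {
    echo $(( $1 * $2 ))
}

# to_hm SECONDS: format a duration as "<hours>h <minutes>m"
to_hm() {
    echo "$(( $1 / 3600 ))h $(( ($1 % 3600) / 60 ))m"
}

remaining=$(estimate_seconds $(( events_total - events_done )) "$avg_seconds")
whole=$(estimate_seconds "$events_total" "$avg_seconds")
echo "remaining events: $(to_hm "$remaining")"   # prints "remaining events: 14h 6m"
echo "whole task:       $(to_hm "$whole")"       # prints "whole task:       19h 20m"
```

Under these assumptions the remaining time is close to the "Time left" shown on the console, and the whole task would need well under a day, so most of the 7 days cannot have been spent processing events.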

Luigi R. Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0	Message 40346 - Posted: 31 Oct 2019, 16:17:29 UTC - in response to Message 40287. Last modified: 31 Oct 2019, 16:18:10 UTC
I manually aborted 1 task that was doing nothing: https://lhcathome.cern.ch/lhcathome/result.php?resultid=250466685
Why was this task not aborted by the client (e.g. exit status: 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT)?
2019-10-31 14:38:31 (2419): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-31 14:38:31 (2419): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2019-10-31 14:38:31 (2419): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!
2 other tasks are still running. ;)
2019-10-31 03:26:08 (30814): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-31 03:26:08 (30814): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2019-10-31 03:26:11 (30814): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2019-10-31 17:06:19 (1960): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-31 17:06:19 (1960): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2019-10-31 17:06:23 (1960): Guest Log: Probing /cvmfs/grid.cern.ch... OK
New task running.
2019-10-31 13:57:56 (12291): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... OK
2019-10-31 13:57:59 (12291): Guest Log: Probing /cvmfs/grid.cern.ch... OK
Third probing try missing. :)
ID: 40346

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 40352 - Posted: 4 Nov 2019, 1:55:47 UTC - in response to Message 40346.
When the VM is booted, the following happens:
- A small script is automatically run
- This script checks CVMFS by running the probe command
- If the probe fails, the probe logs the messages like you saw but then continues (I changed this not to fail the job because the probe failure can be temporary)
- The script then copies another script (the "bootstrap script") from CVMFS and runs it. The bootstrap script takes care of setting everything up for the task and then starting the real work
It is done like this so that we can make changes simply by putting a new version of the bootstrap script on CVMFS instead of having to create a new VM image and app version each time.
I think what is happening in your case is that there is a problem with CVMFS which causes the copy of the bootstrap script to hang forever. I can put a timeout around this to avoid blocking the task forever, but I'll need to make changes in the VM and make a new app version. I will look into it next week since I am away at a conference this week.
ID: 40352
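The timeout David mentions could look roughly like the sketch below. It assumes that the GNU coreutils `timeout` command is available inside the VM. The helper name `run_with_timeout`, the 300-second limit and the paths in the usage comment are hypothetical, since the actual start script is not shown in this thread.

```bash
#!/bin/bash
# Sketch only: run_with_timeout and the values in the usage comment are
# hypothetical and are not taken from the ATLAS VM image.

# run_with_timeout LIMIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND and sends it SIGTERM if it is still running after LIMIT_SECONDS.
# Returns COMMAND's exit status, or 124 if the time limit was hit.
run_with_timeout() {
    local limit="$1"
    shift
    timeout "$limit" "$@"
}

# Possible use in the start script (placeholder paths):
#   if ! run_with_timeout 300 cp /cvmfs/<repository>/<bootstrap script> /tmp/bootstrap.sh; then
#       echo "Copying the bootstrap script failed or timed out" >&2
#       exit 1
#   fi
```

A process blocked in uninterruptible I/O may not react to SIGTERM. In that case `timeout -k <seconds>` can follow up with SIGKILL, though whether that is enough for a hung CVMFS mount would have to be tested.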

Luigi R. Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0	Message 40354 - Posted: 4 Nov 2019, 9:42:15 UTC - in response to Message 40352.
"- If the probe fails, the probe logs the messages like you saw but then continues (I changed this not to fail the job because the probe failure can be temporary)"
So, does the script check it again, or does something else try to download a job all the same? What's the interval?
"I think what is happening in your case is that there is a problem with CVMFS which causes the copy of the bootstrap script to hang forever. I can put a timeout around this to avoid blocking the task forever, but I'll need to make changes in the VM and make a new app version. I will look into it next week since I am away at a conference this week."
Yeah, thank you. Otherwise I can write a bash script that parses stderr.txt and automatically aborts the affected task when three "Probing /cvmfs/*... Failed!" lines are found (and those three lines must be consecutive).
ID: 40354

Luigi R. Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0	Message 40356 - Posted: 4 Nov 2019, 20:03:05 UTC - in response to Message 40354. Last modified: 4 Nov 2019, 20:30:33 UTC
"Yeah, thank you. Otherwise I can write a bash script that parses stderr.txt and automatically aborts the affected task when three "Probing /cvmfs/*... Failed!" lines are found ~~(and those three lines must be consecutive)~~."
Ok, it should work.
#!/bin/bash
# Suspend ATLAS tasks whose stderr.txt shows three or more failed CVMFS probes.
boinc_path="/home/luis/Applicazioni/boinc"
lhc_project_url="https://lhcathome.cern.ch/lhcathome/"
atlas_app_name="ATLAS"
boinccmd="./boinccmd"
#
# Print the text of the first <tag>...</tag> element found in a file.
function xmlValue() {
    sed -n "s|.*<$1>\([^<]*\)</$1>.*|\1|p" "$2" | head -n 1
}
#
# Succeed (return 0) if the given slot directory holds an ATLAS task.
function isAtlasTask() {
    local init_data="$1/init_data.xml"
    [ -e "$init_data" ] && [[ "$(xmlValue app_name "$init_data")" == "$atlas_app_name" ]]
}
#
for slot in "$boinc_path"/slots/*/; do
    slot=${slot%/}
    stderr="$slot/stderr.txt"
    if isAtlasTask "$slot" && [ -e "$stderr" ]; then
        # Count the lines that report a failed probe.
        c=0
        while IFS= read -r line; do
            if [[ "$line" == *"Probing /cvmfs/"*"... Failed!" ]]; then
                c=$((c+1))
            fi
        done < "$stderr"
        echo "$c probing fails found in $stderr"
        if [ "$c" -ge 3 ]; then
            task_name=$(xmlValue result_name "$slot/boinc_task_state.xml")
            (cd "$boinc_path" && $boinccmd --task "$lhc_project_url" "$task_name" suspend) #abort
            echo "$task_name suspended!" #aborted!"
        fi
    fi
done
When my script finds 3 probing fails, it suspends the affected ATLAS task.
Edit the boinc_path variable to match your BOINC path.
Edit the boinccmd variable depending on whether you have a standalone or service BOINC client.
Delete "suspend" and uncomment "abort" if you want a more destructive behaviour.
Call this script every 15 minutes with a command line like this:
watch -n 900 /your_script_path/AtlasProbingFailedCheck.sh
ID: 40356

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2636 Credit: 274,344,010 RAC: 105,071	Message 40357 - Posted: 4 Nov 2019, 21:40:41 UTC - in response to Message 40356.
Nice little script, but it should be investigated why CVMFS fails.
CVMFS configuration usually lists a couple of servers to be used either
- as a main server followed by a couple of spare servers if the main server fails, or
- as a list of servers to be tested by the CVMFS geolocation API. In this case the nearest server will be used.
Typical config entries:
/etc/cvmfs/default.local
CVMFS_REPOSITORIES="atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch,cernvm-prod.cern.ch"
/etc/cvmfs/domain.d/cern.ch.local
CVMFS_SERVER_URL="http://s1cern-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1ral-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1bnl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1fnal-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1unl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1asgc-cvmfs.openhtc.io:8080/cvmfs/@fqrn@;http://s1ihep-cvmfs.openhtc.io/cvmfs/@fqrn@"
# set to 'yes' activates the geo API, set to 'no' deactivates it
CVMFS_USE_GEOAPI=yes
It should be checked (by the project team) whether CVMFS_SERVER_URL lists at least 4 servers. Then it's very unlikely that all of them fail at the same moment.
Client side issues could be:
- wrong firewall settings, e.g. closed ports or filtered destinations
- slow DNS resolving
- high load on the router (not the same as high bandwidth usage!) that causes timeouts
ID: 40357
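The "at least 4 servers" check can be done by counting the ';'-separated entries of CVMFS_SERVER_URL. The sketch below is one way to do that in bash; the function name `count_cvmfs_servers` is made up for this example, and the sample value is the one from the post above rather than a value read from a live configuration.

```bash
#!/bin/bash
# count_cvmfs_servers URL_LIST
# Prints the number of non-empty ';'-separated entries in URL_LIST.
count_cvmfs_servers() {
    local entries entry count=0
    IFS=';' read -r -a entries <<< "$1"
    for entry in "${entries[@]}"; do
        [ -n "$entry" ] && count=$((count + 1))
    done
    echo "$count"
}

# Sample value copied from the config entries quoted in the thread.
CVMFS_SERVER_URL="http://s1cern-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1ral-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1bnl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1fnal-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1unl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1asgc-cvmfs.openhtc.io:8080/cvmfs/@fqrn@;http://s1ihep-cvmfs.openhtc.io/cvmfs/@fqrn@"

n=$(count_cvmfs_servers "$CVMFS_SERVER_URL")
if [ "$n" -ge 4 ]; then
    echo "$n servers listed"
else
    echo "only $n servers listed, fewer than 4"
fi
```

With the sample value this reports 7 servers. To check a real setup, the value would have to come from the configuration actually used inside the VM, which is not shown in this thread.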

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1449 Credit: 9,730,720 RAC: 372	Message 40361 - Posted: 6 Nov 2019, 10:49:18 UTC - in response to Message 40146.
"Your logfile shows that the task started/paused several times and couldn't successfully write its snapshot"
"I crunch with my notebook, so I turn it off sometimes. But why no snapshot if we use VirtualBox??"
I suspended a task for over 20 hours (saved to disk) and after the resume it returned a successful result.
2019-11-04 14:43:55 (10352): Stopping VM.
2019-11-05 10:51:32 (10892): Detected: vboxwrapper 26197
https://lhcathome.cern.ch/lhcathome/result.php?resultid=250890961
Taking a snapshot must not take longer than 60 seconds; otherwise the VM state will be 'aborted' (not saved) and after a resume the task starts from scratch (if you're lucky).
ID: 40361

LHC@home