Message boards :
ATLAS application :
ATLAS vbox version 2.00
Author | Message |
---|---|
Joined: 28 Sep 04 Posts: 728 Credit: 48,821,567 RAC: 21,894 |
LHC has stopped sending me ATLAS tasks. The last one was sent on 17 October. The host is running an older version of VBox (5.1.30), but that shouldn't be a problem: it ran a dozen or so version 2.00 tasks successfully before it stopped receiving them. I get Theory, CMS and sixtracktest tasks without a problem, but ATLAS requests are rejected with the response 'No ATLAS tasks available', despite continuous requests and even though the server status page shows plenty available and over 10000 in progress. This is the host in question: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10509390 |
Joined: 2 May 07 Posts: 2240 Credit: 173,894,884 RAC: 3,092 |
Harri, the ATLAS VM has more than one problem getting work to us. One test in -dev was to use only 6.0.x for the new CentOS image. Now it is 5.2.32; you can see this in the log of an older finished task. The main problem at the moment is that a change made on Thursday was reverted to the previous version. David is away this weekend, so we have to wait until Monday for this to be cleared up. Native ATLAS is running at the moment. |
Joined: 15 Jun 08 Posts: 2528 Credit: 253,722,201 RAC: 56,522 |
The recent ATLAS vdi includes VBoxGuestAdditions 5.2.32, which should work even with a more recent VirtualBox version on the host, although it is recommended to keep guest and host in sync. What puzzles me: the timestamps of /opt/VBoxGuestAdditions-5.2.32 and everything below it inside the vdi are 2019-09-12, while David introduced v2.0 on 2019-10-09, including a new Linux kernel. I suspect that the vdi's guest additions need to be recompiled to fit the new kernel. |
Joined: 2 May 07 Posts: 2240 Credit: 173,894,884 RAC: 3,092 |
If we need a new .vdi for this problem: CentOS 7 has a kernel upgrade from 3.10.0-957.27.2.el7.x86_64 to 3.10.0-1062.1.2.el7.x86_64. |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
The vdi is at kernel version 3.10.0-957.27.2.el7.x86_64, and this is the kernel with which the vbox additions were compiled. I think Jonathan's stuck task was a victim of the "top" change that was enabled on Thursday and reverted on Friday, rather than a VirtualBox version problem. I'm trying to figure out what the problem was with that change. The other problem at the moment is that the server is not giving out many tasks: I had to click a few times on my client and finally got a single task. But this one is working as normal. |
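Whether guest additions were built for the running kernel can be checked from inside a guest by comparing the kernel release with the `vermagic` string of the guest module. The following is only a minimal sketch: the helper name `kernel_matches_vermagic` is made up for illustration, the `vermagic` flags in the example call are illustrative values, and it assumes the module is called `vboxguest` and that `modinfo` is available in the image.

```bash
# Hypothetical helper: succeeds if a module's vermagic string was built
# for the given kernel release. The first word of vermagic is the release.
kernel_matches_vermagic() (
    kernel_release="$1"
    vermagic="$2"
    built_for="${vermagic%% *}"   # keep everything before the first space
    [ "$kernel_release" = "$built_for" ]
)

# Example using the kernel version quoted in this thread; the flags after
# the release ("SMP mod_unload modversions") are illustrative only.
if kernel_matches_vermagic "3.10.0-957.27.2.el7.x86_64" \
        "3.10.0-957.27.2.el7.x86_64 SMP mod_unload modversions"; then
    echo "guest additions match the kernel"
else
    echo "guest additions were built for another kernel"
fi
```

Inside the VM the call would presumably look like `kernel_matches_vermagic "$(uname -r)" "$(modinfo -F vermagic vboxguest)"`.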
Joined: 2 Sep 04 Posts: 455 Credit: 201,083,954 RAC: 31,440 |
"Please upgrade to 6.0.x. (with ExtPack)" David, which version of VirtualBox do you want us to use for V2? The latest 6.x or 5.x? Supporting BOINC, a great concept! |
Joined: 7 May 08 Posts: 217 Credit: 1,575,053 RAC: 400 |
OK, I've installed the latest version of VirtualBox and the latest extensions, and now it seems to be OK: 2019-10-21 17:05:28 (5416): Guest Log: HITS file was successfully produced |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
"Please upgrade to 6.0.x. (with ExtPack)" The answer is: use whatever works :) I myself have tested successfully with 5.2.32 and 6.0.12 on Linux, but I don't have the means to test all combinations of versions and operating systems. If you find a version that works for you, then it's fine to stick with it. The image itself was created with 5.2.32 because problems were reported during the testing on LHC-dev when an image from 6.0 was used. |
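When reporting which VirtualBox version works, it can help to state the release series (5.1, 5.2, 6.0) of the host installation. A small sketch of how the output of `VBoxManage --version` could be reduced to that series; `vbox_series` is a hypothetical helper name, and the version string in the example is only a sample of the usual "version plus r-revision" form.

```bash
# Hypothetical helper: reduce a VBoxManage version string such as
# "6.0.12r133076" to its release series ("6.0").
vbox_series() (
    version="${1%%r*}"       # drop the "r<revision>" suffix, if present
    major="${version%%.*}"   # text before the first dot
    rest="${version#*.}"     # text after the first dot
    minor="${rest%%.*}"      # text before the next dot
    echo "$major.$minor"
)

# Sample input; on a real host this would be "$(VBoxManage --version)".
vbox_series "6.0.12r133076"
```

On a host with VirtualBox installed, `vbox_series "$(VBoxManage --version)"` would print the series to quote in a report.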
Joined: 27 Sep 08 Posts: 844 Credit: 690,881,892 RAC: 110,915 |
Most of my computers are on 5.1.38. If I run my computer at 1 core/WU, this is more reliable than the newer versions. I loaded 6.0.14 to see what happens; 6.0.12 is reliable running Theory and CMS as long as the CPU use is about 75%. |
Joined: 9 Aug 05 Posts: 36 Credit: 7,698,293 RAC: 0 |
Is there a run time limit for the ATLAS tasks? I've got 3 of them that have been running for 7 days. |
Joined: 14 Jan 10 Posts: 1417 Credit: 9,440,106 RAC: 1,109 |
7 days of continuous running is too long for a multi-core ATLAS task. Consider aborting the tasks and trying new ones with the improved Console monitoring. |
Joined: 14 Feb 14 Posts: 5 Credit: 17,818,305 RAC: 0 |
I've got 2 ATLAS (1T) tasks, each of them running for over 7 days. Console shows:
Total number of events to be processed: 200
Total number of events already finished: 54
Time left: 0d 14h 38m
---
Last finished... worker 1: Event nr. 54 took 397.5 s. New average 348.4 +- 9.882
top shows athena.py @ 99% CPU |
Joined: 14 Feb 14 Posts: 5 Credit: 17,818,305 RAC: 0 |
15 minutes later the number of jobs finished increased to 57. Wondering what the machine has been doing during the past days... |
Joined: 15 Jun 08 Posts: 2528 Credit: 253,722,201 RAC: 56,522 |
"I've got 2 Atlas (1T) tasks, each of them running for over 7 days. Console shows:" The "New average" shows "+- 9.882". It is very unusual for this value to be that large. It indicates that there must have been at least one long runner (a very, very long runner). The value is logged directly by the scientific app. All other values look normal, especially "athena.py @ 99% CPU", but there's still a bit of work to do. I would let it run until it finishes or hits the BOINC due date. <edit> Sorry, my fault. It's a "." instead of a ",". A typical German misinterpretation. But then you are right to ask what the machine has been doing. </edit> |
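The console figures quoted above allow a rough plausibility check: multiplying an event count by the reported average time per event gives the computing time those events account for. This is only a back-of-the-envelope sketch; `events_to_duration` is a hypothetical helper, and the simple product is an assumption, not the formula the ATLAS app itself uses for "Time left".

```bash
# Hypothetical helper: events * average seconds per event, printed as
# days, hours and minutes (seconds are truncated).
events_to_duration() (
    awk -v n="$1" -v avg="$2" 'BEGIN {
        s = int(n * avg)
        printf "%dd %dh %dm\n", int(s / 86400), int((s % 86400) / 3600), int((s % 3600) / 60)
    }'
)

# Figures from the console output: 200 events in total, 54 finished,
# average 348.4 s per event.
echo "Computing time behind 54 finished events: $(events_to_duration 54 348.4)"
echo "Estimate for the remaining 146 events:    $(events_to_duration 146 348.4)"
```

With these inputs the 54 finished events account for about 5 hours of computing, and the remaining 146 come out at roughly 14 hours, close to the 14h 38m the console reports. Set against more than 7 days of wall-clock time, that supports the question of what the machine was doing for the rest of the period.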
Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0 |
I manually aborted one task that was doing nothing: https://lhcathome.cern.ch/lhcathome/result.php?resultid=250466685
Why was this task not aborted by the client (e.g. exit status: 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT)?
2019-10-31 14:38:31 (2419): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2 other tasks are still running. ;)
2019-10-31 03:26:08 (30814): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-31 17:06:19 (1960): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
New task running.
2019-10-31 13:57:56 (12291): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... OK
Third probing try missing. :) |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
When the VM is booted, the following happens:
- A small script is automatically run
- This script checks CVMFS by running the probe command
- If the probe fails, the probe logs messages like the ones you saw but then continues (I changed this not to fail the job because the probe failure can be temporary)
- The script then copies another script (the "bootstrap script") from CVMFS and runs it. The bootstrap script takes care of setting everything up for the task and then starting the real work
It is done like this so that we can make changes simply by putting a new version of the bootstrap script on CVMFS instead of having to create a new VM image and app version each time. I think what is happening in your case is that there is a problem with CVMFS which causes the copy of the bootstrap script to hang forever. I can put a timeout around this to avoid blocking the task forever, but I'll need to make changes in the VM and make a new app version. I will look into it next week since I am away at a conference this week. |
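The idea of putting a timeout around the copy step could look roughly like the sketch below, which uses the coreutils `timeout` command. This is not the project's actual boot script: `run_with_timeout` is a hypothetical wrapper, the limits are arbitrary, and a plain `sleep` stands in for the hanging copy because the real bootstrap path is not given in this thread.

```bash
# Hypothetical wrapper: run a command, but give up after a time limit
# in seconds. coreutils timeout exits with status 124 when the limit
# is hit.
run_with_timeout() (
    limit="$1"
    shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "gave up after ${limit}s: $*" >&2
    fi
    exit "$rc"
)

# Stand-in for a copy that hangs: sleep 5 is cut off after 1 second.
if ! run_with_timeout 1 sleep 5; then
    echo "copy step did not finish; the task could fail or retry here"
fi
```

Failing the task with a clear message after the limit lets BOINC fetch new work, instead of leaving the slot blocked for days.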
Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0 |
"- If the probe fails, the probe logs the messages like you saw but then continues (I changed this not to fail the job because the probe failure can be temporary)" So, does the script check it again, or does something else try to download a job all the same? What's the interval? "I think what it happening in your case is that there is a problem with CVMFS which causes the copy of the bootstrap script to hang forever. I can put a timeout around this to avoid blocking the task forever but I'll need to make changes in the VM and make a new app version. I will look into it next week since I am away at a conference this week." Yeah, thank you. Otherwise I can write a bash script that parses stderr.txt and automatically aborts the affected task when three "Probing /cvmfs/*... Failed!" messages are raised (and those three lines must be consecutive). |
Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0 |
"Yeah, thank you. Otherwise I can write a bash script that parses stderr.txt and automatically aborts the concerning task when three "Probing /cvmfs/*... Failed!" are raised" Ok, it should work.

#!/bin/bash
# Suspends ATLAS tasks whose stderr.txt contains 3 or more CVMFS probe failures.

boinc_path="/home/luis/Applicazioni/boinc"
lhc_project_url="https://lhcathome.cern.ch/lhcathome/"
atlas_app_name="ATLAS"
boinccmd="./boinccmd"

# Returns 0 (success) if the given slot directory holds an ATLAS task.
isAtlasTask() {
    local init_data="$1/init_data.xml"
    [ -e "$init_data" ] || return 1
    local app_name
    app_name=$(sed -n 's|[^<]*<app_name>\([^<]*\)</app_name>[^<]*|\1\n|gp' "$init_data")
    [[ "$app_name" == "$atlas_app_name" ]]
}

# Walk the slot directories that actually exist; their numbers need not
# be contiguous.
for slot in "$boinc_path"/slots/*/; do
    slot="${slot%/}"
    isAtlasTask "$slot" || continue
    stderr="$slot/stderr.txt"
    [ -r "$stderr" ] || continue

    # Count the probe failures logged so far.
    c=0
    while IFS= read -r line; do
        if [[ "$line" == *"Probing /cvmfs/"*"... Failed!" ]]; then
            c=$((c+1))
        fi
    done < "$stderr"
    echo "$c probing fails found in $stderr"

    if [ "$c" -ge 3 ]; then
        boinc_task_state="$slot/boinc_task_state.xml"
        task_name=$(sed -n 's|[^<]*<result_name>\([^<]*\)</result_name>[^<]*|\1\n|gp' "$boinc_task_state")
        cd "$boinc_path" && "$boinccmd" --task "$lhc_project_url" "$task_name" suspend #abort
        echo "$task_name suspended!" #aborted!"
    fi
done

When my script finds 3 probing fails, it suspends the affected ATLAS task. Edit the boinc_path variable to point to your BOINC path. Edit the boinccmd variable depending on whether you have a standalone or a service BOINC client. Delete "suspend" and uncomment "abort" if you want more destructive behaviour. Call this script every 15 minutes with a command line like this:
watch -n 900 /your_script_path/AtlasProbingFailedCheck.sh |
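The script above counts every probe failure in stderr.txt, whereas the earlier proposal asked for three consecutive failures. A sketch of how the consecutive variant might be counted is shown below. It assumes that "consecutive" means consecutive among the "Probing /cvmfs/..." lines, so that a successful probe resets the run and unrelated log lines are ignored; `max_consecutive_probe_failures` is a hypothetical helper, and the sample log is modelled on the lines quoted in this thread.

```bash
# Hypothetical helper: print the longest run of failed CVMFS probes in a
# log file. Only "Probing /cvmfs/..." lines are considered.
max_consecutive_probe_failures() (
    awk '
        /Probing \/cvmfs\// {
            if ($0 ~ /\.\.\. Failed!$/) {
                run++
                if (run > max) max = run
            } else {
                run = 0      # a probe that did not fail resets the run
            }
        }
        END { print max + 0 }
    ' "$1"
)

# Sample log written to a temporary file for demonstration.
sample=$(mktemp)
printf '%s\n' \
    '2019-10-31 03:26:08 (30814): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!' \
    '2019-10-31 03:26:09 (30814): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... OK' \
    '2019-10-31 03:27:10 (30814): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!' \
    '2019-10-31 03:28:11 (30814): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!' \
    > "$sample"
echo "longest run of failed probes: $(max_consecutive_probe_failures "$sample")"
rm -f "$sample"
```

In the script above, the result of this function could replace the counter `c` before the comparison against 3.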
Joined: 15 Jun 08 Posts: 2528 Credit: 253,722,201 RAC: 56,522 |
Nice little script, but it should be investigated why CVMFS fails. The CVMFS configuration usually lists a couple of servers to be used either
- as a main server followed by a couple of spare servers if the main server fails, or
- as a list of servers to be tested by the CVMFS geolocation API. In this case the nearest server will be used.
Typical config entries:
/etc/cvmfs/default.local
CVMFS_REPOSITORIES="atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch,cernvm-prod.cern.ch"
/etc/cvmfs/domain.d/cern.ch.local
CVMFS_SERVER_URL="http://s1cern-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1ral-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1bnl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1fnal-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1unl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1asgc-cvmfs.openhtc.io:8080/cvmfs/@fqrn@;http://s1ihep-cvmfs.openhtc.io/cvmfs/@fqrn@"
# set to 'yes' activates the geo API, set to 'no' deactivates it
CVMFS_USE_GEOAPI=yes
It should be checked (by the project team) whether CVMFS_SERVER_URL lists at least 4 servers. Then it's very unlikely that all of them fail at the same moment.
Client-side issues could be:
- wrong firewall settings, e.g. closed ports or filtered destinations
- slow DNS resolution
- high load on the router (not the same as high bandwidth usage!) that causes timeouts |
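The "at least 4 servers" check suggested above can be sketched as a small shell function that counts the semicolon-separated entries of CVMFS_SERVER_URL in a config file. The helper name `count_cvmfs_servers` is made up for illustration, and the sketch assumes the simple one-line `KEY="value"` form shown in the typical entries; it does not resolve variables or includes the way the CVMFS client itself would.

```bash
# Hypothetical helper: count the servers listed in CVMFS_SERVER_URL in
# the given config file. Prints 0 if the variable is not set there.
count_cvmfs_servers() (
    # Take the last assignment in the file and strip surrounding quotes.
    value=$(sed -n 's/^[[:space:]]*CVMFS_SERVER_URL=//p' "$1" | tail -n 1)
    value="${value#\"}"
    value="${value%\"}"
    [ -n "$value" ] || { echo 0; exit 0; }
    # One URL per line, then count the non-empty lines.
    printf '%s\n' "$value" | tr ';' '\n' | grep -c .
)

# Path taken from the typical config entries; skipped if it is absent.
conf=/etc/cvmfs/domain.d/cern.ch.local
if [ -r "$conf" ]; then
    n=$(count_cvmfs_servers "$conf")
    if [ "$n" -ge 4 ]; then
        echo "$n servers listed"
    else
        echo "only $n server(s) listed; a single outage is more likely to hit"
    fi
fi
```

Run against the typical CVMFS_SERVER_URL entry quoted above, the count would be 7.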
Joined: 14 Jan 10 Posts: 1417 Credit: 9,440,106 RAC: 1,109 |
"I suspended a task for over 20 hours (saved to disk) and after the resume it returned a result with success." "Your logfile shows that the task started/paused several times and couldn't successfully write its snapshot." "I crunch with my notebook, so I turn it off sometimes."
2019-11-04 14:43:55 (10352): Stopping VM.
2019-11-05 10:51:32 (10892): Detected: vboxwrapper 26197
https://lhcathome.cern.ch/lhcathome/result.php?resultid=250890961
Taking a snapshot may not exceed 60 seconds; otherwise the VM state will be 'aborted' (not saved), and after a resume the task starts from scratch (if you're lucky). |
©2024 CERN