Message boards :
ATLAS application :
ATLAS native version 2.73
Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0
Thanks, that could be what I need. I will check this.
Joined: 13 Jul 05 Posts: 169 Credit: 14,965,342 RAC: 949
> If I understand what CVMFS does, it just mounts a remote filesystem locally.

No, it also creates a local, client-side cache at $CVMFS_CACHE_BASE and actively manages it to stay within $CVMFS_QUOTA_LIMIT. If CVMFS is using the defaults and "du -hs /var/lib/cvmfs" reports only 37 MB, then something is broken - is something overriding the defaults? Usage should roughly match the CACHEUSE column from "cvmfs_config stat".

I get:

~ > md5sum -b /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img

and I don't see anything about "rootfs" in /tmp on my working ATLAS-native box.
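For readers who want to cross-check the cache figures discussed above, here is a minimal sketch. It assumes a standard CVMFS client install with the atlas.cern.ch repository mounted, and exits quietly on hosts where the CVMFS tools are not present:

```shell
#!/bin/sh
# Sketch: cross-check CVMFS cache usage against what du reports.
# Assumes a default client setup (cache under /var/lib/cvmfs);
# exits quietly if the cvmfs tools are not installed.
if ! command -v cvmfs_config >/dev/null 2>&1; then
    echo "cvmfs_config not found; CVMFS client not installed"
    exit 0
fi
# Per-repository statistics; CACHEUSE is the cache usage in KB:
cvmfs_config stat atlas.cern.ch
# Effective cache location and quota for the repository:
cvmfs_config showconfig atlas.cern.ch | grep -E 'CVMFS_(CACHE_BASE|QUOTA_LIMIT)'
# Should roughly agree with the CACHEUSE figure above:
du -hs "${CVMFS_CACHE_BASE:-/var/lib/cvmfs}"
```

If du reports far less than CACHEUSE (as in the 37 MB case above), the cache is likely configured somewhere other than the default location.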
Joined: 7 Feb 14 Posts: 99 Credit: 5,180,005 RAC: 0
On Xubuntu 14.04.6 I tried to run 4 single-thread tasks and it didn't work, although "cvmfs_config probe" returned 6 OKs. The first task started smoothly: https://lhcathome.cern.ch/lhcathome/result.php?resultid=254625003

The other tasks then got this error:

check cvmfs return values are 0, 256
CVMFS not found, aborting the job

and failed: https://lhcathome.cern.ch/lhcathome/result.php?resultid=254625437 By then "cvmfs_config probe" was returning only 1 or 2 OKs. This system has a problem with libseccomp too; I will try to fix it with the suggested solutions: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4972&postid=40840#40840

On Xubuntu 18.04.3 I have now started 4 tasks simultaneously and they are all running fine - there are indeed 4 athena.py processes, and "cvmfs_config probe" returns 6 OKs as expected.
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
I have about a 50% success rate setting up native ATLAS on any given Ubuntu 18.04.3 machine; that is, sometimes it works and sometimes it doesn't. I use the same setup routine I always use (it is copy and paste), except that now I don't install Singularity, but just use the CVMFS version.

My most recent (and strangest) case is a Ryzen 2600 on which I did a clean install of Ubuntu 18.04 yesterday. Native ATLAS did not work. But I happened to set up an additional BOINC instance on that machine, so I thought I would try native ATLAS on BOINC2. It worked. It just gets curiouser and curiouser (don't try to translate that from English). I have the feeling that it has something to do with permissions, but I have granted every permission I know of.
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
> I have the feeling that it has something to do with permissions, but I have granted every permission I know of.

One thing I have found is that none of the files in the BOINC2 folder are shown as "locked", while all of them in the BOINC folder (except for "ATLAS_job_2.54_x86_64-pc-linux-gnu.xml") are. I can remove the lock with "sudo chmod -R 777 /var/lib/boinc-client", but that does not make them run, and aborting those jobs and rebooting simply causes the lock to reappear on any newly downloaded files. I run BOINC 7.16.3 from LocutusOfBorg, granting permissions that work on all other projects (LHC and otherwise). I think the problem needs to be addressed at the LHC end.
Joined: 15 Jun 08 Posts: 2443 Credit: 231,217,416 RAC: 121,378
How did you organize your filesystem/mountpoints? What do you mean by "lock"? Does it mean the user that tries to read a folder has no read/execute permission? Did you check whether a username/groupname is missing (old host/new host)? Do all users/groups (old host/new host) have the same IDs?
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
> How did you organize your filesystem/mountpoints?

The standard Ubuntu 18.04.3 installation, along with BOINC 7.16.3 as mentioned. "Lock" means a padlock is shown on the file. The permissions show the owner as the BOINC core client (read and write) and the group as "boinc" (read-only); I am a member of the "boinc" group. EDIT: I am also a member of "root". The "others" access is read-only. The files have execute permission.

BOINC: Computer ID 10625929 https://lhcathome.cern.ch/lhcathome/results.php?hostid=10625929
BOINC2: Computer ID 10626020 https://lhcathome.cern.ch/lhcathome/results.php?hostid=10626020
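As an aside, the "padlock" shown by the file manager is just a rendering of the POSIX mode bits, which can be inspected directly with stat. A minimal illustration on a throwaway file (not the actual BOINC tree):

```shell
#!/bin/sh
# Illustration: what a "locked" file looks like at the permission level.
# The padlock icon simply means the current user lacks write permission.
f=$(mktemp)
chmod 640 "$f"                 # owner rw, group read-only, others nothing
stat -c '%A %U %G' "$f"        # mode is "-rw-r-----"; owner/group vary
# A member of the group can read but not write such a file.
# "chmod -R 777" removes the restriction but is far broader than needed;
# granting group write is usually enough:
chmod 660 "$f"
stat -c '%A' "$f"              # "-rw-rw----"
rm -f "$f"
```

Running "stat -c '%A %U %G'" on the files that show the padlock would pin down exactly which permission bit differs between the BOINC and BOINC2 trees.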
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
I am wondering whether BOINC 7.16.3 changed the permissions. I know I had native ATLAS working on BOINC 7.14.2 installations with no problems, and it seems that when I upgraded to 7.16.3 things started falling apart. Unfortunately, I have no way to test that now, since I don't have any of the earlier installations left on my machines. Don't try to figure out a mess. Thanks for any input though.
Joined: 20 Nov 19 Posts: 21 Credit: 1,074,330 RAC: 0
Users and groups can share names across multiple systems even if they have different UIDs and GIDs locally. Perhaps that is part of the issue? EDIT: Now that I think about it a bit more, isn't "boinc" a system group? If so, you shouldn't be able to access its data. See "man pam", "man shadow", and "man login.defs" for more info.
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
> EDIT: Now that I think about it a bit more, isn't "boinc" a system group? If so, you shouldn't be able to access its data. See "man pam", "man shadow", and "man login.defs" for more info.

That is entirely possible. I will be doing a new installation in a couple of days and will try BOINC 7.9.3. Maybe I can spot the difference with 7.16.3. It could be something else.
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
> I will be doing a new installation in a couple of days and will try BOINC 7.9.3. Maybe I can spot the difference with 7.16.3. It could be something else.

Well, I was able to get BOINC 7.9.3 working with native ATLAS after I granted permissions on the boinc-client folder with "sudo chmod -R 777 /var/lib/boinc-client". The first group of work units (before that change) errored out, but after a reboot following the change, the next group worked properly.

Also, I noted that before granting the permissions there were 12 locked files in the boinc-client folder. That was before I attached to any project, so I would call them "system" files for BOINC. After granting the permissions (and rebooting), only four locked files are left (all the others are unlocked), plus a couple of locked files for each project. I compared that to BOINC 7.16.3, where all the "system" files remain locked after a reboot following the granting of the permissions. So it is apparent that 7.16.3 locks down more files. It seems that LHC and BOINC need to get together on this.
Joined: 15 Jun 08 Posts: 2443 Credit: 231,217,416 RAC: 121,378
Which files are locked? Be so kind as to post the (filtered) output of "lslocks".
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
> Be so kind as to post the (filtered) output of "lslocks".

Here it is for the BOINC 7.9.3 machine, where I am running both native ATLAS and CMS as my only BOINC projects. I am no longer running LHC on the 7.16.3 machine, but could (in BOINC2) if you need it.

$ lslocks
COMMAND         PID   TYPE  SIZE  MODE  M START      END        PATH
wrapper_26015_x 1296  POSIX       WRITE 0 0          0          /...
vboxwrapper_261 1293  POSIX       WRITE 0 0          0          /...
update-notifier 9602  FLOCK 0B    WRITE 0 0          0          /run/user/1000/u
wrapper_26015_x 1291  POSIX       WRITE 0 0          0          /...
nvidia-persiste 964   POSIX       WRITE 0 0          0          /run...
cvmfs2          2002  POSIX       WRITE 0 1073741824 1073742335 /...
gnome-shell     2356  POSIX 1.1K  WRITE 0 0          0          /home/jim/.nv/GL
gnome-shell     2356  POSIX 18.5K WRITE 0 0          0          /home/jim/.nv/GL
vboxwrapper_261 1300  POSIX       WRITE 0 0          0          /...
smbd-notifyd    7006  POSIX       WRITE 0 0          0          /run...
lpqd            7008  POSIX       WRITE 0 0          0          /run...
smbd            6339  POSIX       READ  0 4          4          /run...
smbd            6339  POSIX       READ  0 4          4          /run...
smbd            6339  POSIX       READ  0 4          4          /run...
(unknown)       18914 FLOCK       WRITE 0 0          0          /...
rpcbind         905   FLOCK       WRITE 0 0          0          /run...
nmbd            1068  POSIX       READ  0 4          4          /run...
wrapper_26015_x 1290  POSIX       WRITE 0 0          0          /...
vboxwrapper_261 1299  POSIX       WRITE 0 0          0          /...
VBoxXPCOMIPCD   1419  POSIX       WRITE 0 0          0          /...
(unknown)       2533  FLOCK       WRITE 0 0          0          /...
whoopsie        1061  FLOCK       WRITE 0 0          0          /run/lock...
FAHClient       1255  POSIX       WRITE 0 1073741824 1073742335 /...
cvmfs2          2002  FLOCK       WRITE 0 0          0          /...
cvmfs2          2002  FLOCK       WRITE 0 0          0          /...
FAHClient       1253  POSIX       WRITE 0 0          0          /run...
wrapper_26015_x 1295  POSIX       WRITE 0 0          0          /...
smbd            6339  POSIX       READ  0 4          4          /run...
smbd            6339  POSIX       WRITE 0 0          0          /run...
smbd            6339  POSIX       WRITE 0 0          0          /run...
smbd            6339  POSIX       READ  0 4          4          /run...
smbd            6339  POSIX       READ  0 4          4          /run...
smbd            6339  POSIX       READ  0 4          4          /run...
smbd            6339  POSIX       READ  0 4          4          /run...
smbd            6339  POSIX       READ  0 4          4          /run...
smbd            6339  POSIX       READ  0 4          4          /run...
smbd            6339  POSIX       READ  0 4          4          /run...
cleanupd        7007  POSIX       READ  0 4          4          /run...
cleanupd        7007  POSIX       READ  0 4          4          /run...
cron            937   FLOCK       WRITE 0 0          0          /run...
wrapper_26015_x 1297  POSIX       WRITE 0 0          0          /...
vboxwrapper_261 1294  POSIX       WRITE 0 0          0          /...
systemd-timesyn 904   FLOCK       WRITE 0 0          0          /run...
nmbd            1068  POSIX       READ  0 4          4          /run...
nmbd            1068  POSIX       WRITE 0 0          0          /run...
nmbd            1068  POSIX       WRITE 0 0          0          /run...
FahCore_21      1269  POSIX       WRITE 0 0          0          /...
wrapper_26015_x 1292  POSIX       WRITE 0 0          0          /...
vboxwrapper_261 1298  POSIX       WRITE 0 0          0          /...
cleanupd        7007  POSIX       WRITE 0 0          0          /run...
smbd            6339  POSIX       READ  0 4          4          /run...
Joined: 15 Jun 08 Posts: 2443 Credit: 231,217,416 RAC: 121,378
> Here it is for the BOINC 7.9.3 machine, where I am running both native ATLAS and CMS as my only BOINC projects.

Sure? It looks like there are also some locks from Folding@home:

FAHClient 1253 POSIX WRITE 0 0 0 /run...
FahCore_21 1269 POSIX WRITE 0 0 0 /...

Apart from that, the locks set by BOINC processes don't look unusual (wrapper, vboxwrapper, cvmfs2, etc.).

What makes me wonder: do you store some of the BOINC data on Samba shares (smbd, nmbd)? If yes, you may try a different filesystem, e.g. ext3/4 or xfs.

Another issue that should be checked: if you run additional BOINC clients, each of them must have its own working directory.
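The separate-working-directory point can be sketched with the client's --dir option. The directory path and RPC port below are illustrative choices, not a standard layout; the sketch exits quietly where no boinc client is installed:

```shell
#!/bin/sh
# Sketch: run a second BOINC client with its own working directory so the
# two instances do not collide over the same slots/ directory.
# /var/lib/boinc2 and port 31417 are illustrative, not standard values.
if ! command -v boinc >/dev/null 2>&1; then
    echo "boinc client not installed; skipping"
    exit 0
fi
sudo mkdir -p /var/lib/boinc2
sudo chown boinc:boinc /var/lib/boinc2
# Second instance: its own data directory and its own GUI-RPC port
# (the default instance listens on 31416):
sudo -u boinc boinc --dir /var/lib/boinc2 --gui_rpc_port 31417 --daemon
```

With distinct --dir values, each client keeps its own client_state.xml and slots/ tree, which is exactly what prevents the two instances from fighting over slot directories.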
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
Those are my only BOINC projects. I also run FAH, but only on the GPU, and it should not affect BOINC; I have done that on all my machines (both Windows and Ubuntu) for years without problems. No, I don't store the BOINC data elsewhere.

I am not surprised there isn't anything unusual on the BOINC 7.9.3 machine (after I granted the extra permissions); it works. But two questions remain:
1. Why do I have to grant the extra permissions at all?
2. What is wrong with BOINC 7.16.3? That is where the problem lies.

I have never needed additional directories for additional BOINC clients, by the way. I often do it when running both CPU and GPU clients on the same project (e.g., MilkyWay or Einstein), where the BOINC scheduler has (another of) its scheduling problems; it keeps everything nice and separate. It works for LHC too: I used the original BOINC instance for CMS, and BOINC2 for native ATLAS on the 7.16.3 machine. On the BOINC 7.9.3 machine, I can run them both in the same instance. No real surprises once it is set up; it is just the hoops you have to jump through to get there.
Joined: 15 Jun 08 Posts: 2443 Credit: 231,217,416 RAC: 121,378
> Those are my only BOINC projects. I also run FAH, but only on the GPU, and it should not affect BOINC. I have done that on all my machines (both Windows and Ubuntu) for years without problems.

OK.

> I have never needed additional directories ...

... must have its own working directory.

> I used the original BOINC instance for CMS, and BOINC2 for native ATLAS

There you state that you run 2 instances. This requires 2 BOINC working directories; otherwise the client instances would fight against each other over the /slots/ directories.

> What is wrong with BOINC 7.16.3? That is where the problem lies.

Don't know. So far I have never tried 7.16.3.
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
> What is wrong with BOINC 7.16.3? That is where the problem lies.

I see you stayed on 7.14.2. That is a good idea. It is LHC that needs to address it (maybe in conjunction with BOINC); I can only apply band-aids. Thanks for your input.
Joined: 2 May 07 Posts: 2152 Credit: 161,132,926 RAC: 56,492
Atlas-native (one CPU) has now been running for 2 days and 17 hours. What is interesting: a short time per collision, but a long overall duration. The log.EVNTtoHITS currently shows:

07:51:58 AthenaEventLoopMgr INFO ===>>> done processing event #9782937, run #284500 137 events processed so far <<<===
07:51:58 AthenaEventLoopMgr INFO ===>>> start processing event #9782938, run #284500 137 events processed so far <<<===
08:00:44 -------- WWWW ------- G4Exception-START -------- WWWW -------
08:00:44 *** G4Exception : GeomNav1002
08:00:44 issued by : G4Navigator::ComputeStep()
08:00:44 Track stuck or not moving.
08:00:44 Track stuck, not moving for 10 steps
08:00:44 in volume -LArMgr::LAr::EMEC::Neg::InnerWheel- at point (234.576,650.26,-4235.54)
08:00:44 direction: (0.862717,-0.424166,0.275323).
08:00:44 Potential geometry or navigation problem !
08:00:44 Trying pushing it of 1e-07 mm ...Potential overlap in geometry!
08:00:44 *** This is just a warning message. ***
08:00:44 -------- WWWW -------- G4Exception-END --------- WWWW -------
08:01:04 -------- WWWW ------- G4Exception-START -------- WWWW -------
08:01:04 *** G4Exception : GeomNav1002
08:01:04 issued by : G4Navigator::ComputeStep()
08:01:04 Track stuck or not moving.
08:01:04 Track stuck, not moving for 10 steps
08:01:04 in volume -LArMgr::LAr::EMEC::Pos::InnerWheel- at point (-498.896,-367.897,3811.32)
08:01:04 direction: (0.574521,-0.799023,0.177447).
08:01:04 Potential geometry or navigation problem !
08:01:04 Trying pushing it of 1e-07 mm ...Potential overlap in geometry!
08:01:04 *** This is just a warning message. ***
08:01:04 -------- WWWW -------- G4Exception-END --------- WWWW -------
08:17:01 ISFG4SimSvc INFO Event nr. 138 took 1252 s. New average 1398 +- 32.4
08:17:02 AthenaEventLoopMgr INFO ===>>> done processing event #9782938, run #284500 138 events processed so far <<<===
08:17:02 AthenaEventLoopMgr INFO ===>>> start processing event #9782939, run #284500 138 events processed so far <<<===
Joined: 15 Jun 08 Posts: 2443 Credit: 231,217,416 RAC: 121,378
> Atlas-native (one CPU) runs now for 2 days and 17 hours:

Total number of events: 200
Already finished: 138
Average: 1398 s/event

Estimated total calculation time: 200 * 1398 s = 279600 s (3 d 5 h 40 min)
Estimated time left: (200 - 138) * 1398 s = 86676 s (1 d 0 h 5 min)
Uncertainty: (200 - 138) * 32.4 s * 3 = 6027 s (1 h 40 min)

Event calculation times > 1000 s are rather high but not unusual. I'm currently running a task with event calculation times between 70 s and 733 s.

<edit> corrected: uncertainty </edit>
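For anyone who wants to redo the estimate with their own numbers, the arithmetic above fits in one awk call (figures taken from the post):

```shell
# Recompute the runtime estimate: 200 events total, 138 done,
# 1398 s/event average, 32.4 s standard error, 3-sigma band.
awk 'BEGIN {
    total = 200; done = 138; avg = 1398; sigma = 32.4
    printf "total:     %d s\n", total * avg            # 279600 s
    printf "remaining: %d s\n", (total - done) * avg   # 86676 s
    # %d truncates 6026.4 to 6026; the post rounds it to 6027:
    printf "3 sigma:   %d s\n", (total - done) * sigma * 3
}'
```

Swap in the event count and per-event average from your own log.EVNTtoHITS to get an estimate for a running task.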
Joined: 2 May 07 Posts: 2152 Credit: 161,132,926 RAC: 56,492
Perfect, Computezrmle, and this without your RDP program for the ATLAS VM ;-). The deadline for this native task is Dec. 18th, 14:00 UTC.
©2024 CERN