Message boards : ATLAS application : ATLAS native version 2.73

Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 40833 - Posted: 7 Dec 2019, 13:16:45 UTC - in response to Message 40829.  

Thanks, that could be what I need. I will check this.
Henry Nebrensky

Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 40838 - Posted: 7 Dec 2019, 14:48:11 UTC - in response to Message 40829.  

> If I understand what CVMFS does, it just mounts a remote filesystem locally.
No, it also creates a local, client-side cache at $CVMFS_CACHE_BASE and actively manages it to keep within $CVMFS_QUOTA_LIMIT.
If CVMFS is using the defaults and du -hs /var/lib/cvmfs reports only 37 MB, then something's broken - is something overriding the defaults?

Usage should roughly match the CACHEUSE column from cvmfs_config stat.
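For reference, a quick way to check which cache settings are actually in effect (a generic CVMFS client sketch; atlas.cern.ch is the usual repository name here) is:

cvmfs_config showconfig atlas.cern.ch | grep -E 'CVMFS_CACHE_BASE|CVMFS_QUOTA_LIMIT'
cvmfs_config stat -v atlas.cern.ch

The first command prints the effective cache location and quota, the second the current usage per repository.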

I get
:~ > md5sum -b /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img
3ae110eae0fafb7337079066ec64eb15 */cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img
and don't see anything about "rootfs" in /tmp on my working Atlas-native box.
Luigi R.
Joined: 7 Feb 14
Posts: 99
Credit: 5,180,005
RAC: 0
Message 40843 - Posted: 8 Dec 2019, 9:58:23 UTC
Last modified: 8 Dec 2019, 10:04:32 UTC

On Xubuntu 14.04.6 I tried to run 4 single-thread tasks and it didn't work.

cvmfs_config probe returned 6 OKs.
First task started smoothly: https://lhcathome.cern.ch/lhcathome/result.php?resultid=254625003

Then the other tasks got this error:
check cvmfs return values are 0, 256
CVMFS not found, aborting the job
and failed: https://lhcathome.cern.ch/lhcathome/result.php?resultid=254625437
By then cvmfs_config probe was returning only 1 or 2 OKs.
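If anyone else sees failing probes: a first thing worth trying - just a generic suggestion, no guarantee it fixes this case - is to reload and re-probe the client, and to restart autofs if the mounts are stuck:

sudo cvmfs_config reload
sudo cvmfs_config probe
sudo systemctl restart autofs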


This system has a problem with libseccomp too. I will try to fix it using the suggested solutions.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4972&postid=40840#40840


On Xubuntu 18.04.3 I have now started 4 tasks simultaneously and they are all running fine. Indeed, there are 4 athena.py processes.
cvmfs_config probe returns 6 OKs as expected.
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 40851 - Posted: 8 Dec 2019, 17:24:30 UTC - in response to Message 40843.  

I have about a 50% success rate in setting up native ATLAS on any given Ubuntu 18.04.3 machine. That is, sometimes it works and sometimes it doesn't.
I use the same setup routine I always use (it is copy and paste), except that now I don't install Singularity, but just use the version provided via CVMFS.

My most recent (and strangest) case is on a Ryzen 2600 on which I just did a clean install of Ubuntu 18.04 yesterday. Native ATLAS did not work.
But I just happened to set up an additional BOINC instance on that machine, so I thought I would try native ATLAS on BOINC2. It worked.

It just gets curiouser and curiouser (don't try to translate that from English).
I have the feeling that it has something to do with permissions, but I grant it all the permissions I know of.
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 40890 - Posted: 10 Dec 2019, 17:09:24 UTC - in response to Message 40851.  

> I have the feeling that it has something to do with permissions, but I grant it all the permissions I know of.

One thing I have found is that none of the files in the BOINC2 folder are shown as "locked".
But all of them in the BOINC folder (except for "ATLAS_job_2.54_x86_64-pc-linux-gnu.xml") are shown as "locked".

I can remove the lock with "sudo chmod -R 777 /var/lib/boinc-client", but that does not cause them to run.
And aborting those jobs and rebooting simply causes the lock to reappear on any newly-downloaded files.
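To trace where access actually fails - just a generic sketch, with the paths of a default install - one can walk the permissions along the path:

namei -l /var/lib/boinc-client/slots
ls -ld /var/lib/boinc-client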

I run BOINC 7.16.3 from LocutusOfBorg, granting permissions that work on all other projects (LHC and otherwise).
I think the problem needs to be addressed at the LHC end.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,920,396
RAC: 138,050
Message 40892 - Posted: 10 Dec 2019, 18:28:19 UTC - in response to Message 40890.  

How did you organize your filesystem/mountpoints?

What do you mean by "lock"?
Does it mean the user that tries to read a folder has no read/execute permission?

Did you check if a username/groupname is missing (old host/new host)?
Do all users/groups (old host/new host) have the same ids?
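Something like this would show it (generic commands; adjust the names if your setup differs):

getent passwd boinc
getent group boinc
id boinc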
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 40893 - Posted: 10 Dec 2019, 20:18:10 UTC - in response to Message 40892.  
Last modified: 10 Dec 2019, 20:24:56 UTC

> How did you organize your filesystem/mountpoints?
>
> What do you mean by "lock"?
> Does it mean the user that tries to read a folder has no read/execute permission?
>
> Did you check if a username/groupname is missing (old host/new host)?
> Do all users/groups (old host/new host) have the same ids?


By the standard Ubuntu 18.04.3 installation, along with BOINC 7.16.3 as mentioned.
"Lock" means it shows a padlock on the file.
The permissions show the owner as BOINC core client (read and write), and Group as boinc (read only).
I am a member of the "boinc" group.
EDIT: I am also a member of "root".
The Others access is read-only.
The files have execute permission.
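For completeness, group membership can be checked with (generic commands; note that a new membership only takes effect after logging out and back in):

id
groups $USER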


BOINC: Computer ID: 10625929 https://lhcathome.cern.ch/lhcathome/results.php?hostid=10625929
BOINC2: Computer ID: 10626020 https://lhcathome.cern.ch/lhcathome/results.php?hostid=10626020
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 40894 - Posted: 10 Dec 2019, 23:43:56 UTC - in response to Message 40893.  

I am wondering whether BOINC 7.16.3 changed the permissions. I know I had native ATLAS working on BOINC 7.14.2 installations with no problems.
It seems that things started falling apart when I upgraded to 7.16.3.

Unfortunately, I have no way to test that now, since I don't have any of the earlier installations left on my machines.
Don't try to figure out a mess. Thanks for any input though.
lazlo_vii
Joined: 20 Nov 19
Posts: 21
Credit: 1,074,330
RAC: 0
Message 40895 - Posted: 11 Dec 2019, 5:57:02 UTC
Last modified: 11 Dec 2019, 6:52:34 UTC

Users and groups can share names across multiple systems even if they have different UIDs and GIDs locally. Perhaps that is part of the issue?

EDIT: Now that I think about it a bit more, isn't "boinc" a system group? If so, you shouldn't be able to access its data. See "man pam", "man shadow", and "man login.defs" for more info.
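One way to check - just a sketch, and the ranges vary by distro - is to compare the account's UID against the system range defined in login.defs:

getent passwd boinc
grep -E '^(SYS_)?UID_(MIN|MAX)' /etc/login.defs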
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 40904 - Posted: 11 Dec 2019, 17:39:26 UTC - in response to Message 40895.  

> EDIT: Now that I think about it a bit more, isn't "boinc" a system group? If so, you shouldn't be able to access its data. See "man pam", "man shadow", and "man login.defs" for more info.

That is entirely possible. I will be doing a new installation in a couple of days, and try BOINC 7.9.3. Maybe I can spot the difference with 7.16.3. It could be something else.
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 40930 - Posted: 13 Dec 2019, 0:26:42 UTC - in response to Message 40904.  

> I will be doing a new installation in a couple of days, and try BOINC 7.9.3. Maybe I can spot the difference with 7.16.3. It could be something else.

Well, I was able to get BOINC 7.9.3 working with native ATLAS after I granted permissions to the boinc-client folder with
"sudo chmod -R 777 /var/lib/boinc-client". The first group of work units errored out before that, but after a reboot following that change, the next group worked properly.

Also, I noted that before granting the permissions, there were 12 locked files in the boinc-client folder. That was before I attached to any project, so I would call them "system" files for BOINC.
After granting the permissions (and rebooting), there are only four locked files left (all the others are unlocked), plus a couple of locked files for each project.

I compared that to BOINC 7.16.3, and all the "system" files remain locked after a reboot following the granting of the permissions.
So it is apparent that 7.16.3 locks down more files. It seems that LHC and BOINC need to get together on this.
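A less heavy-handed alternative to chmod 777 - an untested assumption on my part, not something I have verified here - would be to make sure the user is in the boinc group and then log out and back in:

sudo usermod -aG boinc $USER
# then re-login, or start a shell with the new group:
newgrp boinc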
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,920,396
RAC: 138,050
Message 40935 - Posted: 13 Dec 2019, 10:47:38 UTC - in response to Message 40930.  

Which files are locked?
Be so kind as to post the (filtered) output of "lslocks".
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 40937 - Posted: 13 Dec 2019, 12:29:19 UTC - in response to Message 40935.  
Last modified: 13 Dec 2019, 12:30:15 UTC

> Be so kind as to post the (filtered) output of "lslocks".

Here it is for the BOINC 7.9.3 machine, where I am running both native ATLAS and CMS as my only BOINC projects.
I am no longer running LHC on the 7.16.3 machine, but could (in BOINC2) if you need it.

$ lslocks
COMMAND           PID  TYPE  SIZE MODE  M      START        END PATH
wrapper_26015_x  1296 POSIX       WRITE 0          0          0 /...
vboxwrapper_261  1293 POSIX       WRITE 0          0          0 /...
update-notifier  9602 FLOCK    0B WRITE 0          0          0 /run/user/1000/u
wrapper_26015_x  1291 POSIX       WRITE 0          0          0 /...
nvidia-persiste   964 POSIX       WRITE 0          0          0 /run...
cvmfs2           2002 POSIX       WRITE 0 1073741824 1073742335 /...
gnome-shell      2356 POSIX  1.1K WRITE 0          0          0 /home/jim/.nv/GL
gnome-shell      2356 POSIX 18.5K WRITE 0          0          0 /home/jim/.nv/GL
vboxwrapper_261  1300 POSIX       WRITE 0          0          0 /...
smbd-notifyd     7006 POSIX       WRITE 0          0          0 /run...
lpqd             7008 POSIX       WRITE 0          0          0 /run...
smbd             6339 POSIX       READ  0          4          4 /run...
smbd             6339 POSIX       READ  0          4          4 /run...
smbd             6339 POSIX       READ  0          4          4 /run...
(unknown)       18914 FLOCK       WRITE 0          0          0 /...
rpcbind           905 FLOCK       WRITE 0          0          0 /run...
nmbd             1068 POSIX       READ  0          4          4 /run...
wrapper_26015_x  1290 POSIX       WRITE 0          0          0 /...
vboxwrapper_261  1299 POSIX       WRITE 0          0          0 /...
VBoxXPCOMIPCD    1419 POSIX       WRITE 0          0          0 /...
(unknown)        2533 FLOCK       WRITE 0          0          0 /...
whoopsie         1061 FLOCK       WRITE 0          0          0 /run/lock...
FAHClient        1255 POSIX       WRITE 0 1073741824 1073742335 /...
cvmfs2           2002 FLOCK       WRITE 0          0          0 /...
cvmfs2           2002 FLOCK       WRITE 0          0          0 /...
FAHClient        1253 POSIX       WRITE 0          0          0 /run...
wrapper_26015_x  1295 POSIX       WRITE 0          0          0 /...
smbd             6339 POSIX       READ  0          4          4 /run...
smbd             6339 POSIX       WRITE 0          0          0 /run...
smbd             6339 POSIX       WRITE 0          0          0 /run...
smbd             6339 POSIX       READ  0          4          4 /run...
smbd             6339 POSIX       READ  0          4          4 /run...
smbd             6339 POSIX       READ  0          4          4 /run...
smbd             6339 POSIX       READ  0          4          4 /run...
smbd             6339 POSIX       READ  0          4          4 /run...
smbd             6339 POSIX       READ  0          4          4 /run...
smbd             6339 POSIX       READ  0          4          4 /run...
cleanupd         7007 POSIX       READ  0          4          4 /run...
cleanupd         7007 POSIX       READ  0          4          4 /run...
cron              937 FLOCK       WRITE 0          0          0 /run...
wrapper_26015_x  1297 POSIX       WRITE 0          0          0 /...
vboxwrapper_261  1294 POSIX       WRITE 0          0          0 /...
systemd-timesyn   904 FLOCK       WRITE 0          0          0 /run...
nmbd             1068 POSIX       READ  0          4          4 /run...
nmbd             1068 POSIX       WRITE 0          0          0 /run...
nmbd             1068 POSIX       WRITE 0          0          0 /run...
FahCore_21       1269 POSIX       WRITE 0          0          0 /...
wrapper_26015_x  1292 POSIX       WRITE 0          0          0 /...
vboxwrapper_261  1298 POSIX       WRITE 0          0          0 /...
cleanupd         7007 POSIX       WRITE 0          0          0 /run...
smbd             6339 POSIX       READ  0          4          4 /run...
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,920,396
RAC: 138,050
Message 40940 - Posted: 13 Dec 2019, 13:17:16 UTC - in response to Message 40937.  

> Here it is for the BOINC 7.9.3 machine, where I am running both native ATLAS and CMS as my only BOINC projects.

Sure?
Looks like there are some locks from Folding@home.
FAHClient        1253 POSIX       WRITE 0          0          0 /run...
FahCore_21       1269 POSIX       WRITE 0          0          0 /...



Besides that, the locks set by BOINC processes don't look unusual (wrapper, vboxwrapper, cvmfs2, etc.).

What makes me wonder:
Do you store some of the BOINC data on Samba shares (smbd, nmbd)?
If so, you may try a different filesystem, e.g. ext3/4 or xfs.

Another issue that should be checked:
If you run additional BOINC clients, each of them must have its own working directory.
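A second instance can be started along these lines (just a sketch; the directory and the RPC port are examples, not a tested recipe):

mkdir -p /var/lib/boinc-client2
boinc --allow_multiple_clients --dir /var/lib/boinc-client2 --gui_rpc_port 31418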
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 40943 - Posted: 13 Dec 2019, 15:45:05 UTC - in response to Message 40940.  

Those are my only BOINC projects. I also run FAH, but only on the GPU, and it should not affect BOINC. I have done it on all my machines (both Windows and Ubuntu) for years without problems.

No, I don't store the BOINC data elsewhere. I am not surprised there isn't anything unusual on the BOINC 7.9.3 machine (after I had granted the extra permissions); it works.

But two questions:
1. Why do I have to grant the extra permissions at all?
2. What is wrong with BOINC 7.16.3? That is where the problem lies.

I have never needed additional working directories for additional BOINC clients, by the way. I often run a second client when running both CPU and GPU clients on the same project (e.g., MilkyWay or Einstein), where the BOINC scheduler has (another) of its scheduling problems. It keeps everything nice and separate.

It works for LHC too; I used the original BOINC instance for CMS, and BOINC2 for native ATLAS on the 7.16.3 machine.
On the BOINC 7.9.3 machine, I can run them both in the same instance. There are no real surprises after it is set up; it is just the hoops you have to jump through to get there.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,920,396
RAC: 138,050
Message 40944 - Posted: 13 Dec 2019, 17:04:24 UTC - in response to Message 40943.  

> Those are my only BOINC projects. I also run FAH, but only on the GPU, and it should not affect BOINC. I have done it on all my machines (both Windows and Ubuntu) for years without problems.

OK


> I have never needed additional directories

> ... must have its own working directory.


> I used the original BOINC instance for CMS, and BOINC2 for native ATLAS

There you state that you run 2 instances.
This requires 2 BOINC working directories.
Otherwise the client instances would fight with each other over the slots/ directories.


> What is wrong with BOINC 7.16.3? That is where the problem lies

Don't know.
So far I never tried 7.16.3.
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 40946 - Posted: 13 Dec 2019, 17:43:15 UTC - in response to Message 40944.  

> What is wrong with BOINC 7.16.3? That is where the problem lies
>
> Don't know.
> So far I never tried 7.16.3.

I see you stayed on 7.14.2. That is a good idea.
It is LHC that needs to address it (maybe in conjunction with BOINC). I can only apply band-aids. Thanks for your input.
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,091,089
RAC: 103,567
Message 40953 - Posted: 14 Dec 2019, 7:30:26 UTC

ATLAS native (one CPU) has now been running for 2 days and 17 hours.
What is interesting: a collision itself takes only a short time, but the overall duration is long.
At the moment, log.EVNTtoHITS shows:
07:51:58 AthenaEventLoopMgr INFO ===>>> done processing event #9782937, run #284500 137 events processed so far <<<===
07:51:58 AthenaEventLoopMgr INFO ===>>> start processing event #9782938, run #284500 137 events processed so far <<<===
08:00:44
08:00:44 -------- WWWW ------- G4Exception-START -------- WWWW -------
08:00:44 *** G4Exception : GeomNav1002
08:00:44 issued by : G4Navigator::ComputeStep()
08:00:44 Track stuck or not moving.
08:00:44 Track stuck, not moving for 10 steps
08:00:44 in volume -LArMgr::LAr::EMEC::Neg::InnerWheel- at point (234.576,650.26,-4235.54)
08:00:44 direction: (0.862717,-0.424166,0.275323).
08:00:44 Potential geometry or navigation problem !
08:00:44 Trying pushing it of 1e-07 mm ...Potential overlap in geometry!
08:00:44
08:00:44 *** This is just a warning message. ***
08:00:44 -------- WWWW -------- G4Exception-END --------- WWWW -------
08:00:44
08:01:04
08:01:04 -------- WWWW ------- G4Exception-START -------- WWWW -------
08:01:04 *** G4Exception : GeomNav1002
08:01:04 issued by : G4Navigator::ComputeStep()
08:01:04 Track stuck or not moving.
08:01:04 Track stuck, not moving for 10 steps
08:01:04 in volume -LArMgr::LAr::EMEC::Pos::InnerWheel- at point (-498.896,-367.897,3811.32)
08:01:04 direction: (0.574521,-0.799023,0.177447).
08:01:04 Potential geometry or navigation problem !
08:01:04 Trying pushing it of 1e-07 mm ...Potential overlap in geometry!
08:01:04
08:01:04 *** This is just a warning message. ***
08:01:04 -------- WWWW -------- G4Exception-END --------- WWWW -------
08:01:04
08:17:01 ISFG4SimSvc INFO Event nr. 138 took 1252 s. New average 1398 +- 32.4
08:17:02 AthenaEventLoopMgr INFO ===>>> done processing event #9782938, run #284500 138 events processed so far <<<===
08:17:02 AthenaEventLoopMgr INFO ===>>> start processing event #9782939, run #284500 138 events processed so far <<<===
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,920,396
RAC: 138,050
Message 40954 - Posted: 14 Dec 2019, 9:05:00 UTC - in response to Message 40953.  
Last modified: 14 Dec 2019, 9:08:55 UTC

> ATLAS native (one CPU) has now been running for 2 days and 17 hours.
> What is interesting: a collision itself takes only a short time, but the overall duration is long.
> [...]
> 08:17:01 ISFG4SimSvc INFO Event nr. 138 took 1252 s. New average 1398 +- 32.4

Total number of events: 200
already finished: 138
average: 1398 s/event

Estimated total calculation time: 200 * 1398 s = 279600 s (3 d 5 h 40 min)

Estimated time left: (200 - 138) * 1398 s = 86676 s (1 d 0 h 5 min)
Uncertainty: (200 - 138) * 32.4 s * 3 = 6027 s (1 h 40 min)

Event calculation times > 1000 s are rather high but not unusual.
I'm currently running a task with event calculation times between 70 s and 733 s.
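To follow the per-event average while a task runs, you can simply grep the log (assuming log.EVNTtoHITS is reachable in the task's slot directory):

grep 'New average' log.EVNTtoHITS | tail -n 1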

<edit>
corrected: uncertainty
</edit>
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,091,089
RAC: 103,567
Message 40956 - Posted: 14 Dec 2019, 9:57:32 UTC - in response to Message 40954.  

Perfect, computezrmle, and this without your RDP program for the ATLAS VM ;-)).
The deadline for this native task is Dec 18th, 14:00 UTC.