Message boards : Theory Application : Theory Native issues with cgroups [SOLVED]
Sagittarius Lupus
Joined: 19 May 10
Posts: 6
Credit: 4,169,522
RAC: 1
Message 39681 - Posted: 22 Aug 2019, 1:33:20 UTC

I've gone to great lengths to get Theory Native running on Gentoo Linux, including rewriting the abandoned CVMFS ebuild, with the eventual goal of taking over proxy maintainership and making the bits available to other Gentoo users. That all seems to be working. I've completed and validated several TheoryN tasks now, but one problem is still troubling me.

Control groups, and suspend/resume support.

I followed the instructions to create the cgroup hierarchies for the boinc user, in particular the http://lhcathome.cern.ch/lhcathome/download/create-boinc-cgroup script. There was an issue, initially: on my distribution with systemd, and probably others, the cgroup root tmpfs filesystem at /sys/fs/cgroup is mounted read-only at boot time, so the script bails. Easily handled by remounting the filesystem read-write.
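
For anyone following along, the remount itself is a one-liner (run as root); this is just standard mount(8) usage, nothing project-specific:

mount -o remount,rw /sys/fs/cgroup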

Every TheoryN task I run, however, looks like this:

21:14:47 (14185): wrapper (7.15.26016): starting
21:14:47 (14185): wrapper (7.15.26016): starting
21:14:47 (14185): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.29 ()
21:14:47 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Detected TheoryN App
21:14:47 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Checking CVMFS.
21:14:47 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Checking runc.
21:14:52 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Creating the filesystem.
21:14:52 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
21:14:52 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Creating cgroup for slot 18
mkdir: cannot create directory ‘/sys/fs/cgroup/freezer/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/cpuset/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/devices/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/cpu,cpuacct/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/pids/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/hugetlb/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/net_cls/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/net_prio/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/perf_event/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/freezer/boinc/18’: Read-only file system
21:14:52 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Updating config.json.
21:14:52 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Running Container 'runc'.
21:14:55 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] ===> [runRivet] Thu Aug 22 01:14:54 UTC 2019 [boinc pp jets 7000 80,-,1760 - pythia8 8.235 default 100000 90]
21:18:41 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Pausing container TheoryN_2279-786411-90_0.
no such directory for freezer.state
21:18:47 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Resuming container TheoryN_2279-786411-90_0.
container not paused


Now... this is strange. The errors from mkdir don't make sense: that's not a read-only file system. I already remounted it read-write. I can create those slot cgroup hierarchies myself, as the boinc user. I've spent a few hours trying to find out what could cause such an error, but to no avail. I have no explanation for why the task cannot create these hierarchies on its own.
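
To be concrete, this succeeds for me (slot number 18 is just illustrative; the commands assume sudo, but su to boinc works equally well):

$ sudo -u boinc mkdir /sys/fs/cgroup/freezer/boinc/18
$ sudo -u boinc rmdir /sys/fs/cgroup/freezer/boinc/18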

The job tends to succeed anyway, but that's not really all I'm after; I want all of the available features to work reliably, including suspend/resume. Does anyone have any ideas? Control groups are fiddly... and upstream wants us on cgroups-v2 already, which uses a single unified hierarchy instead of these per-controller mounts, so I have a bad feeling that I'm headed down a rabbit hole.

Incidentally, I don't want to keep burning through TheoryN tasks while trying to troubleshoot and potentially failing them. Is there a way I can take one offline, outside of BOINC, and retry it a bunch of times without reporting it back to the job server until I get this nailed down?
ID: 39681
Sagittarius Lupus
Joined: 19 May 10
Posts: 6
Credit: 4,169,522
RAC: 1
Message 39723 - Posted: 24 Aug 2019, 23:03:46 UTC

Incidentally, the instructions provided -- in particular, the script we're given to add to the boinc-client.service unit file as an ExecStartPre hook -- don't seem to be the right way to create the control group filesystem hierarchies for the boinc user. The reason is that a correctly configured unit file runs all of its commands as the user "boinc", which ordinarily does not have permission to create these locations. Either root must do the work, or the script doing the work on behalf of the boinc user must be setuid and executable by that user (not a particularly good idea).
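
One workaround worth noting: systemd will run an individual Exec command with full privileges, even when User=boinc is set, if the command path is prefixed with "+" (supported in systemd 231 and later). With CERN's create-boinc-cgroup script saved at, say, /usr/local/bin/create-boinc-cgroup (the path here is only illustrative), the unit could carry:

[Service]
User=boinc
ExecStartPre=+/usr/local/bin/create-boinc-cgroup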

Fortunately, if you are using systemd, there is already a more appropriate way to manage such filesystem objects. The ones we need are, essentially, temporary files in a virtual (tmpfs) filesystem. Systemd's mechanism for managing temporary files is systemd-tmpfiles, configured in /etc/tmpfiles.d. It is responsible for making sure that certain temporary files and directories are present and have the correct modes set at boot time, and for cleaning them up when appropriate. This is what we want to use.

Therefore, I have created a tmpfiles.d drop-in configuration that creates and correctly sets permissions on each of the cgroup hierarchies required for the boinc user, in a way that is functionally isomorphic to the script provided by CERN.

# /etc/tmpfiles.d/boinc.conf
# Type  Path                                                    Mode    User    Group
d       /sys/fs/cgroup/freezer/boinc                            0775    root    boinc
f       /sys/fs/cgroup/freezer/boinc/cgroup.procs               0664    root    boinc
f       /sys/fs/cgroup/freezer/boinc/tasks                      0664    root    boinc
f       /sys/fs/cgroup/freezer/boinc/freezer.state              0664    root    boinc
d       /sys/fs/cgroup/cpuset/boinc                             0775    root    boinc
f       /sys/fs/cgroup/cpuset/boinc/cgroup.procs                0664    root    boinc
f       /sys/fs/cgroup/cpuset/boinc/tasks                       0664    root    boinc
f       /sys/fs/cgroup/cpuset/boinc/cpuset.mems                 0664    root    boinc
f       /sys/fs/cgroup/cpuset/boinc/cpuset.cpus                 0664    root    boinc
d       /sys/fs/cgroup/devices/boinc                            0775    root    boinc
f       /sys/fs/cgroup/devices/boinc/cgroup.procs               0664    root    boinc
f       /sys/fs/cgroup/devices/boinc/tasks                      0664    root    boinc
d       /sys/fs/cgroup/pids/boinc                               0775    root    boinc
f       /sys/fs/cgroup/pids/boinc/cgroup.procs                  0664    root    boinc
f       /sys/fs/cgroup/pids/boinc/tasks                         0664    root    boinc
d       /sys/fs/cgroup/hugetlb/boinc                            0775    root    boinc
f       /sys/fs/cgroup/hugetlb/boinc/cgroup.procs               0664    root    boinc
f       /sys/fs/cgroup/hugetlb/boinc/tasks                      0664    root    boinc
d       /sys/fs/cgroup/cpu,cpuacct/boinc                        0775    root    boinc
f       /sys/fs/cgroup/cpu,cpuacct/boinc/cgroup.procs           0664    root    boinc
f       /sys/fs/cgroup/cpu,cpuacct/boinc/tasks                  0664    root    boinc
d       /sys/fs/cgroup/perf_event/boinc                         0775    root    boinc
f       /sys/fs/cgroup/perf_event/boinc/cgroup.procs            0664    root    boinc
f       /sys/fs/cgroup/perf_event/boinc/tasks                   0664    root    boinc
d       /sys/fs/cgroup/net_cls,net_prio/boinc                   0775    root    boinc
f       /sys/fs/cgroup/net_cls,net_prio/boinc/cgroup.procs      0664    root    boinc
f       /sys/fs/cgroup/net_cls,net_prio/boinc/tasks             0664    root    boinc
d       /sys/fs/cgroup/blkio/boinc                              0775    root    boinc
f       /sys/fs/cgroup/blkio/boinc/cgroup.procs                 0664    root    boinc
f       /sys/fs/cgroup/blkio/boinc/tasks                        0664    root    boinc
d       /sys/fs/cgroup/memory/boinc                             0775    root    boinc
f       /sys/fs/cgroup/memory/boinc/cgroup.procs                0664    root    boinc
f       /sys/fs/cgroup/memory/boinc/tasks                       0664    root    boinc
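
The drop-in is applied automatically at boot; to apply it immediately without rebooting, you can point systemd-tmpfiles at just this file (run as root):

# systemd-tmpfiles --create /etc/tmpfiles.d/boinc.conf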


Please let this be my small contribution toward making other volunteers' lives slightly easier in this regard, for the permissions issues implicit in doing it the other way may be somewhat frustrating to the novice.
ID: 39723
Sagittarius Lupus
Joined: 19 May 10
Posts: 6
Credit: 4,169,522
RAC: 1
Message 39817 - Posted: 4 Sep 2019, 0:52:06 UTC

So... I had the thought to look at cranky.

Cranky is a bash script.

Cranky does a few things, on the host Linux machine, to set up the runtime environment for the container it eventually spins up. Among the functions it executes is one called create_cgroup(), which does little more than run mkdir in a loop over the control group hierarchy names. Sounds simple so far.
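
In outline (my paraphrase, not the verbatim cranky source), it amounts to something like:

create_cgroup() {
    # The slot number comes from the name of the slot directory cranky runs in.
    local slot
    slot="$(basename "$PWD")"
    for cg in freezer cpuset devices memory cpu,cpuacct pids blkio hugetlb net_cls net_prio perf_event; do
        mkdir "/sys/fs/cgroup/${cg}/boinc/${slot}"
    done
}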

But... here's what I don't understand. Cranky is running as the boinc user. Cranky is just a child of the task process. Cranky runs in his slot folder, and he does his thing, but he prints out these "Read-only file system" errors from mkdir inside the create_cgroup() function as if he's not looking at the same filesystem I am. And if I su to boinc and run that very same function in my terminal from inside a directory named like a slot number, I get a pretty control group hierarchy right where it's supposed to be.

This is really, really where I need one of the project developers to say something. Why, for the love of all that is breaking my brain, would cranky's execution environment see a read-only file system when I see one that is perfectly writable? Here. I'll prove it.

boinc@pygoscelis ~ $ mount | grep /sys/
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,noatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=46,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=16318)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)

See? All the /sys/** file systems are mounted with the 'rw' option. Including, and especially, all the control group file systems. There are no read-only file systems here.

I'm missing something. I know I'm missing something, and I think it is somehow particular to BOINC, but I don't know what I'm missing. Please help.
ID: 39817
Sagittarius Lupus
Joined: 19 May 10
Posts: 6
Credit: 4,169,522
RAC: 1
Message 41484 - Posted: 5 Feb 2020, 23:49:12 UTC

Anyone? Please? I've put an awful lot of work into getting this to function on my distribution, but I can't share the payoff with anyone else if that work is incomplete. I need to be able to get the suspend/resume feature to do its job, and it is obviously not doing that.

If someone could at least offer some pointers for running a job standalone, detached from BOINC, so that I can debug the container and its interactions with the host filesystem repeatably, I could try to make some progress on my own.
ID: 41484
Sagittarius Lupus
Joined: 19 May 10
Posts: 6
Credit: 4,169,522
RAC: 1
Message 41485 - Posted: 6 Feb 2020, 2:13:18 UTC

I figured it out. It has everything to do with how you run your BOINC client, and whether you use systemd. Moderators: if you would, please, mark this thread as [SOLVED].

If the BOINC client is installed as a distribution package, and the distribution uses systemd, the distribution may ship the BOINC client with the upstream unit file for running the client as a service. This unit file contains the sandboxing option ProtectControlGroups=true.

This option does what it sounds like it does: it protects control groups from processes started by the service. To prevent modification, it exposes the /sys/fs/cgroup file system tree to these processes as read-only. Thus, BOINC tasks run by a client started as a service configured this way cannot do exactly the thing Cranky -- LHC@Home's wrapper for the runc container process -- is trying to do, and its setup of the per-slot control groups is bound to fail.

This is very important for anyone who wants suspend/resume to actually work. You can override this option, e.g., by doing `systemctl edit boinc-client.service` and putting the following chunk into the override file this command creates:

[Service]
ProtectControlGroups=false

This will allow BOINC to modify control groups in the parts of the tree where it has permission to write, provided the cgroup filesystem itself is mounted read-write; on some systems it is mounted read-only by default. If yours is, add this to /etc/fstab (the "rw" option is the important part):

tmpfs  /sys/fs/cgroup  tmpfs  rw,nosuid,nodev,noexec,mode=755  0  0
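
Once the override is in place and the service restarted, a quick sanity check (assuming a systemd host):

systemctl show boinc-client.service -p ProtectControlGroups   # should now print ProtectControlGroups=no
findmnt -o TARGET,OPTIONS /sys/fs/cgroup                      # options should include rw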

If I may tag Monsieur Laurence, I would kindly advise adding this small piece of information to the Native Theory Application Setup (Linux only) thread, since the pinned post mentions systemd and implies its use in the recommended setup. Though this will probably only be useful to the most technical of users, if one is trying to run theoretical particle physics simulations on Linux using containers on a systemd machine with an AUFS kernel and the most recent builds of Singularity, one is probably already among the more technical BOINC users in the world.

Now that everything tests successfully, I might finally be able to submit my work to Gentoo in the form of more current ebuilds, and take over maintainership of our CVMFS package in particular. I dearly hope this is useful to someone.
ID: 41485
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 41493 - Posted: 7 Feb 2020, 17:45:14 UTC

Great work investigating this issue, Nethershaw. I appreciate it.

I'll jump in and test it. It should indeed be added to the main guide.
ID: 41493
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 41494 - Posted: 7 Feb 2020, 20:38:31 UTC
Last modified: 7 Feb 2020, 20:50:18 UTC

I have edited with
systemctl edit boinc-client.service

And added this line to fstab:
tmpfs  /sys/fs/cgroup  tmpfs  rw,nosuid,nodev,noexec,mode=755  0  0


The cgroup got mounted and the boinc client was added; the freezer is listed, info got written to the log, and the process and task IDs are listed in the cgroup files. It looks like that part works.

Suspend/Resume
runRivet.log (the pythia6 part, around the suspend and resume):
46600 events processed
46700 events processed
46800 events processed
46900 events processed
47000 events processed
dumping histograms...
47100 events processed
47200 events processed
47300 events processed
47400 events processed


Eventlog
19:52:13 (5580): wrapper (7.15.26016): starting
19:52:13 (5580): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.31 ()
19:52:13 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Detected Theory App
19:52:13 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Checking CVMFS.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Checking runc.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Creating the filesystem.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Creating cgroup for slot 74
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Updating config.json.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Running Container 'runc'.
19:52:19 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] ===> [runRivet] Fri Feb  7 18:52:18 UTC 2020 [boinc ppbar jets 1960 140 - pythia6 6.428 393 100000 22]
20:11:12 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Pausing container Theory_2363-878402-22_0.
20:13:26 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Resuming container Theory_2363-878402-22_0.


The application and wrapper resumed properly; within a short time the task resumed at its last state.

But at shutdown of the boinc service there is a problem: the process is uninterruptible. Starting the boinc service without systemctl would start a second process tree, parented to init, that is also uninterruptible and does not come up properly.
Control Group /boinc/74 (blkio,cpu,cpuacct,cpuset,device,freezer,hugetlb,memory,net_cls,net_prio,perf_event.pids),/system.slice/boinc-client.service (,systemd)

At reboot it wiped runRivet.log and the task also started over from 0%.


My setup probably cannot handle a shutdown, or I have set it up wrong. I only did a simple test, and it could probably be worked around to get it to save the process state.
ID: 41494
