Message boards :
Theory Application :
Theory Native issues with cgroups [SOLVED]
Joined: 19 May 10 Posts: 6 Credit: 4,169,522 RAC: 0
I've gone to some great lengths to get Theory Native running on Gentoo Linux, including rewriting the abandoned CVMFS ebuild with the eventual goal of taking over proxy maintainership and making the bits available to other Gentoo users. That all seems to be working. I've completed and validated several TheoryN tasks now, but one problem is still troubling me: control groups, and suspend/resume support.

I followed the instructions to create the cgroup hierarchies for the boinc user, in particular the http://lhcathome.cern.ch/lhcathome/download/create-boinc-cgroup script. There was an issue, initially: on my distribution with systemd, and probably others, the cgroup root tmpfs filesystem at /sys/fs/cgroup is mounted read-only at boot time, so the script bails. Easily handled by remounting the filesystem read-write. Every TheoryN task I run, however, looks like this:

21:14:47 (14185): wrapper (7.15.26016): starting
21:14:47 (14185): wrapper (7.15.26016): starting
21:14:47 (14185): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.29 ()
21:14:47 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Detected TheoryN App
21:14:47 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Checking CVMFS.
21:14:47 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Checking runc.
21:14:52 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Creating the filesystem.
21:14:52 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
21:14:52 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Creating cgroup for slot 18
mkdir: cannot create directory ‘/sys/fs/cgroup/freezer/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/cpuset/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/devices/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/cpu,cpuacct/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/pids/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/hugetlb/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/net_cls/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/net_prio/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/perf_event/boinc/18’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/freezer/boinc/18’: Read-only file system
21:14:52 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Updating config.json.
21:14:52 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Running Container 'runc'.
21:14:55 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] ===> [runRivet] Thu Aug 22 01:14:54 UTC 2019 [boinc pp jets 7000 80,-,1760 - pythia8 8.235 default 100000 90]
21:18:41 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Pausing container TheoryN_2279-786411-90_0.
no such directory for freezer.state
21:18:47 EDT -04:00 2019-08-21: cranky-0.0.29: [INFO] Resuming container TheoryN_2279-786411-90_0.
container not paused

Now... this is strange. The errors from mkdir don't make sense: that's not a read-only file system. I already remounted it read-write.
I can create those slot cgroup hierarchies myself, as the boinc user. I've spent a few hours trying to find out what could cause such an error, but to no avail. I have no explanation for why the task cannot create these hierarchies on its own. The job tends to succeed anyway, but that's not really all I'm after; I want all of the available features to work reliably, including suspend/resume.

Does anyone have any ideas? Control groups are fiddly... and upstream wants us on cgroup v2 already, which uses a single unified hierarchy instead of these per-controller paths, so I have a bad feeling that I'm headed down a rabbit hole.

Incidentally, I don't want to keep burning through TheoryN tasks while troubleshooting and potentially failing them. Is there a way I can take one offline, outside of BOINC, and retry it repeatedly without reporting it back to the job server until I get this nailed down?
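For anyone who wants to poke at this by hand, here is a rough sketch of what a per-slot cgroup setup loop has to do, with the controller names taken from the mkdir errors above (cranky's actual create_cgroup() may differ). The CGROUP_BASE-style parameter is my own addition so the loop can be dry-run against a scratch directory before pointing it at /sys/fs/cgroup as the boinc user.

```shell
#!/bin/sh
# Sketch of a per-slot cgroup creation loop. Controller names are taken
# from the mkdir errors in the task log; the BASE parameter is
# hypothetical and exists so the loop can be exercised safely.
create_slot_cgroups() {  # create_slot_cgroups BASE SLOT
    base=$1; slot=$2
    for ctrl in freezer cpuset devices memory cpu,cpuacct pids \
                blkio hugetlb net_cls net_prio perf_event; do
        mkdir -p "${base}/${ctrl}/boinc/${slot}" ||
            echo "cannot create ${ctrl}/boinc/${slot}" >&2
    done
}

# Dry run against a scratch directory:
scratch=$(mktemp -d)
create_slot_cgroups "$scratch" 18
ls "$scratch/freezer/boinc"   # prints: 18
# To reproduce the failure, run it as the boinc user against the real tree:
# create_slot_cgroups /sys/fs/cgroup 18
```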
Joined: 19 May 10 Posts: 6 Credit: 4,169,522 RAC: 0
Incidentally, the instructions provided -- in particular, the script we're given to add to the boinc-client.service unit file as an ExecStartPre hook -- don't seem to be the right way to create the control group filesystem hierarchies for the boinc user. The reason is that the unit file, if correctly configured, runs all of its tasks as the user "boinc," which ordinarily will not have permission to create these locations. Either root must do the work, or the script doing the work on behalf of the boinc user must be setuid and executable by the boinc user (not a super good idea).

Fortunately, if you are using systemd, there is already a more appropriate way to manage such filesystem objects. The ones we need are, essentially, temporary files in a virtual (tmpfs) filesystem. Systemd's mechanism for managing temporary files is systemd-tmpfiles, configured in /etc/tmpfiles.d. It is responsible for making sure that certain temporary files and directories are present with the correct modes at boot time, and for cleaning them up when appropriate. This is what we want to use.

Therefore, I have created a tmpfiles.d drop-in configuration that creates and correctly sets permissions on each of the cgroup hierarchies required for the boinc user, in a way that is functionally isomorphic to the script provided by CERN.
# /etc/tmpfiles.d/boinc.conf
# Type Path Mode User Group
d /sys/fs/cgroup/freezer/boinc 0775 root boinc
f /sys/fs/cgroup/freezer/boinc/cgroup.procs 0664 root boinc
f /sys/fs/cgroup/freezer/boinc/tasks 0664 root boinc
f /sys/fs/cgroup/freezer/boinc/freezer.state 0664 root boinc
d /sys/fs/cgroup/cpuset/boinc 0775 root boinc
f /sys/fs/cgroup/cpuset/boinc/cgroup.procs 0664 root boinc
f /sys/fs/cgroup/cpuset/boinc/tasks 0664 root boinc
f /sys/fs/cgroup/cpuset/boinc/cpuset.mems 0664 root boinc
f /sys/fs/cgroup/cpuset/boinc/cpuset.cpus 0664 root boinc
d /sys/fs/cgroup/devices/boinc 0775 root boinc
f /sys/fs/cgroup/devices/boinc/cgroup.procs 0664 root boinc
f /sys/fs/cgroup/devices/boinc/tasks 0664 root boinc
d /sys/fs/cgroup/pids/boinc 0775 root boinc
f /sys/fs/cgroup/pids/boinc/cgroup.procs 0664 root boinc
f /sys/fs/cgroup/pids/boinc/tasks 0664 root boinc
d /sys/fs/cgroup/hugetlb/boinc 0775 root boinc
f /sys/fs/cgroup/hugetlb/boinc/cgroup.procs 0664 root boinc
f /sys/fs/cgroup/hugetlb/boinc/tasks 0664 root boinc
d /sys/fs/cgroup/cpu,cpuacct/boinc 0775 root boinc
f /sys/fs/cgroup/cpu,cpuacct/boinc/cgroup.procs 0664 root boinc
f /sys/fs/cgroup/cpu,cpuacct/boinc/tasks 0664 root boinc
d /sys/fs/cgroup/perf_event/boinc 0775 root boinc
f /sys/fs/cgroup/perf_event/boinc/cgroup.procs 0664 root boinc
f /sys/fs/cgroup/perf_event/boinc/tasks 0664 root boinc
d /sys/fs/cgroup/net_cls,net_prio/boinc 0775 root boinc
f /sys/fs/cgroup/net_cls,net_prio/boinc/cgroup.procs 0664 root boinc
f /sys/fs/cgroup/net_cls,net_prio/boinc/tasks 0664 root boinc
d /sys/fs/cgroup/blkio/boinc 0775 root boinc
f /sys/fs/cgroup/blkio/boinc/cgroup.procs 0664 root boinc
f /sys/fs/cgroup/blkio/boinc/tasks 0664 root boinc
d /sys/fs/cgroup/memory/boinc 0775 root boinc
f /sys/fs/cgroup/memory/boinc/cgroup.procs 0664 root boinc
f /sys/fs/cgroup/memory/boinc/tasks 0664 root boinc

Please let this be my small contribution toward making other volunteers' lives slightly easier in this regard, for the permissions issues implicit in doing it the other way may be somewhat frustrating to the novice.
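In case it helps anyone reason about what those lines do: systemd-tmpfiles reads each line as type, path, mode, user, group, and for the `d` and `f` types it simply ensures the directory or file exists with the given mode (on a real system you just drop the file into /etc/tmpfiles.d and run `sudo systemd-tmpfiles --create`, or reboot). The toy interpreter below is my own sketch of those two types only, with ownership handling omitted, so the config's effect can be exercised against a scratch root.

```shell
#!/bin/sh
# Toy re-implementation of systemd-tmpfiles 'd' and 'f' lines only
# (ownership skipped; comment lines fall through the case and are ignored).
# ROOT is prefixed to every path so nothing touches /sys/fs/cgroup.
apply_tmpfiles() {  # apply_tmpfiles CONF ROOT
    conf=$1; root=$2
    while read -r type path mode user group; do
        case $type in
            d) mkdir -p "$root$path" && chmod "$mode" "$root$path" ;;
            f) touch "$root$path" && chmod "$mode" "$root$path" ;;
        esac
    done < "$conf"
}

# Exercise it with a two-line excerpt of the config above:
conf=$(mktemp); root=$(mktemp -d)
cat > "$conf" <<'EOF'
d /sys/fs/cgroup/freezer/boinc 0775 root boinc
f /sys/fs/cgroup/freezer/boinc/tasks 0664 root boinc
EOF
apply_tmpfiles "$conf" "$root"
ls -l "$root/sys/fs/cgroup/freezer/boinc"
```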
Joined: 19 May 10 Posts: 6 Credit: 4,169,522 RAC: 0
So... I had the thought to look at cranky. Cranky is a bash script. Cranky does a few things, on the host Linux machine, to set up the runtime environment for the container it eventually spins up. Among the functions it executes is one called create_cgroup(), which basically does nothing but run mkdir in a loop over control group hierarchy names. Sounds simple so far.

But here's what I don't understand. Cranky runs as the boinc user. Cranky is just a child of the task process. Cranky runs in his slot folder, and he does his thing, but he prints out these "Read-only file system" errors from mkdir inside the create_cgroup() function as if he's not looking at the same filesystem I am. And if I su to boinc and run that very same function in my terminal, inside a directory named like a slot number, I get a pretty control group hierarchy just where I'm supposed to.

This is really, really where I need one of the project developers to say something. Why, for the love of all that is breaking my brain, would cranky's execution environment see a read-only file system when I see one that is perfectly writable? Here. I'll prove it.
boinc@pygoscelis ~ $ mount | grep /sys/
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,noatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=46,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=16318)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)

See? All the /sys/** file systems are mounted with the 'rw' option. Including, and especially, all the control group file systems. There are no read-only file systems here. I'm missing something.
I know I'm missing something, and I think it is somehow particular to BOINC, but I don't know what I'm missing. Please help.
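A sketch of a check that might expose a difference like this: mounts can look different per mount namespace, so compare the per-mount options your shell sees with the ones the client's processes see. The helper below just pulls field 6 (where ro/rw lives) out of /proc/PID/mountinfo; the `boinc` process name is an assumption, and reading another user's /proc entry generally needs root.

```shell
#!/bin/sh
# Print the mount options a given process sees for a given mount point,
# straight from /proc/PID/mountinfo (field 5 = mount point, field 6 =
# per-mount options such as rw or ro).
mount_opts() {  # mount_opts PID MOUNTPOINT
    awk -v mp="$2" '$5 == mp { print $6 }' "/proc/$1/mountinfo"
}

mount_opts $$ /sys/fs/cgroup/freezer        # my shell's view
# The BOINC client's view (process name assumed; run as root):
# mount_opts "$(pgrep -xo boinc)" /sys/fs/cgroup/freezer
```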
Joined: 19 May 10 Posts: 6 Credit: 4,169,522 RAC: 0
Anyone? Please? I've put an awful lot of work into getting this to function on my distribution, but I can't share the payoff with anyone else if that work is incomplete. I need to be able to get the suspend/resume feature to do its job, and it is obviously not doing that. If someone might at least offer some pointers for running a job standalone, detached from boinc, so that I may debug the container and its interactions with the host filesystem repeatably, I could try to make some progress on my own.
Joined: 19 May 10 Posts: 6 Credit: 4,169,522 RAC: 0
I figured it out. It has everything to do with how you run your BOINC client, and whether you use systemd. Moderators: if you would, please, mark this thread as [SOLVED].

If the BOINC client is installed as a distribution package, and the distribution uses systemd, it may ship with the upstream unit file for running the client as a service. That unit file contains the sandboxing option ProtectControlGroups=true. This option does what it sounds like it does: it protects control groups from processes started by the service. To prevent modification, it exposes the /sys/fs/cgroup file system tree to these processes as read-only. Thus, BOINC tasks run by a client started as a service configured this way cannot do exactly the thing Cranky -- LHC@Home's wrapper for the runc container process -- is trying to do, and its setup of the per-slot control groups is bound to fail. This is very important for anyone who wants suspend/resume to actually work.

You can override this option, e.g., by running `systemctl edit boinc-client.service` and putting the following chunk into the override file this command creates:

[Service]
ProtectControlGroups=false

This allows BOINC to modify control groups in the parts of the tree where it has permission to write -- assuming the cgroup root is mounted read-write in the first place. On my distribution it is mounted read-only by default; if that is true for you, add this to /etc/fstab (the "rw" option is the important part):

tmpfs /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,mode=755 0 0

If I may tag Monsieur Laurence, I would kindly advise adding this small piece of information to the Native Theory Application Setup (Linux only) thread, since the pinned post mentions systemd and implies its use in the recommended setup.
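For completeness, the override ends up as a small drop-in like the one below (systemd names it override.conf by default). After saving it, run `systemctl daemon-reload` and restart the service; `systemctl show -p ProtectControlGroups boinc-client.service` should then report `no`.

```ini
# /etc/systemd/system/boinc-client.service.d/override.conf
# (written by `systemctl edit boinc-client.service`)
[Service]
ProtectControlGroups=false
```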
Though this will probably only be useful to the most technical of users, if one is trying to run theoretical particle physics simulations on Linux using containers on a systemd machine with an AUFS kernel and the most recent builds of Singularity, one is probably already among the more technical BOINC users in the world. Now that everything tests successfully, I might finally be able to submit my work to Gentoo in the form of more current ebuilds, and take over maintainership of our CVMFS package in particular. I dearly hope this is useful to someone.
Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0
Great work investigating this issue, Nethershaw. I appreciate it. I'll jump in and test it. It should indeed be added to the main guide.
Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0
I edited the unit with `systemctl edit boinc-client.service` and added this line to fstab:

tmpfs /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,mode=755 0 0

The cgroup got mounted, the boinc client got added, the freezer hierarchy is listed, and the task's process IDs are listed in the cgroup files. It looks like that part works.

Suspend/resume: here is the pythia6 part of runRivet.log around the suspend and resume:

46600 events processed
46700 events processed
46800 events processed
46900 events processed
47000 events processed
dumping histograms...
47100 events processed
47200 events processed
47300 events processed
47400 events processed

Event log:

19:52:13 (5580): wrapper (7.15.26016): starting
19:52:13 (5580): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.31 ()
19:52:13 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Detected Theory App
19:52:13 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Checking CVMFS.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Checking runc.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Creating the filesystem.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Creating cgroup for slot 74
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Updating config.json.
19:52:17 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Running Container 'runc'.
19:52:19 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] ===> [runRivet] Fri Feb 7 18:52:18 UTC 2020 [boinc ppbar jets 1960 140 - pythia6 6.428 393 100000 22]
20:11:12 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Pausing container Theory_2363-878402-22_0.
20:13:26 CET +01:00 2020-02-07: cranky-0.0.31: [INFO] Resuming container Theory_2363-878402-22_0.

The application and wrapper resumed properly in a short time; the task resumed at its last state. But at shutdown of the boinc service there is a problem: the process is uninterruptible.
Starting the boinc service without systemctl would start a second process tree, under init, that is uninterruptible and does not start properly:

Control Group /boinc/74 (blkio,cpu,cpuacct,cpuset,device,freezer,hugetlb,memory,net_cls,net_prio,perf_event,pids), /system.slice/boinc-client.service (systemd)

At reboot it wiped runrivet.log and the task also started from 0%. My setup can probably not handle a shutdown, or I have set something up wrong. I did a simple test, and it could probably be worked around to get it to save the process state.
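A quick way to double-check which cgroups a process actually landed in, if it's useful to anyone: each line of /proc/PID/cgroup is hierarchy-id:controllers:path, so for a running Theory task the freezer line should end in /boinc/<slot>. The slot number 74 below is just the one from my log; substitute your own.

```shell
#!/bin/sh
# Show the cgroup membership of a process (PID defaults to this shell).
proc_cgroups() {  # proc_cgroups [PID]
    cat "/proc/${1:-$$}/cgroup"
}

proc_cgroups        # this shell's membership
# For a running task, take a PID from the slot's cgroup.procs file, e.g.:
# proc_cgroups "$(head -n1 /sys/fs/cgroup/freezer/boinc/74/cgroup.procs)"
```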
©2024 CERN