Theory native fails with "mountpoint for cgroup not found"
Author | Message |
---|---|
Send message Joined: 3 Nov 12 Posts: 54 Credit: 133,413,757 RAC: 185,023 |
Hello out there. I'm on Manjaro (Arch) Linux. Here, for newer kernels, cgroup has changed from V1 to V2. As a result, Theory native fails with "mountpoint for cgroup not found", while ATLAS native runs fine. For now I have to set the kernel parameter "systemd.unified_cgroup_hierarchy=0" to re-enable cgroup V1. Would it be possible to update Theory native so that it runs on both cgroup V1 and V2? I'm not a programmer, so kindly excuse this (maybe silly) question. |
Send message Joined: 15 Jun 08 Posts: 2473 Credit: 245,121,644 RAC: 172,211 |
"... for newer kernels, cgroup has changed from V1 to V2." This affects all Linux systems using cgroups v2. Theory's suspend/resume only supports cgroups v1 (freezer). At the moment there is no solution available, as "somebody" would have to write the code to support cgroups v2. "... while Atlas native runs o.k." Unlike Theory, ATLAS (native) uses Singularity instead of runc and does not support suspend/resume. |
Send message Joined: 14 Sep 08 Posts: 46 Credit: 59,187,848 RAC: 78,992 |
"... for newer kernels, cgroup has changed from V1 to V2." Sorry for digging up this old thread. If I am willing to forgo suspend/resume, could I make native Theory work under cgroups v2? I suppose that means I could lose work or even end up with errors occasionally, but I never suspend or pause work on my server, and I've configured the task switch time so that tasks effectively never switch. So being unable to suspend and resume doesn't seem worth the immediate failure I am getting. Example WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=374269026 |
Send message Joined: 29 Oct 12 Posts: 2 Credit: 1,603,524 RAC: 28 |
Most distros use cgroup v2 and this should not have been taken out of beta with only v1 support. I have nearly 100 failed tasks now just from this application. |
Send message Joined: 14 Sep 08 Posts: 46 Credit: 59,187,848 RAC: 78,992 |
"Most distros use cgroup v2 and this should not have been taken out of beta with only v1 support. I have nearly 100 failed tasks now just from this application." This isn't fair, honestly. The wider adoption of cgroupv2 happened long after this application was released, and it still works in vbox. Ideally, cgroupv2 should have been supported before mainstream distros started to switch over. At this point, I just hope we can get some quick hacks in if the cgroup part is not crucial for the application itself. Even when my system was on cgroupv1, I never bothered to set up suspend and resume. If the current cgroupv2 failure is only about that, I really hope I can just bypass that part. Just to be sure, from what I can find, none of the LHC@Home application source code is open, right? |
Send message Joined: 14 Sep 08 Posts: 46 Credit: 59,187,848 RAC: 78,992 |
"Just to be sure, from what I can find, none of the LHC@Home application source code is open, right?" Turns out this question is irrelevant. This specific issue is only with the runc on cvmfs. The runc that came with my distro had no problem starting containers on cgroupv2. So I hacked around in the cranky script that is used to start native Theory tasks, and it now works without suspend/resume. Note that this has not been tested in any environment other than my own, though it should probably work as long as the distro's runc can cope with cgroupv2. I ran two tasks and they both finished fine: https://lhcathome.cern.ch/lhcathome/result.php?resultid=374799901 https://lhcathome.cern.ch/lhcathome/result.php?resultid=374802093 Note the WARNING about runc in the output, which is what I added in the patch: https://pastebin.com/vpLvagEr. The link will expire in a week in case the patch has undesirable side effects. I can upload a permanent one if an admin approves this. Hopefully just swapping the runc version doesn't have any side effects (like bogus results), but I'd like to get confirmation first. For the real fix, we may not even need a patch if the runc in cvmfs can be upgraded. I don't know if the one in cvmfs is forked, but it certainly seems to be old. `$ runc -v` prints `runc version 1.1.0-0ubuntu1.1 spec: 1.0.2-dev go: go1.18.1 libseccomp: 2.5.3`, while `$ /cvmfs/grid.cern.ch/vc/containers/runc -v` prints `runc version spec: 1.0.0`. If we go that route, the newer runc obviously needs to be tested against other setups to ensure it doesn't break cgroupv1 or any other workload. Suspend/resume on cgroupv2 would need additional work, but cranky already has a test for the cgroup structure. Since cgroupv2 will never have a matching structure, suspend would simply be skipped, same as on cgroupv1 systems without the right cgroup structure. |
Send message Joined: 29 Oct 12 Posts: 2 Credit: 1,603,524 RAC: 28 |
"Turns out this question is irrelevant. This specific issue is only with the runc on cvmfs." Thanks for looking into this. There's a `/cvmfs/grid.cern.ch/vc/containers/runc.new` that also seems to work fine. `❯ /cvmfs/grid.cern.ch/vc/containers/runc.new --version` prints `runc version spec: 1.0.2-dev go: go1.15.14 libseccomp: 2.5.4` |
Send message Joined: 14 Sep 08 Posts: 46 Credit: 59,187,848 RAC: 78,992 |
"Thanks for looking into this. There's a `/cvmfs/grid.cern.ch/vc/containers/runc.new` that also seems to work fine." Perfect. That's an easier patch to maintain. Perhaps someone is already aware of the problem and testing a fix. I can only hope a fix for everyone is coming soon. |
Send message Joined: 15 Mar 20 Posts: 5 Credit: 223,800 RAC: 0 |
I am getting this same/similar issue on my Ubuntu 22.04 LTS Virtualbox. Can someone describe how to upgrade the runc version, and/or if this is a valid fix that doesn't cause data processing errors? <core_client_version>7.20.5</core_client_version> <![CDATA[ <message> process exited with code 195 (0xc3, -61)</message> <stderr_txt> 09:59:06 (17235): wrapper (7.15.26016): starting 09:59:06 (17235): wrapper (7.15.26016): starting 09:59:06 (17235): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.32 () 09:59:06 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Detected Theory App 09:59:06 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Checking CVMFS. 09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Checking runc. 09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Creating the filesystem. 09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3 09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Updating config.json. 09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Running Container 'runc'. container_linux.go:336: starting container process caused "process_linux.go:293: applying cgroup configuration for process caused \"mountpoint for cgroup not found\"" 09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Container 'runc' finished with status code 1. 09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Preparing output. 09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [ERROR] No output found. 09:59:17 (17235): cranky exited; CPU time 0.970083 09:59:17 (17235): app exit status: 0xce 09:59:17 (17235): called boinc_finish(195) |
Send message Joined: 19 Feb 22 Posts: 2 Credit: 2,900,720 RAC: 0 |
I found a way to run native Theory on Debian 12 "bookworm" with BOINC 7.20.5. 1. Go to boincdir/projects/lhcathome.cern.ch_lhcathome and edit cranky-0.0.32: change /cvmfs/grid.cern.ch/vc/containers/runc to /cvmfs/grid.cern.ch/vc/containers/runc.new, so that runc uses cgroupv2. 2. Add GRUB_CMDLINE_LINUX_DEFAULT="vsyscall=emulate quiet" to /etc/default/grub, run sudo update-grub, then restart, as described in https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5580#44241 3. Native Theory should now be running. |
Send message Joined: 19 Nov 14 Posts: 2 Credit: 2,114,911 RAC: 3 |
I can confirm, simply replacing all occurrences of 'containers/runc' with 'containers/runc.new' in /var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/cranky-0.0.32 fixed the issue on Fedora 38. No restart was needed. Thanks @Tianyi Zhang! |
Send message Joined: 19 Feb 22 Posts: 2 Credit: 2,900,720 RAC: 0 |
It seems that whenever BOINC contacts LHC, it checks the file size of cranky-0.0.32; if the size does not match, the file is downloaded again. May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [---] Checking presence of 44 project files May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [LHC@home] File projects/lhcathome.cern.ch_lhcathome/cranky-0.0.32 has wrong size: expected 6010, got 6030 May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [---] Using proxy info from GUI May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 Initialization completed May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [---] Suspending GPU computation - computer is in use May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [climateprediction.net] Fetching scheduler list May 10 09:22:53 tyz-computer boinc[13726]: 10-May-2023 09:22:53 [LHC@home] Started download of cranky-0.0.32 May 10 09:22:55 tyz-computer boinc[13726]: 10-May-2023 09:22:55 [LHC@home] Finished download of cranky-0.0.32 For me, the fix is to edit cranky-0.0.32 in such a way that its size remains 6010. |
Send message Joined: 2 May 07 Posts: 2174 Credit: 171,755,701 RAC: 172,737 |
CentOS9-VM Stream, CVMFS folder in slot Number? 07:31:39 CEST +02:00 2023-05-18: cranky-0.0.32: [INFO] Running Container 'runc'. container_linux.go:336: starting container process caused "process_linux.go:293: applying cgroup configuration for process caused \"mountpoint for cgroup not found\"" 07:31:39 CEST +02:00 2023-05-18: cranky-0.0.32: [INFO] Container 'runc' finished with status code 1. |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
Is this just a case of making cranky use runc.new instead of runc? |
Send message Joined: 15 Jun 08 Posts: 2473 Credit: 245,121,644 RAC: 172,211 |
The issue starts at the cgroups level. When cranky was written, CERN used a Linux version with cgroups v1. Meanwhile, most Linux versions use cgroups v2 (sometimes together with v1). Cranky tests for the existence of the cgroup freezer (v1), which is used for suspend/resume but is not available under v2. If the tests don't succeed, the tasks report an error. Some users patch cranky and/or their Linux base system to skip the tests, which allows tasks to run. Unfortunately this needs active babysitting, since BOINC recognises the patch and overwrites it with the original version under certain circumstances. The runc version also plays a role. The CERN server provides 2 different versions, and there can be other versions installed locally, e.g. those provided by the Linux vendor. My suggestion would be to revise cranky to: (1) enable it to work under cgroups v2 (a must); (2) find a solution for suspend/resume under cgroups v2 (may need a recent runc and a recent Linux kernel); (3) keep the older methods in case the client computer runs an older OS. |
Send message Joined: 2 May 07 Posts: 2174 Credit: 171,755,701 RAC: 172,737 |
I'm seeing two CentOS9 VMs with different cranky dates: one with cranky-0.0.32 from 27.08.22, the other with cranky-0.0.32 from 18.03.23. |
Send message Joined: 15 Jun 08 Posts: 2473 Credit: 245,121,644 RAC: 172,211 |
In both cases, that is the date when your computer downloaded a fresh copy. |
Send message Joined: 2 May 07 Posts: 2174 Credit: 171,755,701 RAC: 172,737 |
"In both cases the date when your computer downloaded a fresh copy." On both CentOS9 VMs, this is the information shown under Properties in the file manager! |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
Thanks for the detailed explanation, I will set aside some time to address this. |
Send message Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
I am currently testing an update on the Dev project https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=630#8136. |
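
The workaround discussed in this thread can be sketched as a few shell helpers. This is only a minimal sketch and not the project's fix. The function names `cgroup_version`, `patch_runc_path` and `size_delta` are hypothetical. It assumes that a cgroups v2 root directory contains a `cgroup.controllers` file, and that the cranky script refers to runc through the string `containers/runc`, as reported in the posts above.

```shell
#!/bin/sh
# Sketch only: the helper names below are illustrative, not part of cranky.

# Print "v2" if the given cgroup root is a unified (v2) hierarchy, else "v1".
# Assumption: a cgroups v2 root always contains the file cgroup.controllers.
cgroup_version() {
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        echo "v2"
    else
        echo "v1"
    fi
}

# Write a copy of a cranky script to $2 in which every 'containers/runc'
# path from $1 points to 'containers/runc.new' instead.
# The first expression undoes an earlier patch so that running the helper
# twice does not produce 'runc.new.new'.
patch_runc_path() {
    src="$1"
    dst="$2"
    sed -e 's#containers/runc\.new#containers/runc#g' \
        -e 's#containers/runc#containers/runc.new#g' "$src" > "$dst"
}

# Print how many bytes file $2 is larger than file $1 (negative if smaller).
# The BOINC log above shows a re-download when the size differs.
size_delta() {
    echo $(( $(wc -c < "$2") - $(wc -c < "$1") ))
}
```

As a usage example, `cgroup_version` tells you whether the host is affected, and `patch_runc_path cranky-0.0.32 cranky.patched` followed by `size_delta cranky-0.0.32 cranky.patched` shows how many bytes the patch adds. The log in this thread shows BOINC downloading the file again when its size changed from 6010 to 6030, so an edited file is presumably only kept if that delta is brought back to 0.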
©2024 CERN