Theory native fails with "mountpoint for cgroup not found"

Saturn911 · Message 45423 · Posted: 6 Oct 2021, 5:56:38 UTC

Hello out there,

I'm on Manjaro (Arch) Linux. Here, for newer kernels, cgroup has changed from v1 to v2. This ends with "mountpoint for cgroup not found" for Theory native, while ATLAS native runs OK. So I have to set the kernel parameter "systemd.unified_cgroup_hierarchy=0" to enable cgroup v1.

Is it impossible to update Theory native so that it runs on both cgroup v1 and v2? I'm not a programmer, so kindly excuse this (maybe silly) question.

computezrmle (Volunteer moderator, Volunteer developer, Volunteer tester, Help desk expert) · Message 45424 · Posted: 6 Oct 2021, 6:49:10 UTC · in response to Message 45423

> ... for newer kernels, cgroup has changed from V1 to V2. This ends with "mountpoint for cgroup not found" for Theory native ...

This affects all Linux systems using cgroups v2. Theory's suspend/resume only supports cgroups v1 (freezer). At the moment there is no solution available, as it would mean "somebody" would have to write the code to support cgroups v2.

> ... while Atlas native runs o.k.

Unlike Theory, ATLAS (native)
- uses Singularity instead of runc
- does not support suspend/resume.
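For readers unsure which cgroup layout their own host uses, here is a minimal shell sketch. It relies on the fact that a unified (v2) hierarchy is mounted at /sys/fs/cgroup with filesystem type `cgroup2fs`, whereas v1 and hybrid setups typically show `tmpfs` there. The function name `classify_cgroup_fs` is made up for this illustration; it is not part of cranky or runc.

```shell
#!/bin/sh
# Hypothetical helper (not from cranky): map the filesystem type reported
# for /sys/fs/cgroup to a cgroup layout.
classify_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo "v2" ;;            # unified hierarchy
        tmpfs)     echo "v1-or-hybrid" ;;  # per-controller v1 mounts live below
        *)         echo "unknown" ;;
    esac
}

# Usage on a Linux host with GNU stat:
#   classify_cgroup_fs "$(stat -fc %T /sys/fs/cgroup)"
```

If this reports `v2`, the host is in the situation described above, where the freezer-based suspend/resume is not available.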

wujj123456 · Message 47598 · Posted: 21 Dec 2022, 22:14:41 UTC · in response to Message 45424

> > ... for newer kernels, cgroup has changed from V1 to V2. This ends with "mountpoint for cgroup not found" for Theory native ...
>
> This affects all Linux systems using cgroups v2. Theory's suspend/resume does only support cgroups v1 (freezer). ATM there's no solution available as it would mean "somebody" would have to write the code to support cgroups v2.
>
> > ... while Atlas native runs o.k.
>
> Unlike Theory ATLAS (native) - uses Singularity instead of Runc - does not support suspend/resume.

Sorry for digging up this old thread. I wonder: if I am willing to forgo suspend/resume, could I make native Theory work under cgroups v2? I suppose that means I could lose work or even end up with errors occasionally, but I never suspend or pause work on my server, and I've configured the task switch time to effectively never switch. So being unable to suspend and resume doesn't seem worth the immediate failure I am getting.

Example WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=374269026

Xipeos · Message 47616 · Posted: 25 Dec 2022, 6:32:22 UTC

Most distros use cgroup v2, and this should not have been taken out of beta with only v1 support. I have nearly 100 failed tasks now just from this application.

wujj123456 · Message 47617 · Posted: 26 Dec 2022, 1:52:33 UTC · in response to Message 47616

> Most distros use cgroup v2 and this should not have been taken out of beta with only v1 support. I have nearly 100 failed tasks now just from this application.

This isn't fair, honestly. The wider adoption of cgroupv2 happened long after this application was released, and it still works in vbox. Ideally, cgroupv2 should have been supported before mainstream distros started to switch over. At this point, I just hope we can get some quick hacks in if the cgroup part is not crucial for the application itself. Even when my system was on cgroupv1, I never bothered to set up suspend and resume. If the current cgroupv2 failure is just about that, I really hope I can simply bypass that part.

Just to be sure, from what I can find, none of the LHC@Home application source code is open, right?

wujj123456 · Message 47618 · Posted: 26 Dec 2022, 4:09:38 UTC · in response to Message 47617

> Just to be sure, from what I can find, none of the LHC@Home application source code is open, right?

Turns out this question is irrelevant. This specific issue is only with the runc on cvmfs. The runc that came with my distro had no problem starting containers on cgroupv2. So I hacked around in the cranky script that is used to start native Theory tasks, and it now works without suspend/resume. Note that this has not been tested in any environment other than my own, though it should probably work as long as the distro's runc can cope with cgroupv2.

I ran two tasks and they both finished fine:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=374799901
https://lhcathome.cern.ch/lhcathome/result.php?resultid=374802093

Note the WARNING about runc in the output, which is what I added in the patch: https://pastebin.com/vpLvagEr. The link will expire in a week in case the patch has undesirable side effects. I can upload a permanent one if the admins approve this. Hopefully just swapping the runc version doesn't have any side effects (like bogus results), but I'd like to get confirmation first.

For the real fix, we may not even need any patch if we can upgrade the runc in cvmfs. I don't know if the one in cvmfs is forked, but it certainly seems to be old.

    $ runc -v
    runc version 1.1.0-0ubuntu1.1
    spec: 1.0.2-dev
    go: go1.18.1
    libseccomp: 2.5.3
    $ /cvmfs/grid.cern.ch/vc/containers/runc -v
    runc version
    spec: 1.0.0

If we go that route, the newer runc obviously needs to be tested against other setups to ensure it doesn't break cgroupv1 or any other workload. Suspend/resume on cgroupv2 would need additional work, but cranky already has a test for the cgroup structure. Since cgroupv2 will never have a matching structure, suspend would be skipped just fine, the same as on cgroupv1 systems without the right cgroup structure.

Xipeos · Message 47619 · Posted: 26 Dec 2022, 6:08:15 UTC · in response to Message 47618 · Last modified: 26 Dec 2022, 6:08:27 UTC

> Turns out this question is irrelevant. This specific issue is only with the runc on cvmfs.

Thanks for looking into this. There's a `/cvmfs/grid.cern.ch/vc/containers/runc.new` that also seems to work fine.

    ❯ /cvmfs/grid.cern.ch/vc/containers/runc.new --version
    runc version
    spec: 1.0.2-dev
    go: go1.15.14
    libseccomp: 2.5.4

wujj123456 · Message 47620 · Posted: 26 Dec 2022, 6:18:17 UTC · in response to Message 47619

> Thanks for looking into this. There's a `/cvmfs/grid.cern.ch/vc/containers/runc.new` that also seems to work fine.

Perfect. That's an easier patch to maintain. Perhaps someone is aware of the problem and is already testing a fix. I can only hope a fix for everyone is coming soon.

wanderphx · Message 47799 · Posted: 27 Feb 2023, 15:57:48 UTC

I am getting this same or a similar issue on my Ubuntu 22.04 LTS VirtualBox. Can someone describe how to upgrade the runc version, and/or whether this is a valid fix that doesn't cause data processing errors?

    <core_client_version>7.20.5</core_client_version>
    <![CDATA[
    <message>
    process exited with code 195 (0xc3, -61)</message>
    <stderr_txt>
    09:59:06 (17235): wrapper (7.15.26016): starting
    09:59:06 (17235): wrapper (7.15.26016): starting
    09:59:06 (17235): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.32 ()
    09:59:06 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Detected Theory App
    09:59:06 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Checking CVMFS.
    09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Checking runc.
    09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Creating the filesystem.
    09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
    09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Updating config.json.
    09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Running Container 'runc'.
    container_linux.go:336: starting container process caused "process_linux.go:293: applying cgroup configuration for process caused \"mountpoint for cgroup not found\""
    09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Container 'runc' finished with status code 1.
    09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Preparing output.
    09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [ERROR] No output found.
    09:59:17 (17235): cranky exited; CPU time 0.970083
    09:59:17 (17235): app exit status: 0xce
    09:59:17 (17235): called boinc_finish(195)

Tianyi Zhang · Message 48074 · Posted: 9 May 2023, 15:39:12 UTC · Last modified: 9 May 2023, 16:01:24 UTC

I found a way to run native Theory on Debian 12 "bookworm" with BOINC 7.20.5:

1. Go to boincdir/projects/lhcathome.cern.ch_lhcathome and edit cranky-0.0.32: change /cvmfs/grid.cern.ch/vc/containers/runc to /cvmfs/grid.cern.ch/vc/containers/runc.new, so that the runc in use works with cgroupv2.
2. Add GRUB_CMDLINE_LINUX_DEFAULT="vsyscall=emulate quiet" to /etc/default/grub, run sudo update-grub, then restart, as described in https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5580#44241
3. Now it should be running.

rushmash · Message 48075 · Posted: 9 May 2023, 23:10:58 UTC

I can confirm: simply replacing all occurrences of 'containers/runc' with 'containers/runc.new' in /var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/cranky-0.0.32 fixed the issue on Fedora 38. No restart was needed. Thanks @Tianyi Zhang!
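The replacement described in this post could be scripted roughly as follows. This is only a sketch under assumptions: the function name `use_runc_new` and the `.orig` backup suffix are invented here, and the pattern simply rewrites every `containers/runc` reference in the given file, which may or may not match what each poster edited by hand.

```shell
#!/bin/sh
# Hypothetical helper (not part of cranky): point every reference to
# ".../containers/runc" at ".../containers/runc.new" in the given file.
# The optional "(\.new)?" group makes the edit idempotent, so a second
# run does not produce "runc.new.new".
use_runc_new() {
    file="$1"
    # Keep one backup of the unpatched script; never overwrite it.
    [ -e "$file.orig" ] || cp -p "$file" "$file.orig"
    sed -E 's#containers/runc(\.new)?#containers/runc.new#g' "$file" > "$file.tmp" &&
        cat "$file.tmp" > "$file" &&   # write in place so the file mode is kept
        rm -f "$file.tmp"
}

# Usage, with the path quoted in the post above:
#   use_runc_new /var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/cranky-0.0.32
```

Writing the result back with `cat` rather than `mv` is a deliberate choice here, so that the script keeps its executable bit.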

Tianyi Zhang · Message 48076 · Posted: 10 May 2023, 1:36:58 UTC · in response to Message 48075 · Last modified: 10 May 2023, 1:37:29 UTC

It seems that whenever BOINC contacts LHC, it checks the file size of cranky-0.0.32; if the size does not match, the file is downloaded again.

    May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [---] Checking presence of 44 project files
    May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [LHC@home] File projects/lhcathome.cern.ch_lhcathome/cranky-0.0.32 has wrong size: expected 6010, got 6030
    May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [---] Using proxy info from GUI
    May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 Initialization completed
    May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [---] Suspending GPU computation - computer is in use
    May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [climateprediction.net] Fetching scheduler list
    May 10 09:22:53 tyz-computer boinc[13726]: 10-May-2023 09:22:53 [LHC@home] Started download of cranky-0.0.32
    May 10 09:22:55 tyz-computer boinc[13726]: 10-May-2023 09:22:55 [LHC@home] Finished download of cranky-0.0.32

For me, the fix is to edit cranky-0.0.32 so that its size remains 6010.
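The log above shows only a size comparison (expected 6010, got 6030). The 20-byte difference would be consistent with five insertions of the four-character suffix `.new`, although the post does not say how many places were edited. A small sketch for checking a patched file against the expected size follows; the helper name `size_matches` is hypothetical and has nothing to do with BOINC's own check.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE ($1) is exactly EXPECTED ($2)
# bytes long. "tr" strips the padding some wc implementations print.
size_matches() {
    actual=$(wc -c < "$1" | tr -d '[:space:]')
    [ "$actual" -eq "$2" ]
}

# Usage, with the size taken from the log above:
#   size_matches cranky-0.0.32 6010 && echo "size unchanged"
```

A patched script could be brought back to the expected size by trimming the same number of bytes elsewhere, for example from a comment, which appears to be what the poster means by keeping the size at 6010.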

maeax · Message 48113 · Posted: 18 May 2023, 6:30:01 UTC

CentOS 9 Stream VM. CVMFS folder in slot number?

    07:31:39 CEST +02:00 2023-05-18: cranky-0.0.32: [INFO] Running Container 'runc'.
    container_linux.go:336: starting container process caused "process_linux.go:293: applying cgroup configuration for process caused \"mountpoint for cgroup not found\""
    07:31:39 CEST +02:00 2023-05-18: cranky-0.0.32: [INFO] Container 'runc' finished with status code 1.

Laurence (Project administrator, Project developer) · Message 48455 · Posted: 17 Aug 2023, 7:58:36 UTC · in response to Message 48437

Is this just a case of making cranky use runc.new instead of runc?

computezrmle (Volunteer moderator, Volunteer developer, Volunteer tester, Help desk expert) · Message 48456 · Posted: 17 Aug 2023, 9:10:36 UTC · in response to Message 48455

The issue starts at the cgroups level. When cranky was written, CERN used a Linux version with cgroups v1. Meanwhile, most Linux versions use cgroups v2 (sometimes together with v1).

Cranky tests for the existence of the cgroup freezer (v1), which is used for suspend/resume but is not available under v2. If the tests don't succeed, the tasks report an error.

Some users patch cranky and/or their underlying Linux system to skip the tests, which allows tasks to run. Unfortunately, this needs active babysitting, since BOINC recognises the patch and overwrites it with the original version under certain circumstances.

The runc version also plays a role. The CERN server provides 2 different versions, and there can be other versions installed locally, e.g. those provided by the Linux vendor.

My suggestion would be to revise cranky to
- enable it to work under cgroups v2 (a must)
- find a solution for suspend/resume under cgroups v2 (may need a recent runc and a recent Linux kernel)
- keep the older methods in case the client computer runs an older OS
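As a rough illustration of the kind of test described here, the sketch below distinguishes a v1 freezer hierarchy from a unified v2 hierarchy by looking at the cgroup root. It is not cranky's actual code, which is not shown in this thread. The function name `suspend_backend` is invented, and the sketch assumes the common layout where the v1 freezer is mounted at `<root>/freezer` and a v2 root contains the file `cgroup.controllers`.

```shell
#!/bin/sh
# Hypothetical sketch, not cranky's real test: report which cgroup layout
# a suspend/resume implementation would have to deal with.
suspend_backend() {
    root="${1:-/sys/fs/cgroup}"
    if [ -d "$root/freezer" ]; then
        echo "v1-freezer"      # classic freezer controller is mounted
    elif [ -f "$root/cgroup.controllers" ]; then
        echo "v2"              # unified hierarchy, no separate freezer mount
    else
        echo "none"
    fi
}

# Usage: suspend_backend            (inspects /sys/fs/cgroup)
#        suspend_backend /some/dir  (inspects another root, e.g. for testing)
```

On a v2 host, recent kernels expose a per-cgroup `cgroup.freeze` file, which may be one route to the suspend/resume support suggested above, in line with the remark that a recent kernel and runc could be required.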

maeax · Message 48457 · Posted: 17 Aug 2023, 9:24:30 UTC · in response to Message 48455

I am seeing two CentOS 9 VMs with different cranky dates: one with cranky-0.0.32 from 27.08.22, the other with cranky-0.0.32 from 18.03.23.

computezrmle (Volunteer moderator, Volunteer developer, Volunteer tester, Help desk expert) · Message 48458 · Posted: 17 Aug 2023, 9:35:51 UTC · in response to Message 48457

In both cases, that is the date when your computer downloaded a fresh copy.

maeax · Message 48459 · Posted: 17 Aug 2023, 10:30:25 UTC · in response to Message 48458

> In both cases the date when your computer downloaded a fresh copy.

On both CentOS 9 VMs, this is the information shown under Properties in the file manager!

Laurence (Project administrator, Project developer) · Message 48460 · Posted: 17 Aug 2023, 14:45:50 UTC · in response to Message 48456

Thanks for the detailed explanation; I will set aside some time to address this.

Laurence (Project administrator, Project developer) · Message 48465 · Posted: 21 Aug 2023, 9:59:56 UTC · in response to Message 48460

I am currently testing an update on the Dev project: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=630#8136

LHC@home