Message boards : Theory Application : Theory native fails with "mountpoint for cgroup not found"

Saturn911

Joined: 3 Nov 12
Posts: 36
Credit: 114,124,756
RAC: 88,515
Message 45423 - Posted: 6 Oct 2021, 5:56:38 UTC

Hello out there
I'm on Manjaro (Arch) Linux.
Here, for newer kernels, cgroup has changed from V1 to V2.
This ends with "mountpoint for cgroup not found" for Theory native
while Atlas native runs o.k.

So I have to set kernel parameter
"systemd.unified_cgroup_hierarchy=0"
to enable cgroup V1.

Is it possible to update Theory native so that it runs on both (cgroup V1 and V2)?
I'm not a programmer so kindly excuse this (maybe silly) question.
ID: 45423
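A quick way to see which cgroup layout a machine is using is to look at the filesystem type mounted at /sys/fs/cgroup. The sketch below assumes GNU stat and the usual mount layout (a tmpfs holding per-controller mounts on v1 or hybrid systems); the function name cgroup_version is invented for this example and is not part of cranky or BOINC.

```shell
# Report which cgroup hierarchy is mounted at the given path
# (default: /sys/fs/cgroup). cgroup_version is a hypothetical helper name.
cgroup_version() {
    dir="${1:-/sys/fs/cgroup}"
    # stat -f inspects the filesystem; %T prints its type name.
    fstype=$(stat -fc %T "$dir" 2>/dev/null) || { echo unknown; return; }
    case "$fstype" in
        cgroup2fs) echo v2 ;;       # unified hierarchy
        tmpfs)     echo v1 ;;       # heuristic: legacy or hybrid layout
        *)         echo unknown ;;
    esac
}

cgroup_version
```

On a system booted with systemd.unified_cgroup_hierarchy=0 this would be expected to report v1, but that depends on the distribution's mount setup.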
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,952,802
RAC: 137,008
Message 45424 - Posted: 6 Oct 2021, 6:49:10 UTC - in response to Message 45423.  

... for newer kernels, cgroup has changed from V1 to V2.
This ends with "mountpoint for cgroup not found" for Theory native ...

This affects all Linux systems using cgroups v2.
Theory's suspend/resume only supports cgroups v1 (freezer).
ATM there's no solution available, as it would mean "somebody" would have to write the code to support cgroups v2.


... while Atlas native runs o.k.

Unlike Theory, ATLAS (native)
- uses Singularity instead of Runc
- does not support suspend/resume.
ID: 45424
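To check whether a machine still exposes the v1 freezer mentioned above, one can look for a cgroup v1 mount that carries the freezer controller. This is a minimal sketch assuming the usual /proc/mounts layout; the function name has_v1_freezer and its optional file argument are invented for this example and are not part of cranky.

```shell
# Succeed if a cgroup v1 mount with the freezer controller is listed.
# has_v1_freezer is a hypothetical helper; the optional argument lets the
# check read a file other than /proc/mounts.
has_v1_freezer() {
    mounts_file="${1:-/proc/mounts}"
    [ -r "$mounts_file" ] || return 1
    # Field 3 is the filesystem type ("cgroup" for v1, "cgroup2" for v2);
    # field 4 holds the comma-separated mount options, which for v1
    # include the controller names.
    awk '$3 == "cgroup" && $4 ~ /(^|,)freezer(,|$)/ { found = 1 }
         END { exit !found }' "$mounts_file"
}

if has_v1_freezer; then
    echo "cgroup v1 freezer is mounted"
else
    echo "no cgroup v1 freezer mount found"
fi
```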
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 48,527,658
RAC: 40,399
Message 47598 - Posted: 21 Dec 2022, 22:14:41 UTC - in response to Message 45424.  

... for newer kernels, cgroup has changed from V1 to V2.
This ends with "mountpoint for cgroup not found" for Theory native ...

This affects all Linux systems using cgroups v2.
Theory's suspend/resume only supports cgroups v1 (freezer).
ATM there's no solution available, as it would mean "somebody" would have to write the code to support cgroups v2.


... while Atlas native runs o.k.

Unlike Theory, ATLAS (native)
- uses Singularity instead of Runc
- does not support suspend/resume.

Sorry for digging up this old thread. If I am willing to forgo suspend/resume, could I make native Theory work under cgroups v2?

I suppose that means I could lose work or even end up with errors occasionally, but I never suspend or pause work on my server, and I've configured the task switch time so that tasks effectively never switch. So being unable to suspend and resume doesn't seem worth the immediate failure I am getting. Example WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=374269026
ID: 47598
Xipeos

Joined: 29 Oct 12
Posts: 2
Credit: 1,535,232
RAC: 1,215
Message 47616 - Posted: 25 Dec 2022, 6:32:22 UTC

Most distros use cgroup v2 and this should not have been taken out of beta with only v1 support. I have nearly 100 failed tasks now just from this application.
ID: 47616
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 48,527,658
RAC: 40,399
Message 47617 - Posted: 26 Dec 2022, 1:52:33 UTC - in response to Message 47616.  

Most distros use cgroup v2 and this should not have been taken out of beta with only v1 support. I have nearly 100 failed tasks now just from this application.

Honestly, this isn't fair. The wider adoption of cgroupv2 happened long after this application was released, and it still works in vbox. Ideally, cgroupv2 should have been supported before mainstream distros started to switch over. At this point, I just hope we can get some quick hacks in if the cgroup part is not crucial for the application itself. Even when my system was on cgroupv1, I never bothered to set up suspend and resume. If the current cgroupv2 failure is just for that, I really hope I could just bypass that part.

Just to be sure, from what I can find, none of the LHC@Home application source code is open, right?
ID: 47617
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 48,527,658
RAC: 40,399
Message 47618 - Posted: 26 Dec 2022, 4:09:38 UTC - in response to Message 47617.  

Just to be sure, from what I can find, none of the LHC@Home application source code is open, right?

Turns out this question is irrelevant. This specific issue is only with the runc on cvmfs. The runc that came with my distro had no problem starting containers on cgroupv2. So I hacked around in the cranky script that is used to start native Theory tasks, and it now works without suspend/resume. Note that this has not been tested in any environment other than my own, though it should probably work as long as the distro's runc can cope with cgroupv2.

I ran two tasks and they both finished fine:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=374799901
https://lhcathome.cern.ch/lhcathome/result.php?resultid=374802093
Note the WARNING about runc in the output, which is what I added in the patch: https://pastebin.com/vpLvagEr. The link will expire in a week in case the patch has undesirable side effects. I can upload a permanent one if an admin approves this. Hopefully just swapping the runc version doesn't have any side effects (like bogus results), but I'd like to get confirmation first.

For the real fix, we may not even need any patch, if we can upgrade the runc in cvmfs. I don't know if the one in cvmfs is forked, but it seems to be old for sure.
$ runc -v
runc version 1.1.0-0ubuntu1.1
spec: 1.0.2-dev
go: go1.18.1
libseccomp: 2.5.3

$ /cvmfs/grid.cern.ch/vc/containers/runc -v
runc version spec: 1.0.0

If we go that route, the newer runc obviously needs to be tested against other setups to ensure it doesn't break cgroupv1 or any other workload. Suspend/resume on cgroupv2 would need additional work, but cranky already has a test for the cgroup structure. Since cgroupv2 will never have a matching structure, suspend would be skipped just fine, the same as on cgroupv1 systems without the right cgroup structure.
ID: 47618
Xipeos

Joined: 29 Oct 12
Posts: 2
Credit: 1,535,232
RAC: 1,215
Message 47619 - Posted: 26 Dec 2022, 6:08:15 UTC - in response to Message 47618.  
Last modified: 26 Dec 2022, 6:08:27 UTC

Turns out this question is irrelevant. This specific issue is only with the runc on cvmfs.


Thanks for looking into this. There's a `/cvmfs/grid.cern.ch/vc/containers/runc.new` that also seems to work fine.
❯ /cvmfs/grid.cern.ch/vc/containers/runc.new --version
runc version spec: 1.0.2-dev
go: go1.15.14
libseccomp: 2.5.4
ID: 47619
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 48,527,658
RAC: 40,399
Message 47620 - Posted: 26 Dec 2022, 6:18:17 UTC - in response to Message 47619.  

Thanks for looking into this. There's a `/cvmfs/grid.cern.ch/vc/containers/runc.new` that also seems to work fine.

Perfect. That's an easier patch to maintain. Perhaps someone is already aware of the problem and testing a fix. I can only hope a fix for everyone is coming soon.
ID: 47620
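With a distro runc and two CVMFS copies in play, it can help to see which candidate would actually be usable on a given host. The following is only a sketch of that selection; first_executable is a hypothetical helper, the preference order is an assumption, and none of this reflects how cranky chooses its runc.

```shell
# Print the first argument that is an executable regular file;
# return non-zero if none qualifies.
# first_executable is a hypothetical helper, not something cranky provides.
first_executable() {
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Assumed preference order: a runc found on PATH, then the two CVMFS
# copies mentioned in this thread.
if runc_bin=$(first_executable \
        "$(command -v runc)" \
        /cvmfs/grid.cern.ch/vc/containers/runc.new \
        /cvmfs/grid.cern.ch/vc/containers/runc); then
    echo "would use: $runc_bin"
else
    echo "no runc found" >&2
fi
```

Whichever binary is picked, results should still be checked against a known-good setup, as asked above.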
wanderphx

Joined: 15 Mar 20
Posts: 5
Credit: 221,801
RAC: 0
Message 47799 - Posted: 27 Feb 2023, 15:57:48 UTC

I am getting this same/similar issue on my Ubuntu 22.04 LTS Virtualbox. Can someone describe how to upgrade the runc version, and/or if this is a valid fix that doesn't cause data processing errors?

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
09:59:06 (17235): wrapper (7.15.26016): starting
09:59:06 (17235): wrapper (7.15.26016): starting
09:59:06 (17235): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.32 ()
09:59:06 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Detected Theory App
09:59:06 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Checking CVMFS.
09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Checking runc.
09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Creating the filesystem.
09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Updating config.json.
09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Running Container 'runc'.
container_linux.go:336: starting container process caused "process_linux.go:293: applying cgroup configuration for process caused \"mountpoint for cgroup not found\""
09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Container 'runc' finished with status code 1.
09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [INFO] Preparing output.
09:59:16 EST -05:00 2023-02-27: cranky-0.0.32: [ERROR] No output found.
09:59:17 (17235): cranky exited; CPU time 0.970083
09:59:17 (17235): app exit status: 0xce
09:59:17 (17235): called boinc_finish(195)
ID: 47799
Tianyi Zhang

Joined: 19 Feb 22
Posts: 2
Credit: 2,900,720
RAC: 0
Message 48074 - Posted: 9 May 2023, 15:39:12 UTC
Last modified: 9 May 2023, 16:01:24 UTC

I found a way to run native Theory on Debian 12 "bookworm" with BOINC 7.20.5:
1. Go to boincdir/projects/lhcathome.cern.ch_lhcathome and edit cranky-0.0.32: change /cvmfs/grid.cern.ch/vc/containers/runc to /cvmfs/grid.cern.ch/vc/containers/runc.new, so that runc uses cgroupv2.
2. Add
GRUB_CMDLINE_LINUX_DEFAULT="vsyscall=emulate quiet"
to /etc/default/grub, run
sudo update-grub
and then restart, as described in https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5580#44241
3. Native Theory tasks should now run.
ID: 48074
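After the restart in step 2, it is easy to confirm that the kernel really booted with the new parameter by reading /proc/cmdline. A minimal sketch; has_kernel_param is an invented helper name, and its optional second argument exists only so the check can be pointed at another file.

```shell
# Succeed if the given parameter appears on the kernel command line.
# has_kernel_param is a hypothetical helper, not part of BOINC or cranky.
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    [ -r "$cmdline_file" ] || return 1
    # Put each space-separated parameter on its own line, then look for
    # an exact, whole-line match of the requested one.
    tr -s ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}

if has_kernel_param vsyscall=emulate; then
    echo "vsyscall=emulate is active"
else
    echo "vsyscall=emulate is not set on the running kernel"
fi
```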
rushmash

Joined: 19 Nov 14
Posts: 2
Credit: 2,103,893
RAC: 0
Message 48075 - Posted: 9 May 2023, 23:10:58 UTC

I can confirm, simply replacing all occurrences of 'containers/runc' with 'containers/runc.new' in /var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/cranky-0.0.32 fixed the issue on Fedora 38. No restart was needed.
Thanks @Tianyi Zhang!
ID: 48075
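The replacement described above can be scripted. This is a minimal sketch that writes the patched text to a separate file instead of editing in place; patch_runc_path is an invented name, and the pattern assumes the script refers to runc through a path containing vc/containers/runc, as quoted in this thread.

```shell
# Rewrite every CVMFS runc path in a script to point at runc.new.
# Usage: patch_runc_path INPUT OUTPUT   (hypothetical helper name)
# The optional \(\.new\)* swallows an existing ".new" suffix, so running
# the patch on an already patched file does not produce "runc.new.new".
patch_runc_path() {
    sed 's#vc/containers/runc\(\.new\)*#vc/containers/runc.new#g' "$1" > "$2"
}

# Example (file names are placeholders):
#   patch_runc_path cranky-0.0.32 cranky-0.0.32.patched
#   diff cranky-0.0.32 cranky-0.0.32.patched
```

Reviewing the diff before swapping the file in keeps the change easy to undo.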
Tianyi Zhang

Joined: 19 Feb 22
Posts: 2
Credit: 2,900,720
RAC: 0
Message 48076 - Posted: 10 May 2023, 1:36:58 UTC - in response to Message 48075.  
Last modified: 10 May 2023, 1:37:29 UTC

It seems that whenever BOINC contacts LHC@home, it checks the file size of cranky-0.0.32; if the size does not match, the file is downloaded again.
May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [---] Checking presence of 44 project files
May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [LHC@home] File projects/lhcathome.cern.ch_lhcathome/cranky-0.0.32 has wrong size: expected 6010, got 6030
May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [---] Using proxy info from GUI
May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 Initialization completed
May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [---] Suspending GPU computation - computer is in use
May 10 09:22:52 tyz-computer boinc[13726]: 10-May-2023 09:22:52 [climateprediction.net] Fetching scheduler list
May 10 09:22:53 tyz-computer boinc[13726]: 10-May-2023 09:22:53 [LHC@home] Started download of cranky-0.0.32
May 10 09:22:55 tyz-computer boinc[13726]: 10-May-2023 09:22:55 [LHC@home] Finished download of cranky-0.0.32

For me, the fix is to edit cranky-0.0.32 in such a way that its size remains 6010.
ID: 48076
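Each runc to runc.new substitution adds four bytes, so the 20-byte growth in the log above (6010 to 6030) would be consistent with five substitutions. A patched file can be checked against the expected size before BOINC sees it. The sketch below only covers the size comparison shown in the log; size_matches is a hypothetical helper name, and whether BOINC applies further checks is not addressed here.

```shell
# Succeed if FILE is exactly EXPECTED bytes long.
# Usage: size_matches FILE EXPECTED   (hypothetical helper name)
size_matches() {
    [ -f "$1" ] || return 1
    actual=$(wc -c < "$1")
    # $((actual)) strips the leading spaces that some wc builds print.
    [ "$((actual))" -eq "$2" ]
}

# Example with the size reported in the log above:
if size_matches cranky-0.0.32 6010; then
    echo "size unchanged"
else
    echo "size differs or file not found"
fi
```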
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,140,797
RAC: 105,338
Message 48113 - Posted: 18 May 2023, 6:30:01 UTC

CentOS9-VM Stream, CVMFS folder in slot Number?
07:31:39 CEST +02:00 2023-05-18: cranky-0.0.32: [INFO] Running Container 'runc'.
container_linux.go:336: starting container process caused "process_linux.go:293: applying cgroup configuration for process caused \"mountpoint for cgroup not found\""
07:31:39 CEST +02:00 2023-05-18: cranky-0.0.32: [INFO] Container 'runc' finished with status code 1.
ID: 48113
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 48455 - Posted: 17 Aug 2023, 7:58:36 UTC - in response to Message 48437.  

Is this just a case of making cranky use runc.new instead of runc?
ID: 48455
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,952,802
RAC: 137,008
Message 48456 - Posted: 17 Aug 2023, 9:10:36 UTC - in response to Message 48455.  

The issue starts at the cgroups level.

When cranky was written, CERN used a Linux version with cgroups v1.
Meanwhile, most Linux distributions use cgroups v2 (sometimes together with v1).

Cranky tests for the existence of the cgroup v1 freezer, which is used for suspend/resume but is not available under v2.
If the tests don't succeed, the tasks report an error.
Some users patch cranky and/or their Linux base system to skip the tests, which allows tasks to run.
Unfortunately, this needs active babysitting, since BOINC recognises the patch and overwrites it with the original version under certain circumstances.

The runc version also plays a role.
The CERN server provides 2 different versions and there can be other versions installed locally, e.g. those provided by the Linux vendor.


My suggestion would be to revise cranky to
- enable it to work under cgroups v2 (a must)
- find a solution for suspend/resume under cgroups v2 (may need a recent runc and a recent Linux kernel)
- keep the older methods in case the client computer runs an older OS
ID: 48456
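For the suspend/resume item in the list above, the kernel interfaces differ between the two hierarchies: the v1 freezer controller is driven through a freezer.state file, while cgroup v2 (Linux 5.2 and later) offers a cgroup.freeze file in each non-root cgroup. The following is only a sketch of that difference, not how cranky or runc implement it; set_frozen is a hypothetical name, and the caller is assumed to know the container's cgroup directory.

```shell
# Freeze or thaw every process in one cgroup directory.
# Usage: set_frozen CGROUP_DIR on|off   (hypothetical helper name)
set_frozen() {
    cg="$1"
    if [ -e "$cg/cgroup.freeze" ]; then
        # cgroup v2: per-cgroup file taking 1 (freeze) or 0 (thaw).
        case "$2" in
            on)  echo 1 > "$cg/cgroup.freeze" ;;
            off) echo 0 > "$cg/cgroup.freeze" ;;
            *)   return 2 ;;
        esac
    elif [ -e "$cg/freezer.state" ]; then
        # cgroup v1: the freezer controller takes FROZEN or THAWED.
        case "$2" in
            on)  echo FROZEN > "$cg/freezer.state" ;;
            off) echo THAWED > "$cg/freezer.state" ;;
            *)   return 2 ;;
        esac
    else
        echo "no freezer interface in $cg" >&2
        return 1
    fi
}

# Example (the path is a placeholder and writing to it needs privileges):
#   set_frozen /sys/fs/cgroup/my-container on
```

Checking for the v2 file first keeps the older v1 path available on systems that only have the legacy hierarchy.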
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,140,797
RAC: 105,338
Message 48457 - Posted: 17 Aug 2023, 9:24:30 UTC - in response to Message 48455.  

I am seeing two CentOS9 VMs with different cranky updates:
one with cranky-0.0.32 from 27.08.22, the other with
cranky-0.0.32 from 18.03.23.
ID: 48457
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2386
Credit: 222,952,802
RAC: 137,008
Message 48458 - Posted: 17 Aug 2023, 9:35:51 UTC - in response to Message 48457.  

In both cases, that is the date when your computer downloaded a fresh copy.
ID: 48458
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,140,797
RAC: 105,338
Message 48459 - Posted: 17 Aug 2023, 10:30:25 UTC - in response to Message 48458.  

In both cases, that is the date when your computer downloaded a fresh copy.

On both CentOS9 VMs, these are the dates shown under Properties in the file manager!
ID: 48459
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 48460 - Posted: 17 Aug 2023, 14:45:50 UTC - in response to Message 48456.  

Thanks for the detailed explanation, I will set aside some time to address this.
ID: 48460
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 48465 - Posted: 21 Aug 2023, 9:59:56 UTC - in response to Message 48460.  

I am currently testing an update on the Dev project: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=630#8136
ID: 48465


©2024 CERN