Re: Questions around cgroups, systemd, containers

Michal Koutný <mkoutny@xxxxxxxx> · Tue, 24 May 2022 16:33:30 +0200

Hello Lewis.

On Sat, May 21, 2022 at 07:02:14PM +0100, Lewis Gaul <lewis.gaul@xxxxxxxxx> wrote:
> The question was "why is the cgroupfs mounted read-only inside a container
> in non-privileged?" - when there's a cgroup namespace it seems it should be
> safe [under v2 cgroups] for the container to have write access to its
> cgroupfs?

Yes (given your assumption that the controllers are on v2). I guess the
read-only default dates from v1 times (and v1 setups are still around too).
The namespace is an additional measure that makes the host's cgroup tree
invisible.
(If you're privileged, the nsdelegate mount option is relevant.)
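
A rough sketch of how one could check both points (read-only mount,
nsdelegate) from inside a container. It only parses text in the
/proc/self/mountinfo format; the names CgroupMount and parse_cgroup_mounts
are made up for illustration, and I'm assuming "ro" shows up in the
per-mount options while nsdelegate shows up in the superblock options.

```python
from typing import List, NamedTuple


class CgroupMount(NamedTuple):
    mount_point: str
    fstype: str        # "cgroup" (v1) or "cgroup2" (v2)
    read_only: bool    # "ro" present in the per-mount options
    nsdelegate: bool   # "nsdelegate" present in the superblock options


def parse_cgroup_mounts(mountinfo_text: str) -> List[CgroupMount]:
    """Extract cgroup mounts from text in /proc/<pid>/mountinfo format."""
    mounts = []
    for line in mountinfo_text.splitlines():
        # Fields before " - " are per-mount, fields after it describe the
        # filesystem: type, source, superblock options.
        left, sep, right = line.partition(" - ")
        left_fields = left.split()
        right_fields = right.split()
        if not sep or len(left_fields) < 6 or not right_fields:
            continue
        fstype = right_fields[0]
        if fstype not in ("cgroup", "cgroup2"):
            continue
        mount_opts = left_fields[5].split(",")
        super_opts = right_fields[2].split(",") if len(right_fields) > 2 else []
        mounts.append(
            CgroupMount(
                mount_point=left_fields[4],
                fstype=fstype,
                read_only="ro" in mount_opts,
                nsdelegate="nsdelegate" in super_opts,
            )
        )
    return mounts
```

Feeding it the contents of /proc/self/mountinfo inside the container
should show whether the container manager left the cgroup mount read-only.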

> Hopefully my explanation above makes this clearer. Replacing the cgroup
> mounts set up by the container manager before exec-ing systemd is one
> possible workaround for the fact docker creates the cgroup mounts
> read-only. As I understand it, systemd requires CAP_SYS_ADMIN anyway, and
> this gives us the privileges required to modify (or unmount and recreate)
> the cgroup mounts.

CAP_SYS_ADMIN in the init user namespace is needed to create cgroup
hierarchies. You can (bind) mount existing ones within a user namespace,
but then you should not get access to higher levels of the hierarchies.

> This is explained at
> https://www.lewisgaul.co.uk/blog/coding/rough/2022/05/20/cgroups-questions/#why-are-the-containers-cgroup-limits-not-set-on-a-parent-cgroup-under-dockerpodman.
> I'm basically questioning why a cgroup limit applied by e.g. 'docker run
> --memory=20000000' is applied in a cgroup that is made available
> in/delegated to the container, such that the container is able to modify
> its own limit (if it has write access). It feels like there's a missing
> cgroup layer in this setup. If others agree with this assessment then I
> would be happy to bring it up on the docker/podman issue trackers.

An unprivileged container should not be able to modify its limits. (Only
cgroup.procs, cgroup.subtree_control (and cgroup.threads) of the
container root cgroup should be writable for the container.)

Think of it as the root cgroup on the host proper: there you cannot modify
the real resources (memory, CPUs, ...; neglecting hotplug here).

(That's also a give-away of whether you're in a container or on the host,
e.g. /cgroup/mount/memory.max won't exist on the host, while it'd be
read-only in a (memory constrained) container.)
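
The memory.max give-away could be probed roughly like this. It is a
heuristic sketch only: memory_max_hint is a made-up name, the default
path /sys/fs/cgroup is my assumption for where the v2 hierarchy is
mounted, and a missing file can also just mean the memory controller is
not enabled for that cgroup.

```python
import os


def memory_max_hint(cgroup_mount: str = "/sys/fs/cgroup") -> str:
    """Report how memory.max looks at the root of the given cgroup2 mount.

    "absent"    - no memory.max, as expected in the real root cgroup
    "read-only" - present but not writable, as expected in a
                  (memory constrained) container
    "writable"  - present and writable by the calling process
    """
    path = os.path.join(cgroup_mount, "memory.max")
    if not os.path.exists(path):
        return "absent"
    # For root, permission bits do not restrict this check, so "read-only"
    # then mostly reflects a read-only mount of the cgroup filesystem.
    if os.access(path, os.W_OK):
        return "writable"
    return "read-only"
```

A "writable" result inside a container would point at the situation Lewis
describes, where the container can change its own limit.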

> Ah ok, that's interesting. So it's not technically possible to always be
> able to say "the host's active cgroup version is {1,2}", it would have to
> be on a per-controller basis such as "the cgroup memory controller is
> enabled on version {1,2}"? In practice is this likely to be a case that's
> encountered in the wild [on a host running systemd]?

Strictly speaking, v1/v2 is per-controller, plus there exists the default
hierarchy (v2 only, no controllers by default).
With systemd, in practice you encounter three setups: hybrid, unified
(and legacy), as explained in "Three Different Tree Setups" [1].
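
For illustration, the three setups could be told apart roughly as below.
This is a sketch working on mountinfo text rather than the statfs() check
on /sys/fs/cgroup that [1] describes; classify_cgroup_setup is a made-up
name and the paths assume the usual systemd layout.

```python
def classify_cgroup_setup(mountinfo_text: str, base: str = "/sys/fs/cgroup") -> str:
    """Classify mountinfo text as "unified", "hybrid", "legacy" or "unknown"."""
    fstypes = {}
    for line in mountinfo_text.splitlines():
        left, sep, right = line.partition(" - ")
        left_fields = left.split()
        right_fields = right.split()
        if not sep or len(left_fields) < 5 or not right_fields:
            continue
        # Map mount point -> filesystem type; later mounts win.
        fstypes[left_fields[4]] = right_fields[0]

    if fstypes.get(base) == "cgroup2":
        return "unified"   # v2 hierarchy mounted directly at the base
    if fstypes.get(base + "/unified") == "cgroup2":
        return "hybrid"    # v1 hierarchies plus a v2 one under unified/
    if any(t == "cgroup" and mp.startswith(base + "/")
           for mp, t in fstypes.items()):
        return "legacy"    # v1 hierarchies only
    return "unknown"
```

Note this says nothing about which controllers are attached where; that
remains a per-controller question.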

HTH,
Michal

[1] https://systemd.io/CGROUP_DELEGATION/