On Fr, 29.09.23 10:53, Lewis Gaul (lewis.gaul@xxxxxxxxx) wrote:

> Hi systemd team,
>
> I've encountered an issue when running systemd inside a container using
> cgroups v2, where if a container exec process is created at the wrong
> moment during early startup then systemd will fail to move all processes
> into a child cgroup, and therefore fail to enable controllers due to the
> "no internal processes" rule introduced in cgroups v2. In other words, a
> systemd container is started and very soon after a process is created via
> e.g. 'podman exec systemd-ctr cmd', where the exec process is placed in
> the container's namespaces (although not a child of the container's
> PID 1).

Yeah, joining into a container is really weird: it makes a process
appear from nowhere, possibly blocking resources, outside of the
resource or lifecycle control of the code in the container, outside of
any security restrictions and so on.

I personally think joining a container via joining the namespaces
(i.e. podman exec) might be OK for debugging, but it's not a good
default workflow. Unfortunately, the problems with the approach are not
well understood by the container people.

In systemd's own container logic (i.e. systemd-nspawn + machinectl) we
hence avoid doing anything like this. "machinectl shell" and related
commands will instead talk to PID 1 in the container and ask it to
spawn something off, rather than doing so themselves.

Kinda related to this: util-linux's "unshare" tool (which can be used
to generically enter a container like this) is also pretty broken in
this regard btw, and I asked them to fix that, but nothing happened
there yet:

https://github.com/util-linux/util-linux/issues/2006

I'd advise "podman" and similar tools to never place joined processes
in the root cgroup of the container if they delegate cgroup access to
the container, because that really defeats the point. Instead they
should always join the cgroup of PID 1 in the container (which they
might already do, I think), and if PID 1 is in the root cgroup, then
they should create their own subcgroup "/joined" or so, and put the
process in there, to not collide with the "no processes in inner
groups" rule of cgroupv2. (See the sketch further down for roughly
what I mean.)

> This is not a totally crazy thing to be doing - this was hit when testing a
> systemd container, using a container exec "probe" to check when the
> container is ready.
>
> More precisely, the problem manifests as follows (in
> https://github.com/systemd/systemd/blob/081c50ed3cc081278d15c03ea54487bd5bebc812/src/core/cgroup.c#L3676
> ):
> - Container exec processes are placed in the container's root cgroup by
> default, but if this fails (due to the "no internal processes" rule) then
> container PID 1's cgroup is used (see
> https://github.com/opencontainers/runc/issues/2356).

This is a really bad idea. At the very least the rule should be
reversed (which would still be racy, but certainly better). But as
mentioned, they should never put anything in the root cgroup if cgroup
delegation is on.

> - At systemd startup, systemd tries to create the init.scope cgroup and
> move all processes into it.
> - If a container exec process is created after finding procs to move and
> moving them but before enabling controllers then the exec process will be
> placed in the root cgroup.
> - When systemd then tries to enable controllers via subtree_control in the
> container's root cgroup, this fails because the exec process is in that
> cgroup.
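Right, that's exactly the window. To make the runtime-side fix I'm
suggesting a bit more concrete, here's a rough sketch in C of what an
exec/join path could do instead of writing the joined PID into the
delegated root cgroup. This is purely illustrative, not actual
podman/runc (or systemd) code: the function names and the delegated
cgroup path are made up, and the "PID 1 is already in a subcgroup, just
join that" case is omitted for brevity.

/* Illustrative sketch only — not actual podman/runc code.
 *
 * Put a joined ("exec") process into a "joined" subcgroup of the
 * container's delegated cgroup v2 root, rather than into the root
 * itself, so that it never collides with the "no processes in inner
 * cgroups" rule when the container's init later writes to
 * cgroup.subtree_control.
 *
 * "delegated_root" is assumed to be the host-side path of the
 * container's delegated cgroup, e.g. something like
 * "/sys/fs/cgroup/machine.slice/libpod-<id>.scope".
 */

#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int move_pid_to(const char *cgroup_dir, pid_t pid) {
        char path[PATH_MAX], buf[32];
        int fd, n, r = 0;

        /* Attach the process by writing its PID into cgroup.procs. */
        snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
        fd = open(path, O_WRONLY | O_CLOEXEC);
        if (fd < 0)
                return -errno;

        n = snprintf(buf, sizeof(buf), "%d\n", (int) pid);
        if (write(fd, buf, n) < 0)
                r = -errno;

        close(fd);
        return r;
}

int place_joined_process(const char *delegated_root, pid_t joined_pid) {
        char joined[PATH_MAX];

        /* Create <delegated_root>/joined, a leaf cgroup owned by the
         * runtime, and tolerate it already existing. */
        snprintf(joined, sizeof(joined), "%s/joined", delegated_root);
        if (mkdir(joined, 0755) < 0 && errno != EEXIST)
                return -errno;

        /* Never put the joined process into the delegated root itself. */
        return move_pid_to(joined, joined_pid);
}

With something like that, the delegated root only ever contains the
container's PID 1 (until it has moved itself into init.scope), so
systemd's write to cgroup.subtree_control can't fail just because
somebody ran "podman exec" at the wrong moment.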
>
> The root of the problem here is that moving processes out of a cgroup and
> enabling controllers (such that new processes cannot be created there) is
> not an atomic operation, meaning there's a window where a new process can
> get in the way. One possible solution/workaround in systemd would be to
> retry under this condition. Or perhaps this should be considered a bug in
> the container runtimes?

Yes, that's what I think. They should fix that.

Lennart

--
Lennart Poettering, Berlin