Re: Systemd cgroup setup issue in containers

On Fri, Sep 29, 2023, 12:54 Lewis Gaul <lewis.gaul@xxxxxxxxx> wrote:
Hi systemd team,

I've encountered an issue when running systemd inside a container with cgroups v2: if a container exec process is created at the wrong moment during early startup, systemd fails to move all processes into a child cgroup and therefore cannot enable controllers, due to the "no internal processes" rule introduced in cgroups v2. Concretely, a systemd container is started and, very soon after, a process is created via e.g. 'podman exec systemd-ctr cmd'; the exec process is placed in the container's namespaces, although it is not a child of the container's PID 1. This is not a totally crazy thing to be doing - it was hit while testing a systemd container, using a container exec "probe" to check when the container is ready.

Wouldn't it be better to have the container inform the host via NOTIFY_SOCKET (the Type=notify mechanism)? I believe systemd has had support for sending readiness notifications from init to a container manager for quite a while.
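For illustration, here is a minimal sketch of that handshake in Python. The socket path, function names, and single-message flow are all illustrative assumptions, not podman's or systemd's actual implementation; the point is only the mechanism: the container manager binds an AF_UNIX datagram socket and exports its path as NOTIFY_SOCKET, and systemd inside the container sends "READY=1" once startup completes.

```python
import os
import socket
import tempfile

def make_notify_socket(path):
    """Manager side: bind the datagram socket that NOTIFY_SOCKET points at."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(path)
    return sock

def send_ready(path):
    """Container side: roughly what sd_notify(0, "READY=1") does."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.sendto(b"READY=1", path)
    sock.close()

def wait_for_ready(sock):
    """Manager side: block until the container reports readiness."""
    while True:
        data = sock.recv(4096)
        if b"READY=1" in data.split(b"\n"):
            return True

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "notify.sock")
    sock = make_notify_socket(path)
    send_ready(path)  # in reality systemd inside the container sends this
    assert wait_for_ready(sock)
```

With this scheme the manager only treats the container as ready after receiving READY=1, so there is no need to probe with exec (and no race against systemd's early cgroup setup).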

(Alternatively, connect out to the container's systemd or dbus Unix socket and query it directly that way, but NOTIFY_SOCKET would avoid the need to time it correctly.)

Other than that: I'm not a container expert, but this does seem like a self-inflicted problem to me. If you spawn processes that systemd doesn't know about, it makes sense that systemd will fail to handle them.
