Hi systemd team,
I've hit an issue when running systemd inside a container on cgroups v2: if a container exec process is created at the wrong moment during early startup, systemd fails to move all processes into a child cgroup and therefore fails to enable controllers, because of the "no internal processes" rule introduced in cgroups v2. Concretely, a systemd container is started and very soon afterwards a process is created via e.g. 'podman exec systemd-ctr cmd'; the exec process is placed in the container's namespaces but is not a child of the container's PID 1. This is not an unreasonable thing to do: I hit it while testing a systemd container, using a container exec "probe" to check when the container is ready.
More precisely, the problem manifests as follows (in https://github.com/systemd/systemd/blob/081c50ed3cc081278d15c03ea54487bd5bebc812/src/core/cgroup.c#L3676):
- Container exec processes are placed in the container's root cgroup by default, but if this fails (due to the "no internal processes" rule) then container PID 1's cgroup is used (see https://github.com/opencontainers/runc/issues/2356).
- At systemd startup, systemd tries to create the init.scope cgroup and move all processes into it.
- If a container exec process is created after systemd has enumerated and moved the existing processes, but before it has enabled controllers, then the exec process is placed in the root cgroup (nothing prevents this yet, since no controllers are enabled there).
- When systemd then tries to enable controllers via subtree_control in the container's root cgroup, this fails because the exec process is in that cgroup.
The root of the problem here is that moving processes out of a cgroup and enabling controllers (such that new processes cannot be created there) is not an atomic operation, meaning there's a window where a new process can get in the way. One possible solution/workaround in systemd would be to retry under this condition. Or perhaps this should be considered a bug in the container runtimes?
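To make the retry idea concrete, here is a rough Python sketch of what I mean. This is not systemd code and not a proposed patch: the function name, the 'attempts' parameter and the retry limit are all made up for illustration. It assumes (to my understanding of the kernel behaviour) that writing to cgroup.subtree_control fails with EBUSY while the cgroup still contains processes, and that moving a PID which has already exited fails with ESRCH.

```python
import errno
from pathlib import Path


def enable_controllers_with_retry(root, child, controllers, attempts=5):
    """Move processes out of `root` into `child`, then enable controllers.

    Hypothetical helper: repeats the whole move-then-enable sequence if the
    enable step fails with EBUSY, which is assumed to mean that a new
    process appeared in `root` between the two steps.
    Returns True on success, False if every attempt hit EBUSY.
    """
    root = Path(root)
    child_procs = root / child / "cgroup.procs"
    request = " ".join("+" + name for name in controllers)

    for _ in range(attempts):
        # Step 1: move every process currently listed in the root cgroup.
        for pid in (root / "cgroup.procs").read_text().split():
            try:
                child_procs.write_text(pid)
            except OSError as exc:
                # The process may have exited since it was listed; skip it.
                if exc.errno != errno.ESRCH:
                    raise

        # Step 2: try to enable the controllers for child cgroups.
        try:
            (root / "cgroup.subtree_control").write_text(request)
            return True
        except OSError as exc:
            # EBUSY is taken to mean "a process is still in this cgroup",
            # so go round the loop and move the stragglers again.
            if exc.errno != errno.EBUSY:
                raise
    return False
```

Something like enable_controllers_with_retry("/sys/fs/cgroup", "init.scope", ["memory", "pids"]) would be the intended call inside the container. A bounded retry only narrows the window rather than closing it, since a runtime that keeps spawning exec processes could in principle win the race every time.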
I have some tests exercising systemd containers at https://github.com/LewisGaul/systemd-containers which are able to reproduce this issue on a cgroups v2 host (in testcase tests/test_exec_procs.py::test_exec_proc_spam):
(venv) root@ubuntu:~/systemd-containers# pytest --log-cli-level debug -k exec_proc_spam --cgroupns private --setup-modes default --container-exe podman
INFO tests.conftest:conftest.py:474 Running container image localhost/ubuntu-systemd:20.04 with args: entrypoint=, command=['bash', '-c', 'sleep 1 && exec /sbin/init'], cap_add=['sys_admin'], systemd=always, tty=True, interactive=True, detach=True, remove=False, cgroupns=private, name=systemd-tests-1695981045.12
DEBUG tests.test_exec_procs:test_exec_procs.py:106 Got PID 1 cgroups:
0::/init.scope
DEBUG tests.test_exec_procs:test_exec_procs.py:111 Got exec proc 3 cgroups:
0::/init.scope
DEBUG tests.test_exec_procs:test_exec_procs.py:111 Got exec proc 21 cgroups:
0::/
DEBUG tests.test_exec_procs:test_exec_procs.py:114 Enabled controllers: set()
============================================================================= short test summary info =============================================================================
FAILED tests/test_exec_procs.py::test_exec_proc_spam[private-unified-default] - AssertionError: assert set() >= {'memory', 'pids'}
Does anyone have any thoughts on this? Should this be considered a systemd bug, or is it at least worth adding some explicit handling for this case? Is there something container runtimes are doing wrong here from the perspective of systemd?
Thanks,
Lewis