I omitted one piece of information about running with --cgroupns=private, thinking it was unrelated, but it now appears it may be related (and perhaps highlights a variant of the issue that is seen on first boot, not only on container restart). What makes me think it's related is that, again, I can reproduce this on a CentOS host but not on an Ubuntu host (still with SELinux in 'permissive' mode).
[root@localhost ~]# podman run -it --name ubuntu --privileged --cgroupns private ubuntu-systemd
systemd 245.4-4ubuntu3.19 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTI)
Detected virtualization podman.
Detected architecture x86-64.
Welcome to Ubuntu 20.04.5 LTS!
Set hostname to <daca3bb894b7>.
Couldn't move remaining userspace processes, ignoring: Input/output error
Failed to create compat systemd cgroup /system.slice: No such file or directory
Failed to create compat systemd cgroup /system.slice/system-getty.slice: No such file or directory
[ OK ] Created slice system-getty.slice.
Failed to create compat systemd cgroup /system.slice/system-modprobe.slice: No such file or directory
[ OK ] Created slice system-modprobe.slice.
Failed to create compat systemd cgroup /user.slice: No such file or directory
[ OK ] Created slice User and Session Slice.
[ OK ] Started Dispatch Password Requests to Console Directory Watch.
[ OK ] Started Forward Password Requests to Wall Directory Watch.
This first warning ("Couldn't move remaining userspace processes") comes from one of the same areas of code I linked to in my first email: https://github.com/systemd/systemd/blob/v245/src/core/cgroup.c#L2967.
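If it would help to pin down why the cgroup creation fails, I can re-capture this boot with systemd's debug logging enabled. A rough sketch of what I have in mind (SYSTEMD_LOG_LEVEL is the standard systemd environment variable; I haven't re-run with it yet):

podman run -it --name ubuntu --privileged --cgroupns private \
    -e SYSTEMD_LOG_LEVEL=debug ubuntu-systemd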
I see the same thing with '--cap-add sys_admin' in place of '--privileged', and again with both docker and podman.
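For reference, the checks I described in my previous email (quoted below) boil down to roughly the following, comparing first boot against after a 'podman restart ubuntu' (just a sketch — container name as above, and the trailing 'test' directory name is arbitrary):

# where is PID 1, and does init.scope exist?
podman exec ubuntu cat /proc/1/cgroup
ctr=$(podman inspect -f '{{.Id}}' ubuntu)
podman exec ubuntu ls /sys/fs/cgroup/systemd/machine.slice/libpod-"$ctr".scope/
# confirm the container itself can create cgroup dirs there
podman exec ubuntu mkdir /sys/fs/cgroup/systemd/machine.slice/libpod-"$ctr".scope/test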
Thanks,
Lewis
On Tue, 10 Jan 2023 at 15:28, Lewis Gaul <lewis.gaul@xxxxxxxxx> wrote:
I'm aware of the higher level of collaboration between podman and systemd compared to docker, hence primarily raising this issue from a podman angle.

In privileged mode all mounts are read-write, so yes, the container has write access to the cgroup filesystem. (Podman also ensures write access to the systemd cgroup subsystem mount in non-privileged mode by default.)

On first boot PID 1 can be found in /sys/fs/cgroup/systemd/machine.slice/libpod-<ctr-id>.scope/init.scope/cgroup.procs, whereas when the container restarts the 'init.scope/' directory does not exist and PID 1 is instead found in the parent (container root) cgroup /sys/fs/cgroup/systemd/machine.slice/libpod-<ctr-id>.scope/cgroup.procs (also reflected by /proc/1/cgroup). This is strange because systemd must be the one creating this cgroup dir on the initial boot, so I'm not sure why it wouldn't on subsequent boots.

I can confirm that the container has permissions, since executing a 'mkdir' in /sys/fs/cgroup/systemd/machine.slice/libpod-<ctr-id>.scope/ inside the container succeeds after the restart, so I have no idea why systemd is not creating the 'init.scope/' dir. I notice that inside the container's systemd cgroup mount 'system.slice/' does exist, but 'user.slice/' does not (both exist on a normal boot). Is there any way I can find systemd logs that might indicate why the cgroup dir creation is failing?

One final datapoint: the same is seen when using a private cgroup namespace (via 'podman run --cgroupns=private'), although then the error is, as expected, "Failed to attach 1 to compat systemd cgroup /init.scope: No such file or directory".

I could raise this with the podman team, but it seems more in the systemd area given it's a systemd warning and I would expect systemd to be creating this cgroup dir?

Thanks,
Lewis

On Tue, 10 Jan 2023 at 14:48, Lennart Poettering <lennart@xxxxxxxxxxxxxx> wrote:

On Di, 10.01.23 13:18, Lewis Gaul (lewis.gaul@xxxxxxxxx) wrote:
> Following 'setenforce 0' I still see the same issue (I was also suspecting
> SELinux!).
>
> A few additional data points:
> - this was not seen when using systemd v230 inside the container
> - this is also seen on CentOS 8.4
> - this is seen under docker even if the container's cgroup driver is
> changed from 'cgroupfs' to 'systemd'
docker is garbage. They are hostile towards running systemd inside
containers.
podman upstream is a lot friendlier, and apparently what everyone in OCI
is going towards these days.
I don't have much experience with podman though, and in particular not
old versions. Next step would probably be to look at what precisely
causes the permission issue, via strace.
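A rough sketch of what I mean (untested here — it needs strace installed in the image, and assumes the image boots via /sbin/init; the -D flag keeps systemd as PID 1 rather than making it a child of strace):

podman run -it --name ubuntu --privileged ubuntu-systemd \
    strace -D -f -e trace=mkdir,mkdirat -o /tmp/strace.log /sbin/init

Then look in /tmp/strace.log inside the container after the restart for the failing syscall.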
but did you make sure your container actually gets write access to the
cgroup trees?
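i.e. something along these lines, just to see whether the hierarchy shows up rw or ro from the container's point of view (adjust the container name to yours):

podman exec ubuntu grep cgroup /proc/1/mounts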
anyway, i'd recommend asking the podman community for help with this.
Lennart
--
Lennart Poettering, Berlin