Thank you for the reply!
> I'm happier when it's a well defined and reproducible case
Agreed. It is somewhat reproducible, but the trigger is still unclear: during a k8s update, some nodes get into this state with a cyclic systemd mount.
> What systemd version is it? What cgroup setup is it (legacy or hybrid)?
systemd 241 (241-23-g05e654e+)
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS -ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN2 -IDN +PCRE2 default-hierarchy=legacy
It uses the legacy setup; there is no /sys/fs/cgroup/unified/.
> Anyway you can try tracing mounts systemwide
Yup, I’ve set up audit on the mount() syscall and tried to reproduce it semi-manually, but no luck so far:
# auditctl -l
-a always,exit -S mount
I’m going to update another cluster with audit enabled, so I can catch who does this mount.
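To make the audit records easier to attribute, the rule can carry a key and the matching records can be reduced to the calling process. This is only a minimal sketch: it assumes a rule such as `auditctl -a always,exit -F arch=b64 -S mount -k cgroup-mount` and raw records from `ausearch -k cgroup-mount --raw` on stdin. The key name `cgroup-mount` and the helper `mount_callers` are made up for illustration.

```shell
# Hypothetical helper: reduce raw audit SYSCALL records (stdin) to
# "pid comm exe" of the process that issued the syscall.
mount_callers() {
  awk '/^type=SYSCALL/ {
    pid = comm = exe = "?"
    for (i = 1; i <= NF; i++) {
      eq = index($i, "=")
      if (eq == 0) continue
      k = substr($i, 1, eq - 1)   # field name, e.g. pid, ppid, comm
      v = substr($i, eq + 1)      # field value, quotes kept as logged
      if (k == "pid") pid = v
      else if (k == "comm") comm = v
      else if (k == "exe") exe = v
    }
    print pid, comm, exe
  }'
}
```

Something like `ausearch -k cgroup-mount --raw | mount_callers | sort | uniq -c` would then give a per-process count of mount() calls.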
> It doesn't mean that the mount was done within the container
Yup, however this transient .mount unit appears only inside the systemd-nspawn container, which runs its own systemd.
The machine’s main systemd has no such transient .mount unit.
This does not prove that the container or systemd does the cyclic mount, but it shifts my suspicion towards them for lack of other clues (I may well be wrong).
> how was systemd-nspawn instructed to realize mounts for the container
Is that defined somewhere in the systemd-nspawn source or in some config files?
> possibly after daemon-reload
Yup, I did a daemon-reload in the outer systemd, but I’m not sure one was done in the inner one.
> Is there the conflicting cgroup driver used again?
Unfortunately, yes. We have been using the cgroupfs driver widely and for a long time.
We are considering migrating away from it as soon as possible.
I’m also thinking of proposing a PR that prevents kubelet from running with the cgroupfs driver on a systemd machine, using a check similar to:
https://github.com/opencontainers/runc/blob/27227a9358b54c253e3dad85cfe532a256b88e00/libcontainer/cgroups/systemd/common.go#L49
But it seems the k8s folks are not very interested in it.
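The kind of guard meant here could look roughly like the sketch below. It relies on the test documented in sd_booted(3): systemd is the running init system if /run/systemd/system exists as a directory (not a symlink). As far as I can tell, the linked runc code does the same. The function names and the optional root-prefix argument are made up for illustration; `cgroupfs` is the value kubelet accepts for `--cgroup-driver`.

```shell
# Succeeds if systemd is the running init system, per sd_booted(3).
# $1 is an optional root prefix (hypothetical, only to ease testing).
systemd_booted() {
  [ -d "${1:-}/run/systemd/system" ] && [ ! -L "${1:-}/run/systemd/system" ]
}

# Hypothetical guard: refuse the cgroupfs driver on a systemd machine.
# $1 = configured cgroup driver, $2 = optional root prefix.
check_cgroup_driver() {
  if [ "$1" = "cgroupfs" ] && systemd_booted "${2:-}"; then
    echo "cgroup driver 'cgroupfs' conflicts with systemd as init" >&2
    return 1
  fi
  return 0
}
```

A wrapper script could call `check_cgroup_driver cgroupfs || exit 1` before starting kubelet.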
Tuesday, November 24, 2020 4:40 AM +09:00 from Michal Koutný <mkoutny@xxxxxxxx>:
On Thu, Nov 19, 2020 at 10:14:18PM +0300, Andrei Enshin <b1os@xxxxx> wrote:
> For you it might be interesting in sake of improving robustness of
> systemd in case of such invaders as kubelet+cgroupfs : )
I think the interface is clearly defined in the CGROUP_DELEGATION
document though.
I'm happy if a bug can be found in general. I'm happier when it's a well
defined and reproducible case.
> ########## (1) abandoned cgroup ##########
> > systemd isn't aware of it and it would clean the hierarchy according to its configuration
That was related to a controller hierarchy (which I understood was the
k8s issue about).
Below it is a named hierarchy, where it is different again.
> systemd hasn’t deleted the unknown hierarchy, it’s still presented:
> [...]
> cgroup.procs here and in its child cgroup 8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 are empty.
> Seems there are no processes attached to these cgroups. Date of creation is Jul 16-17.
What systemd version is it? What cgroup setup is it (legacy or hybrid)?
> ########## (2) mysterious mount of systemd hierarchy ##########
> [...]
> Seems to be cyclic mount. Questions are who, why and when did the second mysterious mount?
> I have two candidates:
> - runc during container creation;
> - systemd, probably because it was confused by kubelet and its unexpected usage of cgroups.
I don't see why/how would systemd (PID 1) do this (not sure about
nspawn). Anyway you can try tracing mounts systemwide (e.g. `perf trace
-a -e syscalls:sys_enter_mount`) to find out who does the mount.
> ########## (3) suspected owner of mysterious mount is systemd-nspawn machine ##########
> [...]
> Let’s explore cgroups of centos75 machine:
> # ls -lah /sys/fs/cgroup/systemd/machine.slice/systemd-nspawn\@centos75.service/payload/system.slice/ | grep sys-fs-cgroup-systemd
>
> drwxr-xr-x. 2 root root 0 Nov 9 20:07 host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount
>
> drwxr-xr-x. 2 root root 0 Jul 16 08:05 host\x2drootfs-sys-fs-cgroup-systemd.mount
>
> drwxr-xr-x. 2 root root 0 Jul 16 08:05 host\x2drootfs-var-lib-machines-centos75-sys-fs-cgroup-systemd.mount
> There are three interesting cgroups in container. First one seems to be in relation with the abandoned cgroup and mysterious mount on the host.
Note those are cgroups created for .mount units (and under nested
payload's system.slice). It tells that within the container a mount
point at
> /host-rootfs/sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15
was visible. It doesn't mean that the mount was done within the
container.
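The mapping between a .mount unit name and its mount point follows systemd's path escaping: "/" becomes "-", and a literal "-" becomes "\x2d". Below is a minimal sketch that reverses it for unit names like the ones listed above. The helper name `unit_to_path` is made up, and it only decodes the `\x2d` escape; `systemd-escape --unescape --path` handles the general case.

```shell
# Hypothetical helper: turn a .mount unit name back into the mount point path.
# Only the \x2d escape (a literal "-") is decoded here.
unit_to_path() {
  printf '%s\n' "${1%.mount}" |
    sed -e 's|-|/|g' -e 's|\\x2d|-|g' -e 's|^|/|'
}
```

For example, `unit_to_path 'host\x2drootfs-sys-fs-cgroup-systemd.mount'` yields `/host-rootfs/sys/fs/cgroup/systemd`.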
I can't tell why was that, it depends how was systemd-nspawn instructed
to realize mounts for the container.
> Creation date is Nov 9 20:07. I’ve updated kubelet at Nov 8 12:01. Coincidence?! I don't think so.
Yes, it can be related. For instance:
- The cyclic bind mount happened,
- its visibility was propagated into the nspawn container
- and inner systemd created cgroup for the (generated) .mount unit
(possibly after daemon-reload).
> Q1. Let me ask, what is the meaning of mount inside centos75 container?
> /system.slice/host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount
>
> Q2. Why the mount appeared in the container at Nov 9, 20:07 ?
Hopefully, it's answered above.
> ##### mind-blowing but might be important note #####
> [...]
> The node already seems to have not healthy mounts:
Is there the conflicting cgroup driver used again?
> # cat /proc/self/mountinfo |grep systemd | grep cgr
> 26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
> 866 865 0:23 / /var/lib/rkt/pods/run/3720606d-535b-4e59-a137-ee00246a20c1/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
> 5253 26 0:23 /kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 /sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
> 5251 866 0:23 /kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 /var/lib/rkt/pods/run/3720606d-535b-4e59-a137-ee00246a20c1/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
> Also seems systemd-nspawn is not affected yet, since there is no such cgroup inside centos75 container (we have it on each machine) but only abandoned one, with empty cgroup.procs:
It'd depend on the mounts propagation into that container and what
systemd inside that container did (i.e. the mount unit may not have been
created yet).
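The entries that stand out in the mountinfo excerpt above are cgroup mounts whose root (field 4) is not "/", that is, a sub-directory of the hierarchy mounted as a mount point of its own. Below is a minimal sketch for listing them on a node. The helper name `cgroup_submounts` is made up, and it assumes cgroup v1 (filesystem type `cgroup`). Inside a container's mount namespace such entries can be normal, so it is mainly meaningful on the host.

```shell
# Hypothetical helper: list cgroup (v1) mounts whose root is not "/".
# $1 = mountinfo file (defaults to /proc/self/mountinfo).
# Prints: mount ID, root within the hierarchy, mount point.
cgroup_submounts() {
  # mountinfo fields: 1=mount ID 2=parent ID 3=major:minor 4=root
  # 5=mount point 6=options, then optional fields, a "-" separator,
  # and the filesystem type.
  awk '{
    fstype = ""
    for (i = 7; i < NF; i++)
      if ($i == "-") { fstype = $(i + 1); break }
    if (fstype == "cgroup" && $4 != "/") print $1, $4, $5
  }' "${1:-/proc/self/mountinfo}"
}
```

Running `cgroup_submounts` periodically during a cluster update, together with the mount() audit, could help narrow down when the extra mount appears.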
Michal
---
Best Regards,
Andrei Enshin