Thank you for the explanation!
> systemd isn't aware of it and it would clean the hierarchy according to its configuration
Yup, that is what I would expect from systemd after reading the docs: to be the full, robust and confident owner of the cgroup hierarchies on a host.
However, if you don’t mind going deeper into the «undefined behavior» and digging a bit into the implementation details, let me continue exploring this particular case.
For me it is interesting mostly for educational purposes.
For you it might be interesting for the sake of making systemd more robust against such invaders as kubelet+cgroupfs : )
########## (1) abandoned cgroup ##########
> systemd isn't aware of it and it would clean the hierarchy according to its configuration
systemd hasn’t deleted the unknown hierarchy; it’s still present:
# ls -lah /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/
total 0
drwxr-xr-x. 3 root root 0 Jul 16 08:06 .
drwxr-xr-x. 130 root root 0 Jul 16 08:06 ..
drwxr-xr-x. 3 root root 0 Jul 16 08:10 8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15
-rw-r--r--. 1 root root 0 Jul 17 06:04 cgroup.clone_children
-rw-r--r--. 1 root root 0 Jul 17 06:04 cgroup.procs
-rw-r--r--. 1 root root 0 Jul 17 06:04 notify_on_release
-rw-r--r--. 1 root root 0 Jul 17 06:04 tasks
cgroup.procs here and in its child cgroup 8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 are empty.
It seems there are no processes attached to these cgroups. The creation dates are Jul 16-17.
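For reference, the emptiness check can be scripted. This is only a minimal sketch, assuming a cgroup v1 layout in which every cgroup directory carries a cgroup.procs file; the function name cgroup_pids is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative, not an existing tool):
# print every PID listed in any cgroup.procs file at or below the given
# cgroup directory. No output means no process is attached in that subtree.
cgroup_pids() {
    find "$1" -name cgroup.procs -type f -exec cat {} +
}

# Example, using the path from the listing above:
# cgroup_pids /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55
```

An empty result only says that no process is attached right now; it says nothing about who created the cgroup.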
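########## (2) mysterious mount of systemd hierarchy ##########
Let’s look at it from another point of view: the host mounts. We’ve already seen them. On the host we can see two mounts of the same hierarchy: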
# cat /proc/self/mountinfo | grep cgr | grep syst
26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
2826 26 0:23 /kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
This seems to be a cyclic mount. The questions are: who created the second, mysterious mount, why, and when?
I have two candidates:
- runc during container creation;
- systemd, probably because it was confused by kubelet and its unexpected usage of cgroups.
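To spot such entries without reading the raw lines, the mountinfo output can be filtered. This is a rough sketch, assuming the field layout documented in proc(5): mount ID, parent ID, major:minor, root, mount point, options, optional fields, a "-" separator, then the filesystem type. The function name cgroup_submounts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read mountinfo-formatted lines on stdin and print
# "mount-id parent-id root" for every cgroup mount whose root is a subtree
# rather than "/", i.e. entries that expose only part of a hierarchy.
cgroup_submounts() {
    awk '{
        # The filesystem type is the first field after the "-" separator.
        fstype = ""
        for (i = 7; i <= NF; i++) if ($i == "-") { fstype = $(i + 1); break }
        if (fstype == "cgroup" && $4 != "/") print $1, $2, $4
    }'
}

# Example: cgroup_submounts < /proc/self/mountinfo
```

On the host above this should print only the 2826 entry. It shows which mount is unusual, not who created it.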
########## (3) suspected owner of mysterious mount is systemd-nspawn machine ##########
Let’s look at the situation from a third point of view, that of systemd-nspawn:
# machinectl list
MACHINE CLASS SERVICE OS VERSION ADDRESSES
centos75 container systemd-nspawn centos 7 -
frr container systemd-nspawn ubuntu 18.04 -
2 machines listed.
Let’s explore cgroups of centos75 machine:
# ls -lah /sys/fs/cgroup/systemd/machine.slice/systemd-nspawn\@centos75.service/payload/system.slice/ | grep sys-fs-cgroup-systemd
drwxr-xr-x. 2 root root 0 Nov 9 20:07 host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount
drwxr-xr-x. 2 root root 0 Jul 16 08:05 host\x2drootfs-sys-fs-cgroup-systemd.mount
drwxr-xr-x. 2 root root 0 Jul 16 08:05 host\x2drootfs-var-lib-machines-centos75-sys-fs-cgroup-systemd.mount
There are three interesting cgroups in the container. The first one seems to be related to the abandoned cgroup and the mysterious mount on the host.
Its creation date is Nov 9 20:07. I updated kubelet on Nov 8 at 12:01. Coincidence?! I don't think so.
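The .mount unit names in the listing follow systemd's path escaping, where "/" becomes "-" and a literal "-" becomes "\x2d". A small sketch for decoding them back into paths follows; the function name unit_to_path is made up, and it does not special-case the root path. On a host with systemd, systemd-escape --unescape --path on the name without its .mount suffix is the tool meant for this.

```shell
#!/bin/sh
# Hypothetical helper: turn a systemd mount unit name back into the path
# it encodes. "-" separates path components; "\xNN" encodes a literal byte.
unit_to_path() {
    printf '%s\n' "${1%.mount}" | awk '
        function hexval(c) { return index("0123456789abcdef", tolower(c)) - 1 }
        {
            out = "/"
            n = length($0)
            for (i = 1; i <= n; i++) {
                c = substr($0, i, 1)
                if (c == "-") {
                    out = out "/"
                } else if (c == "\\" && substr($0, i + 1, 1) == "x") {
                    # Decode the two hex digits that follow "\x".
                    code = hexval(substr($0, i + 2, 1)) * 16 + hexval(substr($0, i + 3, 1))
                    out = out sprintf("%c", code)
                    i += 3
                } else {
                    out = out c
                }
            }
            print out
        }'
}

# Example: unit_to_path 'host\x2drootfs-sys-fs-cgroup-systemd.mount'
```

Decoded this way, the first unit name points at the abandoned kubepods cgroup path under /host-rootfs/sys/fs/cgroup/systemd. That explains what the unit refers to, not why it appeared.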
##### questions #####
Unfortunately, I don’t know how to check the creation date/time of the mount point (2826 26 0:23) on the host system.
Probably systemd-nspawn is disrupted by the abandoned cgroup created by kubelet.
Q1. What is the meaning of this mount inside the centos75 container?
/system.slice/host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount
Q2. Why did the mount appear in the container on Nov 9 at 20:07?
Understanding the logic behind this situation, even though it’s obviously a wrong usage of systemd and kubelet+cgroupfs, will help us make some part(s) more robust and resistant to this kind of intervention.
##### mind-blowing but possibly important note #####
Here is a node in another cluster which has not yet been updated to kubelet 1.19.2 (the update to 1.19.2 reveals the situation, since kubelet starts to crash).
It runs kubelet v1.18.6 with hyperkube inside rkt.
The node already seems to have unhealthy mounts:
# cat /proc/self/mountinfo |grep systemd | grep cgr
26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
866 865 0:23 / /var/lib/rkt/pods/run/3720606d-535b-4e59-a137-ee00246a20c1/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
5253 26 0:23 /kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 /sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
5251 866 0:23 /kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 /var/lib/rkt/pods/run/3720606d-535b-4e59-a137-ee00246a20c1/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
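It also seems that systemd-nspawn is not affected yet, since there is no such cgroup inside the centos75 container (we have one on each machine), only the abandoned one with an empty cgroup.procs:
# find /sys/fs/ -name '*64ad01*'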
/sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3
Thursday, November 19, 2020 7:32 PM +09:00 from Michal Koutný <mkoutny@xxxxxxxx>:
Hi.
On Wed, Nov 18, 2020 at 09:46:03PM +0300, Andrei Enshin <b1os@xxxxx> wrote:
> Just out of curiosity, how systemd in particular may be disrupted with
> such record in root of it’s cgroups hierarchy as /kubpods/bla/bla
> during service (de)activation?
> Or how it may disrupt the kubelet or workload running by it?
If processes from kubelet.service are migrated elsewhere, systemd may
lose ability to associate it with the service (which may or may not be
correct, I didn't check this particular case).
In the opposite direction, if container runtime builds up a hierarchy
for a controller, systemd isn't aware of it and it would clean the
hierarchy according to its configuration (which can, for instance, be no
controllers at all) and happens during unit (de)activation. The
containers can get away with it when there are no unit changes at the
moment but that's not what you want. Furthermore, since cgroup
operations for a unit usually involve family [1], the interference may
happen even when apparently unrelated unit changes. (This applies to the
most common "hybrid" cgroup layout.)
> Seems I missed some technical details how exact it will interfere.
There's the defined interface (delegation or DBus API) and both parties
(systemd, container runtimes) have freedom to implement cgroups as they
wish within these limits.
If they overlap though, you get an undefined behavior in principle.
That's the reason why to stick to this convention.
Michal
[1] This is rather an implementation detail
https://github.com/systemd/systemd/blob/f56a9cbf9c20cd798258d3db302d51bf21458b38/src/core/cgroup.c#L2326
---
Best Regards,
Andrei Enshin
_______________________________________________ systemd-devel mailing list systemd-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/systemd-devel