Re: [PATCH 1/1] vircgroup: Cleanup nested cgroups

On 4/9/21 10:14 AM, Pavel Hrdina wrote:
On Fri, Apr 09, 2021 at 09:13:32AM -0400, Eric Farman wrote:
I still fail to see how calling virCgroupRemove(group->nested) in
virCgroupRemove() would help at all. The original issue you mentioned in
the commit message is that we log this error:

      unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory

So calling virCgroupRemove(group->nested) would also fail with:

      unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/libvirt/': No such file or directory

because if the VM root cgroup ('machine-qemu\x2d1\x2dguest.scope')
doesn't exist, the nested cgroup ('libvirt') will not exist either.

While you are correct that the nested cgroup doesn't exist in sysfs, the
pointers being handled by libvirt still do since virCgroupFree() hasn't yet
been called. This message shows up because the cgroup we are processing (the
VM root cgroup) contains zeros for progfd and mapfd, and
virCgroupV2DevicesDetectProg() attempts to open the corresponding path that
systemd already cleaned up.

The nested cgroup ('libvirt') has nonzero file descriptors, so the check at
the top of virCgroupV2DevicesDetectProg() would return before attempting to
open the nonexistent '/sys/fs/cgroup/.../libvirt/' path. In that case, the
message would NOT be generated, and the code would just return to
virCgroupV2DevicesRemoveProg() to close the file descriptors.

That's one bug: when calling virCgroupV2DevicesRemoveProg() we are
passing the root cgroup, but we need to pass the nested cgroup instead.
This will fix the FD leak when VMs are started and destroyed while
libvirt is running the whole time.

Now to the progfd and mapfd being zero: that depends on whether libvirtd
was restarted while the VM was running.

Huh? No, with your patch I see the progfd and mapfd fields zeroed in the VM
root cgroup, while the BPF file descriptors are stored in the nested
('libvirt') cgroup. Previously, they were in the VM root cgroup itself.

When a VM is started on a host
with cgroups v2, libvirt creates a BPF program and BPF map to limit
access to system devices, and stores both in the mentioned variables.
But if you restart libvirtd, it loads the BPF prog ID and BPF map ID
only on demand, for example when destroying the VM.

Now, on hosts with systemd the owner of the VM root cgroup is systemd:
because we talk to systemd-machined and register the VM there, the VM
root cgroup is created for us by systemd-machined and is associated
with the qemu PID. When destroying the VM we kill the qemu process,
which triggers systemd-machined to automatically clean up the cgroup.
Once that happens the kernel should eventually clean up both the BPF
prog and the BPF map that were associated with the nested cgroup,
because it no longer exists and there are no other references to the
prog or map. It may take some time before the kernel actually removes
them.

This all may be true, but as I said in my commit message, libvirt is leaving
two file descriptors open for the BPF prog and map when the guest is shut
down. I was attempting to trigger a different bug by doing a virsh
create/shutdown in a loop, and eventually my creates would fail because I'd
exhausted the number of open files.

The only thing we can do on a systemd-based host is skip the whole BPF
detection code if the directory was already removed, which I've already
proposed by handling ENOENT in virCgroupV2DevicesDetectProg().

Yes. That removes the message, but doesn't clean up the file descriptors
being left open.

Correct. It was not obvious to me from the original commit message that
we are leaking file descriptors, because I was more focused on the error
itself and understood it as you trying to fix the code to stop printing
that error.

I'm glad I mentioned both, though it wasn't obvious to me there was a relation between the two until I started looking into the file descriptor problem.


All of the above applies to systemd-based hosts. If we consider a system
without systemd, then there is an actual bug, as you already pointed out:
the BPF prog and BPF map are now associated with the nested cgroup.
To fix that we should change only virCgroupV2Remove():

diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c
index 248d4047e5..4664492c34 100644
--- a/src/util/vircgroupv2.c
+++ b/src/util/vircgroupv2.c
@@ -523,6 +523,7 @@ static int
   virCgroupV2Remove(virCgroupPtr group)
   {
       g_autofree char *grppath = NULL;
+    virCgroupPtr parent = virCgroupGetNested(group);
       int controller;

       /* Don't delete the root group, if we accidentally
@@ -534,7 +535,7 @@ virCgroupV2Remove(virCgroupPtr group)
       if (virCgroupV2PathOfController(group, controller, "", &grppath) < 0)
           return 0;

-    if (virCgroupV2DevicesRemoveProg(group) < 0)
+    if (virCgroupV2DevicesRemoveProg(parent) < 0)
           return -1;

       return virCgroupRemoveRecursively(grppath);



This addresses both symptoms I'm experiencing, but it feels weird that this
is the only place outside of vircgroup.c that needs to navigate/understand
the difference between group and group->nested. This is why I'd proposed
that logic in the caller, so it's covered by both v1 and v2, but if it's
only the v2 code that needs this then okay. I don't have a non-systemd
system nearby to try it with.

We use BPF only with cgroups v2; cgroups v1 has a normal devices
controller with sysfs files that we can read from and write to.

I removed most of the previous quotes as they are not relevant now.

To summarize, the current code has two issues:

1) We are leaking BPF prog and BPF map file descriptors because they now
    live in the nested "libvirt" cgroup instead of the VM root cgroup.
    This can be addressed by the code above. This happens only if
    libvirtd was not restarted between VM start and VM destroy.

Agreed.


2) When running on systemd, if libvirtd was restarted between VM start
    and VM destroy, we will always attempt to detect the BPF prog and
    BPF map in order to close them right away.

I'll trust you regarding this, because I haven't looked into what happens when libvirtd restarts.


    Now that I think about it, we can most likely drop the detection
    completely. That would fix the unnecessary error report when the
    cgroup was already deleted by systemd.


I'll test it on a non-systemd OS as well and post patches if you are OK
with that.

That sounds great.


Thanks for noticing the FD leak!

You're welcome. Thanks for taking the time to walk through this with me! Cheers,
Eric


Pavel




