On Fri, Apr 09, 2021 at 09:13:32AM -0400, Eric Farman wrote: > > I still fail to see how calling virCgroupRemove(group->nested) in > > virCgroupRemove() would help at all. The original issue you mentioned in > > the commit message is that we log this error: > > > > unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory > > > > So calling virCgroupRemove(group->nested) would also fail with: > > > > unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/libvirt/': No such file or directory > > > > because if the VM root cgroup ('machine-qemu\x2d1\x2dguest.scope') > > doesn't exist the nested cgroup ('libvirt') will not exist as well. > > While you are correct that the nested cgroup doesn't exist in sysfs, the > pointers being handled by libvirt still do since virCgroupFree() hasn't yet > been called. This message shows up because the cgroup we are processing (the > VM root cgroup) contains zeros for progfd and mapfd, and > virCgroupV2DevicesDetectProg() attempts to open the corresponding path that > systemd already cleaned up. > > The nested cgroup ('libvirt') has nonzero file descriptors, so the check at > the top of virCgroupV2DevicesDetectProg() would return before attempting to > open the nonexistent '/sys/fs/cgroup/.../libvirt/' path. In that case, the > message would NOT be generated, and the code will just go back to > virCgroupV2DevicesRemoveProg() to close the file descriptors. That's one bug that while calling virCgroupV2DevicesRemoveProg() we are passing the root cgroup but we need to pass the nested cgroup instead. This will fix the FD leak when VMs are started and destroyed while libvirt is running the whole time. > > Now to the progfd and mapfd being zero, that depends whether libvirtd > > was restarted while the VM was running. > > Huh? No, I see the progfd and mapfd fields zero in the VM root cgroup and > stored with the BPF file descriptors in the nested ('libvirt') cgroup with > your patch. Previously, they were in the VM root cgroup itself. > > When a VM is started on host > > with cgroups v2 libvirt will create BPF program and BPF map to limit > > access to system devices and stores both in the mentioned variables. > > But if you restart libvirtd it will load the BPF prog ID and BPF map ID > > only on demand, for example when destroying the VM. > > > > Now on hosts with systemd the owner of the VM root cgroup is systemd > > and because we use call talk to systemd-machined and register the VM > > there the VM root cgroup is created for us by systemd-machined and it is > > associated with qemu PID. When destroying the VM we will kill the qemu > > process which will trigger systemd-machined to automatically cleanup the > > cgroup. Once that happens kernel should eventually cleanup both BPF prog > > and BPF map that were associated with the nested cgroup because it no > > longer exist and there are no other references to the prog or map. > > It may take some time before kernel actually removes the prog and map. > > This all may be true, but as I said in my commit message libvirt's leaving > open two file descriptors for the BPF prog and map that aren't being closed > when the guest is shut down. I was attempting to trigger a different bug by > doing a virsh create/shutdown in a loop, and eventually my creates would > fail because I'd exhausted the number of open files. > > > The only thing on systemd based host we can do is to skip the whole BPF > > detection code if the directory was already removed which I've already > > proposed by handling ENOENT in virCgroupV2DevicesDetectProg(). > > Yes. That removes the message, but doesn't clean up the file descriptors > being left open. Correct, to me it was not obvious from the original commit message that we are leaking file descriptors because I was more focused on the error itself and I was understanding that you are trying to fix the code to stop printing that error. > > All of the above applies to systemd based hosts. If we consider system > > without systemd then there is an actual bug as you already pointed out > > that the BPF prog and BPF map are now associated with the nested cgroup, > > to fix that we should change only virCgroupV2Remove: > > > > diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c > > index 248d4047e5..4664492c34 100644 > > --- a/src/util/vircgroupv2.c > > +++ b/src/util/vircgroupv2.c > > @@ -523,6 +523,7 @@ static int > > virCgroupV2Remove(virCgroupPtr group) > > { > > g_autofree char *grppath = NULL; > > + virCgroupPtr parent = virCgroupGetNested(group); > > int controller; > > > > /* Don't delete the root group, if we accidentally > > @@ -534,7 +535,7 @@ virCgroupV2Remove(virCgroupPtr group) > > if (virCgroupV2PathOfController(group, controller, "", &grppath) < 0) > > return 0; > > > > - if (virCgroupV2DevicesRemoveProg(group) < 0) > > + if (virCgroupV2DevicesRemoveProg(parent) < 0) > > return -1; > > > > return virCgroupRemoveRecursively(grppath); > > > > > > This addresses both symptoms I'm experiencing, but it feels weird to be the > only place outside of vircgroup.c that needs to navigate/understand the > difference between group and group->nested. This is why I'd proposed that > logic in the caller, so it's covered by both v1 and v2, but if it's only the > v2 code that needs this then okay. I don't have a non-systemd system nearby > to try it with. We use BPF only with cgroups v2, cgroups v1 have normal devices controller with sysfs files that we can write/read to/from. I removed most of the previous quotes as they are not relevant now. To summarize the current code has two issues: 1) We are leaking BPF prog and BPF map file descriptors because they now live in the nested "libvirt" cgroup instead of the VM root cgroup. This can be addressed by the code above. This happens only if libvirtd was not restarted between VM start and VM destroy. 2) When running on systemd and libvirtd was restarted between VM start and VM destroy we will always attempt to detect BPF prog and BPF map in order to close it right away. Now that I think about it we can most likely drop the detection completely. That will fix the unnecessary error report when the cgroup was already deleted by systemd. I'll test it on non-systemd OS as well and post a patches if you are OK with that. Thanks for noticing the FD leak! Pavel
Attachment:
signature.asc
Description: PGP signature