On 10/6/22 6:42 PM, Jason Gunthorpe wrote:
> On Thu, Oct 06, 2022 at 01:53:15PM -0600, Alex Williamson wrote:
>> On Thu, 6 Oct 2022 09:40:35 -0300
>> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>>
>>> Testing has shown that virtnodedevd will leave the group FD open for long
>>> periods, even after all the cdevs have been destroyed. This blocks
>>> destruction of the VFIO device and is undesirable.
>>>
>>> That approach was selected to accommodate SPAPR, which has a broken
>>> lifecycle model for the iommu_group. However, we can accommodate SPAPR by
>>> realizing that it doesn't use the iommu core at all, so rules about
>>> iommu_group lifetime do not apply to it.
>>>
>>> Giving the KVM code its own kref on the iommu_group allows the VFIO core
>>> code to release its iommu_group reference earlier, and we can remove the
>>> sleep that only existed for SPAPR.
>>>
>>> Jason Gunthorpe (3):
>>>   vfio: Add vfio_file_is_group()
>>>   vfio: Hold a reference to the iommu_group in kvm for SPAPR
>>>   vfio: Make the group FD disassociate from the iommu_group
>>>
>>>  drivers/vfio/pci/vfio_pci_core.c |  2 +-
>>>  drivers/vfio/vfio.h              |  1 -
>>>  drivers/vfio/vfio_main.c         | 90 +++++++++++++++++++++-----------
>>>  include/linux/vfio.h             |  1 +
>>>  virt/kvm/vfio.c                  | 45 +++++++++++-----
>>>  5 files changed, 94 insertions(+), 45 deletions(-)
>>
>> Containers aren't getting cleaned up with this series: starting and
>> shutting down a libvirt-managed VM with vfio-pci devices, an mtty mdev
>> device, and making use of hugepages, /proc/meminfo shows the hugepages
>> are not released on VM shutdown, and I'm unable to subsequently restart
>> the VM.  Thanks,
>
> Oh, I'm surprised the s390 testing didn't hit this!!

Huh, me too, at least eventually - I think it's because we aren't pinning
everything upfront but rather on demand, so the missing type1 release /
vfio_iommu_unmap_unpin_all wouldn't be so obvious.  I definitely did
multiple VM (re)starts and hot (un)plugs.  But while my test workloads did
some I/O, the long-running one was focused on the plug/unplug scenarios to
recreate the initial issue, so the I/O (and thus pinning) done would have
been minimal.  -ap and -ccw also don't pin everything upfront (and I did
far less testing with those).  Ugh.

Moving forward, it might be worth seeing how I can work some
non-s390-specific vfio testing into my routine.

>
> This hunk should remain since not all cases are closures due to device
> hot unplug
>
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index f9cb734d3991b3..62aba3a128fb8d 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -954,6 +954,13 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>  	filep->private_data = NULL;
>
>  	mutex_lock(&group->group_lock);
> +	/*
> +	 * Device FDs hold a group file reference, therefore the group release
> +	 * is only called when there are no open devices.
> +	 */
> +	WARN_ON(group->notifier.head);
> +	if (group->container)
> +		vfio_group_detach_container(group);
>  	group->opened_file = NULL;
>  	mutex_unlock(&group->group_lock);
>  	return 0;

Anyway, FWIW, I folded this in and re-ran a brief series of -pci, -ccw and
-ap tests on s390, and things still look good.  For completeness I'll start
some longer-running pci tests next, but I expect this will still be fine as
well.
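
(Aside, mostly to check my own reading of patch 2: the shape I'd expect for
"KVM gets its own kref on the iommu_group" is kvm-vfio taking and dropping
its own reference, roughly like the sketch below.  This is not the actual
patch -- the struct layout and helper names are made up for illustration;
only iommu_group_ref_get()/iommu_group_put() are the existing iommu API.)

	#include <linux/iommu.h>
	#include <linux/list.h>

	/* Illustrative only; field names assumed, not from the series. */
	struct kvm_vfio_group {
		struct list_head node;
		struct file *file;
		struct iommu_group *iommu_group;  /* own reference, may be NULL */
	};

	static int kvm_vfio_group_take_iommu_group(struct kvm_vfio_group *kvg,
						   struct iommu_group *grp)
	{
		if (!grp)
			return -EINVAL;
		/* Bump the group's refcount so it outlives the group FD. */
		kvg->iommu_group = iommu_group_ref_get(grp);
		return 0;
	}

	static void kvm_vfio_group_release_iommu_group(struct kvm_vfio_group *kvg)
	{
		/* Drop our reference once KVM is done with the group. */
		if (kvg->iommu_group) {
			iommu_group_put(kvg->iommu_group);
			kvg->iommu_group = NULL;
		}
	}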