On 10/6/22 7:28 PM, Matthew Rosato wrote:
> On 10/6/22 6:42 PM, Jason Gunthorpe wrote:
>> On Thu, Oct 06, 2022 at 01:53:15PM -0600, Alex Williamson wrote:
>>> On Thu, 6 Oct 2022 09:40:35 -0300
>>> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>>>
>>>> Testing has shown that virtnodedevd will leave the group FD open for long
>>>> periods, even after all the cdevs have been destroyed. This blocks
>>>> destruction of the VFIO device and is undesirable.
>>>>
>>>> That approach was selected to accommodate SPAPR, which has a broken
>>>> lifecycle model for the iommu_group. However, we can accommodate SPAPR by
>>>> realizing that it doesn't use the iommu core at all, so rules about
>>>> iommu_group lifetime do not apply to it.
>>>>
>>>> Giving the KVM code its own kref on the iommu_group allows the VFIO core
>>>> code to release its iommu_group reference earlier, and we can remove the
>>>> sleep that only existed for SPAPR.
>>>>
>>>> Jason Gunthorpe (3):
>>>>   vfio: Add vfio_file_is_group()
>>>>   vfio: Hold a reference to the iommu_group in kvm for SPAPR
>>>>   vfio: Make the group FD disassociate from the iommu_group
>>>>
>>>>  drivers/vfio/pci/vfio_pci_core.c |  2 +-
>>>>  drivers/vfio/vfio.h              |  1 -
>>>>  drivers/vfio/vfio_main.c         | 90 +++++++++++++++++++++-----------
>>>>  include/linux/vfio.h             |  1 +
>>>>  virt/kvm/vfio.c                  | 45 +++++++++++-----
>>>>  5 files changed, 94 insertions(+), 45 deletions(-)
>>>
>>> Containers aren't getting cleaned up with this series. Starting and
>>> shutting down a libvirt-managed VM with vfio-pci devices, an mtty mdev
>>> device, and hugepages in use, /proc/meminfo shows the hugepages are
>>> not released on VM shutdown and I'm unable to subsequently restart
>>> the VM. Thanks,
>>
>> Oh, I'm surprised the s390 testing didn't hit this!!
>
> Huh, me too, at least eventually - I think it's because we aren't pinning
> everything upfront but rather on demand, so the missing type1 release /
> vfio_iommu_unmap_unpin_all wouldn't be so obvious. I definitely did
> multiple VM (re)starts and hot (un)plugs. But while my test workloads did
> some I/O, the long-running one was focused on the plug/unplug scenarios
> to recreate the initial issue, so the I/O (and thus pinning) done would
> have been minimal.
>
> -ap and -ccw also don't pin everything upfront (and I did far less
> testing with those).
>
> Ugh. Moving forward, it might be worth seeing how I can work some
> non-s390-specific vfio testing into my routine.
>
>>
>> This hunk should remain since not all closures are due to device
>> hot unplug:
>>
>> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
>> index f9cb734d3991b3..62aba3a128fb8d 100644
>> --- a/drivers/vfio/vfio_main.c
>> +++ b/drivers/vfio/vfio_main.c
>> @@ -954,6 +954,13 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>>  	filep->private_data = NULL;
>>  
>>  	mutex_lock(&group->group_lock);
>> +	/*
>> +	 * Device FDs hold a group file reference, therefore the group release
>> +	 * is only called when there are no open devices.
>> +	 */
>> +	WARN_ON(group->notifier.head);
>> +	if (group->container)
>> +		vfio_group_detach_container(group);
>>  	group->opened_file = NULL;
>>  	mutex_unlock(&group->group_lock);
>>  	return 0;
>
> Anyway, FWIW, I folded this in and re-ran a brief series of -pci, -ccw
> and -ap tests on s390 and things still look good. For completeness I'll
> start some longer-running pci tests next, but I expect those will be
> fine as well.

Further s390 vfio-pci testing still looks good with this change too.
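
For anyone following along, the kref idea in patch 2 is just that KVM pins
the iommu_group itself instead of relying on the group FD to keep it
alive. Below is a minimal sketch of that shape -- the helper names and
error handling are illustrative only (assuming the kvm_vfio_group struct
in virt/kvm/vfio.c), and the actual patch may wire it up differently:

#include <linux/iommu.h>
#include <linux/vfio.h>
#include <linux/fs.h>
#include <linux/list.h>

/*
 * Hypothetical sketch only: KVM takes its own reference on the
 * iommu_group so the SPAPR TCE code can keep using it after the VFIO
 * core has dropped its reference. Not the actual patch.
 */
struct kvm_vfio_group {
	struct list_head node;
	struct file *file;
#ifdef CONFIG_SPAPR_TCE_IOMMU
	struct iommu_group *iommu_group;	/* KVM's own reference */
#endif
};

static int kvm_vfio_group_get_iommu_group(struct kvm_vfio_group *kvg)
{
#ifdef CONFIG_SPAPR_TCE_IOMMU
	struct iommu_group *grp = vfio_file_iommu_group(kvg->file);

	if (WARN_ON(!grp))
		return -EINVAL;
	/* Pin the group independently of the VFIO group FD lifetime */
	kvg->iommu_group = iommu_group_ref_get(grp);
#endif
	return 0;
}

static void kvm_vfio_group_put_iommu_group(struct kvm_vfio_group *kvg)
{
#ifdef CONFIG_SPAPR_TCE_IOMMU
	iommu_group_put(kvg->iommu_group);	/* drop KVM's reference */
#endif
}

Once KVM holds its own reference, vfio_group_fops_release() no longer has
to keep the iommu_group alive on behalf of SPAPR, which is what lets the
group FD disassociate from the iommu_group earlier.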