Re: [PATCH v2 3/3] vfio: Make the group FD disassociate from the iommu_group

On Thu, May 11, 2023 at 04:53:42PM -0600, Alex Williamson wrote:
> On Fri,  7 Oct 2022 11:04:41 -0300
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> 
> > Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
> > the pointer is NULL the vfio_group users promise not to touch the
> > iommu_group. This allows a driver to be hot unplugged while userspace is
> > keeping the group FD open.
> > 
> > Remove all the code waiting for the group FD to close.
> > 
> > This fixes a userspace regression where we learned that virtnodedevd
> > leaves a group FD open even though the /dev/ node for it has been deleted
> > and all the drivers for it unplugged.
> > 
> > Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
> > Reported-by: Christian Borntraeger <borntraeger@xxxxxxxxxxxxx>
> > Tested-by: Matthew Rosato <mjrosato@xxxxxxxxxxxxx>
> > Tested-by: Christian Borntraeger <borntraeger@xxxxxxxxxx>
> > Tested-by: Eric Farman <farman@xxxxxxxxxxxxx>
> > Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > ---
> >  drivers/vfio/vfio.h      |  1 -
> >  drivers/vfio/vfio_main.c | 67 ++++++++++++++++++++++++++--------------
> >  2 files changed, 44 insertions(+), 24 deletions(-)
> 
> I'm not sure we're out of the woods on this one.  QE found a regression
> when unbinding a device assigned to a QEMU VM resulting in errors from
> VFIO_UNMAP_DMA and VFIO_GROUP_UNSET_CONTAINER.
> 
> When finalizing the vfio object in QEMU, it will first release the
> device and close the device fd before iterating the container address
> space to do unmaps and finally unset the container for the group.
> 
> Meanwhile the vfio-pci remove callback is sitting in
> vfio_device_put_registration() waiting for the device completion.  Once
> that occurs, it enters vfio_device_remove_group() where this patch
> removed the open file barrier that we can't have and also detaches the
> group from the container, destroying the container.  The unmaps from
> userspace were always redundant at this point since removing the last
> device from a container removes the mappings and de-privileges the
> container, but unmaps of nonexistent mappings didn't fail, nor did the
> unset container operations.

So QEMU did this threaded, which is why it didn't hang before?

> None of these are hard failures for QEMU, the regression is that we're
> logging new errors due to unintended changes to the API.  Do we need to
> gut the container but still keep the group-container association?

If it isn't creating a functional problem, can we approach the logging
by doing something in QEMU? I.e., not issuing the unmaps after it has
closed all the device FDs?

Otherwise I'd suggest that the container's mappings should not be torn
down when the last device is closed?

Or turn unmap into a NOP on a container with no devices?

Jason


