Re: [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

Jason Gunthorpe <jgg@xxxxxxxxxx> · Tue, 11 Apr 2023 21:01:06 -0300

On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote:

> > Management tools already need to understand dev_set if they want to
> > offer reliable reset support to the VMs. Same as today.
> 
> I don't think that's true. Our primary hot-reset use case is GPUs and
> subordinate functions, where the isolation and reset scope are often
> sufficiently similar to make hot-reset possible, regardless whether
> all the functions are assigned to a VM.  I don't think you'll find any
> management tools that takes reset scope into account otherwise.

When I think of "reliable reset support" I think of the management
tool offering a checkbox that says "ensure PCI function reset
availability" and if checked it will not launch the VM without a
working reset.

If the user configures a set of VFIO devices and then hopes they get
working reset, that is fine, but doesn't require any reporting of
reset groups, or iommu groups to the management layer to work.

> > > As I understand the proposal, QEMU now gets to attempt to
> > > claim ownership of the dev_set, so it opportunistically extends its
> > > ownership and may block other users from the affected devices.  
> > 
> > We can decide the policy for the kernel to accept a claim. I suggested
> > below "same as today" - it must hold all the groups within the
> > iommufd_ctx.
> 
> It must hold all the groups [that the user doesn't know about because
> it's not a formal part of the cdev API] within the iommufd_ctx?

You keep going back to this, but I maintain userspace doesn't
care. qemu is given a list of VFIO devices to use, all it wants to
know is if it is allowed to use reset or not. Why should it need to
know groups and group_ids to get that binary signal out of the kernel?

> > The simplest option for no-iommu is to require it to pass in every
> > device fd to the reset ioctl.
> 
> Which ironically is exactly how it ends up working today, each no-iommu
> device has a fake IOMMU group, so every affected device (group) needs
> to be provided.

Sure, that is probably the way forward for no-iommu. Not that anyone
uses it..

The kicker is we don't force the user to generate a de-duplicated list
of devices FDs, one per group, just because.

> > I want to re-focus on the basics of what cdev is supposed to be doing,
> > because several of the idea you suggested seem against this direction:
> > 
> >  - cdev does not have, and cannot rely on vfio_groups. We enforce this
> >    by compiling all the vfio_group infrastructure out. iommu_groups
> >    continue to exist.
> >    
> >    So converting a cdev to a vfio_group is not an allowed operation.
> 
> My only statements in this respect were towards the notion that IOMMU
> groups continue to exist.  I'm well aware of the desire to deprecate
> and remove vfio groups.

Yes

> >  - no-iommu should not have iommu_groups. We enforce this by compiling
> >    out all the no-iommu vfio_group infrastructure.
> 
> This is not logically inferred from the above if IOMMU groups continue
> to exist and continue to be a basis for describing DMA ownership as
> well as "reset groups"

It is not ment to flow out of the above, it is a seperate statement. I
want the iommu_group mechanism to stop being abused outside the iommu
core code. The only thing that should be creating groups is an
attached iommu driver operating under ops->device_group().

VFIO needed this to support mdev and no-iommu. We already have mdev
free of iommu_groups, I would like no-iommu to also be free of it too,
we are very close.

That would leave POWER as the only abuser of the
iommu_group_add_device() API, and it is only doing it because it
hasn't got a proper iommu driver implementation yet. It turns out
their abuse is mislocked and maybe racy to boot :(

> >  - cdev APIs should ideally not require the user to know the group_id,
> >    we should try hard to design APIs to avoid this.
> 
> This is a nuance, group_id vs group, where it's been previously
> discussed that users will need to continue to know the boundaries of a
> group for the purpose of DMA isolation and potentially IOAS
> independence should cdev/iommufd choose to tackle those topics.

Yes, group_id is a value we have no specific use for and would require
userspace to keep seperate track of. I'd prefer to rely on dev_id as
much as possible instead.

> What is the actual proposal here?

I don't know anymore, you don't seem to like this direction either...

> You've said that hot-reset works if the iommufd_ctx has
> representation from each affected group, the INFO ioctl remains as
> it is, which suggests that it's reporting group ID and BDF, yet only
> sysfs tells the user the relation between a vfio cdev and a group
> and we're trying to enable a pass-by-fd model for cdev where the
> user has no reference to a sysfs node for the device.  Show me how
> these pieces fit together.

I prefer the version where INFO2 returns the dev_id, but info can work
if we do the BDF cap like you suggested to Yi

> OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
> device has an IOMMU group

I don't desire every VFIO device to have an iommu_group. I want VFIO
devices with real IOMMU drivers to have an iommu_group. mdev and
no-iommu should not. I don't want to add them back into the design
just so INFO has a value to return.

I'd rather give no-iommu a dummy dev_id in iommufdctx then give it an
iommu_group...

I see this problem as a few basic requirements from a qemu-like
application:

 1) Does the configuration I was given support reset right now?
 2) Will the configuration I was given support reset for the duration
    of my execution?
 3) What groups of the devices I already have open does the reset
    effect?
 4) For debugging, report to the user the full list of devices in the
    reset group, in a way that relates back to sysfs.
 5) Away to trigger a reset on a group of devices

#1/#2 is the API I suggested here. Ask the kernel if the current
configuration works, and ask it to keep it working.

#3 is either INFO and a CAP for BDF or INFO2 reporting dev_id

#4 is either INFO and print the BDFs or INFO2 reporting the struct
vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

#5 is adjusting the FD list in existing RESET ioctl. Remove the need
for userspace to specify a minimal exact list of FDs means userspace
doesn't need the information to figure out what that list actually
is. Pass a 0 length list and use iommufdctx.

None of these requirements suggests to me that qemu needs to know the
group_id, or that it needs to have enough information to know how to
fix an unavailable reset.

Did I miss a requirement here?

Regards,
Jason