Re: [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

Alex Williamson <alex.williamson@xxxxxxxxxx> · Thu, 6 Apr 2023 11:07:36 -0600

On Thu, 6 Apr 2023 06:34:08 +0000
"Liu, Yi L" <yi.l.liu@xxxxxxxxx> wrote:

> Hi Alex,
> 
> > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > Sent: Thursday, April 6, 2023 3:50 AM
> > 
> > On Wed, 5 Apr 2023 16:21:09 -0300
> > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >   
> > > On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:  
> > > > Usability needs to be a consideration as well.  An interface where the
> > > > result is effectively arbitrary from a user perspective because the
> > > > kernel is solely focused on whether the operation is allowed,
> > > > evaluating constraints that the user is unaware of and cannot control,
> > > > is unusable.  
> > >
> > > Considering this API is only invoked by qemu we might be overdoing
> > > this usability and 'no shoot in foot' view.  
> > 
> > Ok, I'm not sure why we're diminishing the de facto vfio userspace...
> >   
> > > > > This is a good point that qemu needs to make a policy decision if it
> > > > > is happy about the VFIO configuration - but that is a policy decision
> > > > > that should not become entangled with the kernel's security checks.
> > > > >
> > > > > Today qemu can make this policy choice the same way it does right now
> > > > > - call _INFO and check the group_ids. It gets the exact same outcome
> > > > > as today. We already discussed that we need to expose the group ID
> > > > > through an ioctl someplace.  
> > > >
> > > > QEMU can make a policy decision today because the kernel provides a
> > > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > > hot-reset is all but guaranteed to work.  
> > >
> > > And we don't change that with cdev. If qemu wants to make the policy
> > > decision it keeps using the exact same _INFO interface to make that
> > > decision same it has always made.
> > >
> > > We weaken the actual reset action to only consider the security side.
> > >
> > > Applications that want this exclusive reset group policy simply must
> > > check it on their own. It is a reasonable API design.  
> > 
> > I disagree, as I've argued before, the info ioctl becomes so weak and
> > effectively arbitrary from a user perspective at being able to predict
> > whether the hot-reset ioctl works that it becomes useless, diminishing
> > the entire hot-reset info/execute API.
> >   
> > > > > If this is too awkward we could add a query to the kernel if the cdev
> > > > > is "reset exclusive" - eg the iommufd covers all the groups that span
> > > > > the reset set.  
> > > >
> > > > That's essentially what we have if there are valid dev-ids for each
> > > > affected device in the info ioctl.  
> > >
> > > If you have dev-ids for everything, yes. If you don't, then you can't
> > > make the same policy choice using a dev-id interface.  
> > 
> > Exactly, you can't make any policy choice because the success or
> > failure of the hot-reset ioctl can't be known.  
> 
> could you elaborate a bit about what the policy is here. As far as I know,
> QEMU makes use of the information reported by _INFO to check:
> - if all the affected groups are owned by the current QEMU[1]
> - if the affected devices are opened by the current QEMU, if yes, QEMU
>   needs to use vfio_pci_pre_reset() to do preparation before issuing
>   hot rest[1]
> 
> [1] vfio_pci_hot_reset() in https://github.com/qemu/qemu/blob/master/hw/vfio/pci.c

Regarding the policy decisions, look for instance at the distinction
between vfio_pci_hot_reset_one() vs vfio_pci_hot_reset_multi(), or the
way QEMU will opt for a bus reset if it believes only a PM reset is
available.

In my proposal, I did miss that if _INFO reports the group and bdf that
allows QEMU to associate fd passed devices to a group affected by the
reset, but not specifically whether the device is affected by the
reset.  I think that would be justification for capabilities on the
DEVICE_GET_INFO ioctl to report both the group and PCI address as
separate capabilities.

> > > > I don't think it helps the user experience to create loopholes where
> > > > the hot-reset ioctl can still work in spite of those missing
> > > > devices.  
> > >
> > > I disagree. The easy straightforward design is that the reset ioctl
> > > works if the process has security permissions. Mixing a policy check
> > > into the kernel on this path is creating complexity we don't really
> > > need.
> > >
> > > I don't view it as a loophole, it is flexability to use the API in a
> > > way that is different from what qemu wants - eg an app like dpdk may
> > > be willing to tolerate a reset group that becomes unavailable after
> > > startup. Who knows, why should we force this in the kernel?  
> > 
> > Because look at all the problems it's causing to try to introduce these
> > loopholes without also introducing subtle bugs.  There's an argument
> > that we're overly strict, which is better than the alternative, which
> > seems to be what we're dabbling with.  It is a straightforward
> > interface for the hot-reset ioctl to mirror the information provided
> > via the hot-reset info ioctl.
> >   
> > > > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> > > > capability chains, we could add a capability that reports the group ID
> > > > for the device.  
> > >
> > > I was going to put that in an iommufd ioctl so it works with VDPA too,
> > > but sure, lets assume we can get the group ID from a cdev fd.
> > >  
> > > > The hot-reset info ioctl remains as it is today, reporting group-ids
> > > > and bdfs.  
> > >
> > > Sure, but userspace still needs to know how to map the reset sets into
> > > dev-ids.  
> > 
> > No, it doesn't.
> >   
> > > Remember the reason we started doing this is because we don't
> > > have easy access to the BDF anymore.  
> > 
> > We don't need it, the info ioctl provides the groups, the group
> > association can be learned from the DEVICE_GET_INFO ioctl, the
> > hot-reset ioctl only requires a single representative fd per affected
> > group.  dev-ids not required.
> >   
> > > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > > return the dev_ids.  
> > 
> > I don't see any justification for this.  We could add another PCI
> > specific DEVICE_GET_INFO capability to report the bdf if we really need
> > it, but reporting the group seems sufficient for this use case.  
> 
> IMHO, the knowledge of group may be not enough. Take QEMU as an example.
> QEMU not only needs to ensure the group is owned by it, it also needs to
> do preparation on the devices that are already in use and affected by
> the hot reset on a new opened device. If there is only group knowledge,
> QEMU may blindly prepares all the devices that are already opened and
> belong to the same iommu group. But as I got in the discussion iommu
> group is not equal to hot reset scope (a.k.a. dev_set). is it? It is
> possible that devices in an iommu_group may span into multiple hot
> reset scope. For such case, get bdf info from cdev fd is necessary.

Yes, you're correct, group and reset scope are not equivalent, so we'd
require a means to get both the group and the bdf for the device.
Knowing the bdf allows the user to know which opened devices are
directly affected by the reset, knowing the group allows the user to
know if ancillary affected devices are within the set of groups the
user owns and therefore effectively under their purview.  Thanks,

Alex