RE: [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

"Liu, Yi L" <yi.l.liu@xxxxxxxxx> · Thu, 6 Apr 2023 10:02:10 +0000

> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, April 6, 2023 7:23 AM
> 
> On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:
> 
> > > > QEMU can make a policy decision today because the kernel provides a
> > > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > > hot-reset is all but guaranteed to work.
> > >
> > > And we don't change that with cdev. If qemu wants to make the policy
> > > decision it keeps using the exact same _INFO interface to make that
> > > decision same it has always made.
> > >
> > > We weaken the actual reset action to only consider the security side.
> > >
> > > Applications that want this exclusive reset group policy simply must
> > > check it on their own. It is a reasonable API design.
> >
> > I disagree, as I've argued before, the info ioctl becomes so weak and
> > effectively arbitrary from a user perspective at being able to predict
> > whether the hot-reset ioctl works that it becomes useless, diminishing
> > the entire hot-reset info/execute API.
> 
> reset should be strictly more permissive than INFO. If INFO predicts
> reset is permitted then reset should succeed.
> 
> We don't change INFO so it cannot "becomes so weak"  ??
> 
> We don't care about the cases where INFO says it will not succeed but
> reset does (temporarily) succeed.
> 
> I don't get what argument you are trying to make or what you think is
> diminished..
> 
> Again, userspace calls INFO, if info says yes then reset *always
> works*, exactly just like today.
>
> Userspace will call reset with a 0 length FD list and it uses a
> security only check that is strictly more permissive than what
> get_info will return. So the new check is simple in the kernel and
> always works in the cases we need it to work.
> 
> What is getting things into trouble is insisting that RESET have
> additional restrictions beyond the minimum checks required for
> security.
> 
> > > I don't view it as a loophole, it is flexability to use the API in a
> > > way that is different from what qemu wants - eg an app like dpdk may
> > > be willing to tolerate a reset group that becomes unavailable after
> > > startup. Who knows, why should we force this in the kernel?
> >
> > Because look at all the problems it's causing to try to introduce these
> > loopholes without also introducing subtle bugs.
> 
> These problems are coming from tring to do this integrated version,
> not from my approach!
> 
> AFAICT there was nothing wrong with my original plan of using the
> empty fd list for reset. What Yi has here is some mashup of what you
> and I both suggested.

Hi Alex, Jason,

could be this reason. So let me try to gather the changes of this series
does and the impact as far as I know.

1) only check the ownership of opened devices in the dev_set
     in HOT_RESET ioctl.
     - Impact: it changes the relationship between _INFO  and HOT_RESET.
       As " Each group must have IOMMU protection established for the
       ioctl to succeed." in [1], existing design actually means userspace
       should own all the affected groups before heading to do HOT_RESET.
       With the change here, the user does not need to ensure all affected
       groups are opened and it can do hot-reset successfully as long as the
       devices in the affected group are just un-opened and can be reset.

       [1] https://patchwork.kernel.org/project/linux-pci/patch/20130814200845.21923.64284.stgit@xxxxxxxxxx/

2) Allow passing zero-length fd array to do hot reset
    - Impact: this uses the iommufd as ownership check in the kernel side.
      It is only supposed to be used by the users that open cdev instead of
      users that open group. The drawback is that it cannot cover the noiommu
      devices as noiommu does not use iommufd at all. But it works well for
      most cases.

3) Allow hot reset be successful when the dev_set is singleton
     - Impact: this makes sense but it seems to mess up the boundary between
     the group path and cdev path w.r.t. the usage of zero-length fd approach.
     The group path can succeed to do hot reset even if it is passing an empty
     fd array if the dev_set happens to be singleton.

4) Allow passing device fd to do hot reset
    - Impact: this is a new way for hot reset. should have no impact.

5) Extend the _INFO to report devid
    - Impact: this changes the way user to decode the info reported back.
    devid and groupid are returned per the way the queried device is opened.
    Since it was suggested to support the scenario in which some devices
    are opened via cdev while some devices are opened via group. This makes
    us to return invalid_devid for the device that is opened via group if
    it is affected by the hot reset of a device that is opened via cdev.

    This was proposed to support the future device fd passing usage which is
    only available in cdev path.

To me the major confusion is from 1) and 3). 1) changes the meaning of
_INFO and HOT_RESET, while 3) messes up the boundary.

Here is my thought:

For 1), it was proposed due to below reason[2]. We'd like to make a scenario
that works in the group path be workable in cdev path as well. But IMHO, we
may just accept that cdev path cannot work for such scenario to avoid sublte
change to uapi. Otherwise, we need to have another HOT_RESET ioctl or a
hint in HOT_RESET ioctl to tell the kernel  whether relaxed ownership check
is expected. Maybe this is awkward. But if we want to keep it, we'd do it
with the awareness by user.

[2] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@xxxxxxxxxx/

For 3), it was proposed when discussing the hot reset for noiommu[3]. But
it does not make hot reset always workable for noiommu in cdev, just in
case dev_set is singleton. So it is more of a general optimization that can
make the kernel skip the ownership check. But to make use of it, we may
need to test it before sanitizing the group fds from user or the iommufd
check. Maybe the dev_set singleton test in this series is not well placed.
If so, I can further modify it.

[3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@xxxxxxxxxx/

Regards,
Yi Liu

> 
> > > Remember the reason we started doing this is because we don't
> > > have easy access to the BDF anymore.
> >
> > We don't need it, the info ioctl provides the groups, the group
> > association can be learned from the DEVICE_GET_INFO ioctl, the
> > hot-reset ioctl only requires a single representative fd per affected
> > group.  dev-ids not required.
> 
> I'm not talking about triggering the ioctl.
> 
> I'm talking about whatever else qemu needs to do so that the VM is
> aware of the reset groups device-by-device on it's side so nested VFIO
> in the VM reflects the same data as the hypervisor. Maybe it doesn't
> do this right now, but the kernel API should continue to provide the
> data.
> 
> > > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > > return the dev_ids.
> >
> > I don't see any justification for this.  We could add another PCI
> > specific DEVICE_GET_INFO capability to report the bdf if we really need
> > it, but reporting the group seems sufficient for this use case.
> 
> What I imagine is a single new ioctl 'get reset group 2' or something.
> It returns a list of dev_ids in the reset group. It has an output flag
> if the reset is reliable. This is the only ioctl user space needs to
> call.
> 
> The reliable test is done by simply calling the ioctl and throwing
> away the dev ids. The mapping of the VM's reset groups is done by
> processing the dev_ids to vRIDs and flowing that into the VM somehow.
> 
> We don't expose group_ids, and we don't expose BDF. It is much simpler
> and cleaner to use.
> 
> A BDF DEVICE_GET_INFO and the existing reset INFO will encode the same
> data too, it is just not as elegant and requires userspace to do a lot
> more work to keep track of the 3 different identifiers.
> 
> > > This looks like a very complex uapi compared to the empty list option,
> > > but it seems like it would work.
> >
> > It's the same API that we have now.  What's complex is trying to figure
> > out all the subtle side-effects from the loopholes that are being
> > proposed in this series.  Thanks,
> 
> I might agree with you if we weren't now going backwards -
> ideas didn't work out and Yi has to throw stuff away. :(
> 
> Jason