Re: [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

Jason Gunthorpe <jgg@xxxxxxxxxx> · Wed, 5 Apr 2023 20:22:40 -0300

On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:

> > > QEMU can make a policy decision today because the kernel provides a
> > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > hot-reset is all but guaranteed to work.    
> > 
> > And we don't change that with cdev. If qemu wants to make the policy
> > decision it keeps using the exact same _INFO interface to make that
> > decision same it has always made.
> > 
> > We weaken the actual reset action to only consider the security side.
> > 
> > Applications that want this exclusive reset group policy simply must
> > check it on their own. It is a reasonable API design.
> 
> I disagree, as I've argued before, the info ioctl becomes so weak and
> effectively arbitrary from a user perspective at being able to predict
> whether the hot-reset ioctl works that it becomes useless, diminishing
> the entire hot-reset info/execute API.

reset should be strictly more permissive than INFO. If INFO predicts
reset is permitted then reset should succeed.

We don't change INFO so it cannot "becomes so weak"  ??

We don't care about the cases where INFO says it will not succeed but
reset does (temporarily) succeed.

I don't get what argument you are trying to make or what you think is
diminished..

Again, userspace calls INFO, if info says yes then reset *always
works*, exactly just like today.

Userspace will call reset with a 0 length FD list and it uses a
security only check that is strictly more permissive than what
get_info will return. So the new check is simple in the kernel and
always works in the cases we need it to work.

What is getting things into trouble is insisting that RESET have
additional restrictions beyond the minimum checks required for
security.

> > I don't view it as a loophole, it is flexability to use the API in a
> > way that is different from what qemu wants - eg an app like dpdk may
> > be willing to tolerate a reset group that becomes unavailable after
> > startup. Who knows, why should we force this in the kernel?
> 
> Because look at all the problems it's causing to try to introduce these
> loopholes without also introducing subtle bugs.

These problems are coming from tring to do this integrated version,
not from my approach!

AFAICT there was nothing wrong with my original plan of using the
empty fd list for reset. What Yi has here is some mashup of what you
and I both suggested.

> > Remember the reason we started doing this is because we don't
> > have easy access to the BDF anymore.
> 
> We don't need it, the info ioctl provides the groups, the group
> association can be learned from the DEVICE_GET_INFO ioctl, the
> hot-reset ioctl only requires a single representative fd per affected
> group.  dev-ids not required.

I'm not talking about triggering the ioctl.

I'm talking about whatever else qemu needs to do so that the VM is
aware of the reset groups device-by-device on it's side so nested VFIO
in the VM reflects the same data as the hypervisor. Maybe it doesn't
do this right now, but the kernel API should continue to provide the
data.

> > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > return the dev_ids.
> 
> I don't see any justification for this.  We could add another PCI
> specific DEVICE_GET_INFO capability to report the bdf if we really need
> it, but reporting the group seems sufficient for this use case.

What I imagine is a single new ioctl 'get reset group 2' or something.
It returns a list of dev_ids in the reset group. It has an output flag
if the reset is reliable. This is the only ioctl user space needs to
call.

The reliable test is done by simply calling the ioctl and throwing
away the dev ids. The mapping of the VM's reset groups is done by
processing the dev_ids to vRIDs and flowing that into the VM somehow.

We don't expose group_ids, and we don't expose BDF. It is much simpler
and cleaner to use.

A BDF DEVICE_GET_INFO and the existing reset INFO will encode the same
data too, it is just not as elegant and requires userspace to do a lot
more work to keep track of the 3 different identifiers.

> > This looks like a very complex uapi compared to the empty list option,
> > but it seems like it would work.
>
> It's the same API that we have now.  What's complex is trying to figure
> out all the subtle side-effects from the loopholes that are being
> proposed in this series.  Thanks,

I might agree with you if we weren't now going backwards - 
ideas didn't work out and Yi has to throw stuff away. :(

Jason