Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl

Alex Williamson <alex.williamson@xxxxxxxxxx> · Wed, 17 Apr 2024 14:25:52 -0600

On Wed, 17 Apr 2024 07:09:52 +0000
"Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:

> > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > Sent: Wednesday, April 17, 2024 1:57 AM
> > 
> > On Fri, 12 Apr 2024 01:21:21 -0700
> > Yi Liu <yi.l.liu@xxxxxxxxx> wrote:
> >   
> > > + */
> > > +struct vfio_device_feature_pasid {
> > > +	__u16 capabilities;
> > > +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
> > > +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
> > > +	__u8 width;
> > > +	__u8 __reserved;
> > > +};  
> > 
> > Building on Kevin's comment on the cover letter, if we could describe
> > an offset for emulating a PASID capability, this seems like the place
> > we'd do it.  I think we're not doing that because we'd like an in-band
> > mechanism for a device to report unused config space, such as a DVSEC
> > capability, so that it can be implemented on a physical device.  As
> > noted in the commit log here, we'd also prefer not to bloat the kernel
> > with more device quirks.
> > 
> > In an ideal world we might be able to jump start support of that DVSEC
> > option by emulating the DVSEC capability on top of the PASID capability
> > for PFs, but unfortunately the PASID capability is 8 bytes while the
> > DVSEC capability is at least 12 bytes, so we can't implement that
> > generically either.  
> 
> Yeah, that's a problem.
> 
> > 
> > I don't know there's any good solution here or whether there's actually
> > any value to the PASID capability on a PF, but do we need to consider
> > leaving a field+flag here to describe the offset for that scenario?  
> 
> Yes, I prefer to this way.
> 
> > Would we then allow variant drivers to take advantage of it?  Does this
> > then turn into the quirk that we're trying to avoid in the kernel
> > rather than userspace and is that a problem?  Thanks,
> >   
> 
> We don't want to proactively pursue quirks in the kernel.
> 
> But if a variant driver exists for other reasons, I don't see why it 
> should be prohibited from deciding an offset to ease the
> userspace. 😊

At that point we've turned the corner into an arbitrary policy decision
that I can't defend.  A "worthy" variant driver can implement something
through a side channel vfio API, but implementing that side channel
itself is not enough to justify a variant driver?  It doesn't make
sense.

Further, if we have a variant driver, why do we need a side channel for
the purpose of describing available config space when we expect devices
themselves to eventually describe the same through a DVSEC capability?
The purpose of enabling variant drivers is to enhance the functionality
of the device.  Adding an emulated DVSEC capability seems like a valid
enhancement to justify a variant driver to me.

So the more I think about it, it would be easy to add something here
that hints a location for an emulated PASID capability in the VMM, but
it would also be counterproductive to an end goal of having a DVSEC
capability that describes unused config space.  The very narrow scope
where that side-band channel would be useful is an unknown PF device
which doesn't implement a DVSEC capability and without intervention
simply behaves as it always has, without PASID support.

A vendor desiring such support can a) implement DVSEC in the hardware,
b) implement a variant driver emulating a DVSEC capability, or c)
directly modify the VMM to tell it where to place the PASID capability.
I also don't think we should exclude the possibility that b) could turn
into a shared variant driver that knows about multiple devices and has
a table of free config space for each.  Option c) is only the last
resort if there's not already 12 bytes of contiguous, aligned free
space to place a DVSEC capability.  That seems unlikely.

At some point we need to define the format and use of this DVSEC.  Do
we allow (not require) one at every gap in config space that's at least
12-bytes long and adjust the DVSEC Length to describe longer gaps, or do
we use a single DVSEC to describe a table of ranges throughout extended
(maybe even conventional) config space?  The former seems easier,
especially if we expect a device has a large block of free space,
enough for multiple emulated capabilities and described by a single
DVSEC.  Thanks,

Alex