Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach

Alex Williamson <alex.williamson@xxxxxxxxxx> · Thu, 18 Apr 2024 14:37:47 -0600

On Thu, 18 Apr 2024 17:03:15 +0800
Yi Liu <yi.l.liu@xxxxxxxxx> wrote:

> On 2024/4/18 08:06, Tian, Kevin wrote:
> >> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> >> Sent: Thursday, April 18, 2024 7:02 AM
> >>
> >> On Wed, 17 Apr 2024 09:20:51 -0300
> >> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >>  
> >>> On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:  
> >>>>> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> >>>>> Sent: Wednesday, April 17, 2024 1:50 AM
> >>>>>
> >>>>> On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:  
> >>>>>>> From: Liu, Yi L <yi.l.liu@xxxxxxxxx>
> >>>>>>> Sent: Friday, April 12, 2024 4:21 PM
> >>>>>>>
> >>>>>>> A userspace VMM is supposed to get the details of the device's  
> >> PASID  
> >>>>>>> capability
> >>>>>>> and assemble a virtual PASID capability in a proper offset in the  
> >> virtual  
> >>>>> PCI  
> >>>>>>> configuration space. While it is still an open on how to get the  
> >> available  
> >>>>>>> offsets. Devices may have hidden bits that are not in the PCI cap  
> >> chain.  
> >>>>> For  
> >>>>>>> now, there are two options to get the available offsets.[2]
> >>>>>>>
> >>>>>>> - Report the available offsets via ioctl. This requires device-specific  
> >> logic  
> >>>>>>>    to provide available offsets. e.g., vfio-pci variant driver. Or may the  
> >>>>> device  
> >>>>>>>    provide the available offset by DVSEC.
> >>>>>>> - Store the available offsets in a static table in userspace VMM.  
> >> VMM gets  
> >>>>> the  
> >>>>>>>    empty offsets from this table.
> >>>>>>>  
> >>>>>>
> >>>>>> I'm not a fan of requesting a variant driver for every PASID-capable
> >>>>>> VF just for the purpose of reporting a free range in the PCI config  
> >> space.  
> >>>>>>
> >>>>>> It's easier to do that quirk in userspace.
> >>>>>>
> >>>>>> But I like Alex's original comment that at least for PF there is no  
> >> reason  
> >>>>>> to hide the offset. there could be a flag+field to communicate it. or
> >>>>>> if there will be a new variant VF driver for other purposes e.g.  
> >> migration  
> >>>>>> it can certainly fill the field too.  
> >>>>>
> >>>>> Yes, since this has been such a sticking point can we get a clean
> >>>>> series that just enables it for PF and then come with a solution for
> >>>>> VF?
> >>>>>  
> >>>>
> >>>> sure but we at least need to reach consensus on a minimal required
> >>>> uapi covering both PF/VF to move forward so the user doesn't need
> >>>> to touch different contracts for PF vs. VF.  
> >>>
> >>> Do we? The situation where the VMM needs to wholly make a up a PASID
> >>> capability seems completely new and seperate from just using an
> >>> existing PASID capability as in the PF case.  
> >>
> >> But we don't actually expose the PASID capability on the PF and as
> >> argued in path 4/ we can't because it would break existing userspace.
> > > Come back to this statement.  
> > 
> > Does 'break' means that legacy Qemu will crash due to a guest write
> > to the read-only PASID capability, or just a conceptually functional
> > break i.e. non-faithful emulation due to writes being dropped?

I expect more the latter.

> > If the latter it's probably not a bad idea to allow exposing the PASID
> > capability on the PF as a sane guest shouldn't enable the PASID
> > capability w/o seeing vIOMMU supporting PASID. And there is no
> > status bit defined in the PASID capability to check back so even
> > if an insane guest wants to blindly enable PASID it will naturally
> > write and done. The only niche case is that the enable bits are
> > defined as RW so ideally reading back those bits should get the
> > latest written value. But probably this can be tolerated?

Some degree of inconsistency is likely tolerated, the guest is unlikely
to check that a RW bit was set or cleared.  How would we virtualize the
control registers for a VF and are they similarly virtualized for a PF
or would we allow the guest to manipulate the physical PASID control
registers?

> > With that then should we consider exposing the PASID capability
> > in PCI config space as the first option? For PF it's simple as how
> > other caps are exposed. For VF a variant driver can also fake the
> > PASID capability or emulate a DVSEC capability for unused space
> > (to motivate the physical implementation so no variant driver is
> > required in the future)  
> 
> If kernel exposes pasid cap for PF same as other caps, and in the meantime
> the variant driver chooses to emulate a DVSEC cap, then userspace follows
> the below steps to expose pasid cap to VM.

If we have a variant driver, why wouldn't it expose an emulated PASID
capability rather than a DVSEC if we're choosing to expose PASID for
PFs?

> 
> 1) Check if a pasid cap is already present in the virtual config space
>     read from kernel. If no, but user wants pasid, then goto step 2).
> 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
>     pasid cap. If yes, goto step 3).

Why do we need the vfio feature interface if a physical or virtual PASID
capability on the device exposes the same info?

> 3) Userspace gets an available offset via reading the DVSEC cap.

What's the scenario where we'd have a VF wanting to expose PASID
support which doesn't also have a variant driver that could implement a
virtual PASID?

> 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> 
> For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> This is a bit different from what we planned at the beginning. But sounds
> doable if we want to pursue the staging direction.

Seems like if we decide that we can just expose the PASID capability
for a PF then we should just have any VF variant drivers also implement
a virtual PASID capability.  In this case DVSEC would only be used to
provide information for a purely userspace emulation of PASID (in which
case it also wouldn't necessarily need the vfio feature because it
might implicitly know the PASID capabilities of the device).  Thanks,

Alex