Re: [RFC] /dev/ioasid uAPI proposal

Alex Williamson <alex.williamson@xxxxxxxxxx> · Fri, 4 Jun 2021 15:29:18 -0600

On Fri, 4 Jun 2021 14:22:07 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > On 04/06/21 18:03, Jason Gunthorpe wrote:  
> > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:  
> > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > 
> > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > to gain extra privilege.  
> > > 
> > > Okay, fine, let's turn the question on its head then.
> > > 
> > > VFIO should provide an IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > application can make use of no-snoop optimizations. The ability of KVM
> > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > in a normal process context.
> > > 
> > > So, under what conditions do we want to allow VFIO to give a process
> > > elevated access to the CPU:  
> > 
> > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > which then would be on VFIO and not on KVM.    
> 
> At the end of the day we need an ioctl with two arguments:
>  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
>  - The KVM FD to control wbinvd support on
> 
> Philosophically it doesn't matter too much which subsystem that ioctl
> lives in, but we have these obnoxious cross module dependencies to
> consider.. 
> 
> Framing the question, as you have, to be about the process, I think
> explains why KVM doesn't really care what is decided, so long as the
> process and the VM have equivalent rights.
> 
> Alex, how about a more fleshed out suggestion:
> 
>  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
>     it communicates its no-snoop configuration:

Communicates to whom?

>      - 0 enable, allow WBINVD
>      - 1 automatic disable, block WBINVD if the platform
>        IOMMU can police it (what we do today)
>      - 2 force disable, do not allow WBINVD ever

The only thing we know about the device is whether or not Enable
No-snoop is hard wired to zero, ie. it either can't generate no-snoop
TLPs ("coherent-only") or it might ("assumed non-coherent").  If
we're putting the policy decision in the hands of userspace they should
have access to wbinvd if they own a device that is assumed
non-coherent AND it's attached to an IOMMU (page table) that is not
blocking no-snoop (a "non-coherent IOASID").

I think that means that the IOASID needs to be created (IOASID_ALLOC)
with a flag that specifies whether this address space is coherent
(IOASID_GET_INFO probably needs a flag/cap to expose if the system
supports this).  All mappings in this IOASID would use IOMMU_CACHE and
devices attached to it would be required to be backed by an IOMMU
capable of IOMMU_CAP_CACHE_COHERENCY (attach fails otherwise).  If only
these IOASIDs exist, access to wbinvd would not be provided.  (How does
a user provided page table work? - reserved bit set, user error?)

Conversely, a user could create a non-coherent IOASID and attach any
device to it, regardless of IOMMU backing capabilities.  Only if an
assumed non-coherent device is attached would the wbinvd be allowed.

I think that means that an EXECUTE_WBINVD ioctl lives on the IOASIDFD
and the IOASID world needs to understand the device's ability to
generate non-coherent DMA.  This wbinvd ioctl would be a no-op (or
some known errno) unless a non-coherent IOASID exists with a potentially
non-coherent device attached.
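
A minimal sketch of the rules in the paragraphs above, written as plain
userspace C so the logic can be checked in isolation.  Every name here
(sketch_device, sketch_ioasid, sketch_attach, sketch_wbinvd_allowed) is
hypothetical and is not part of any proposed uAPI; the fixed-size device
array is only there to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_DEVS 8

/* Hypothetical model of what is known about an assigned device. */
struct sketch_device {
	/* Enable No-snoop hard wired to zero: "coherent-only" device. */
	bool nosnoop_hardwired_zero;
	/* Backing IOMMU is capable of IOMMU_CAP_CACHE_COHERENCY. */
	bool iommu_cache_coherency;
};

/* Hypothetical model of an IOASID; coherency is fixed at allocation. */
struct sketch_ioasid {
	bool coherent;
	const struct sketch_device *devs[SKETCH_MAX_DEVS];
	size_t ndevs;
};

/*
 * Attach a device.  A coherent IOASID refuses devices whose IOMMU
 * cannot enforce coherency; a non-coherent IOASID accepts any device.
 * Returns 0 on success, -1 on failure.
 */
int sketch_attach(struct sketch_ioasid *ioasid,
		  const struct sketch_device *dev)
{
	if (ioasid->ndevs >= SKETCH_MAX_DEVS)
		return -1;
	if (ioasid->coherent && !dev->iommu_cache_coherency)
		return -1;
	ioasid->devs[ioasid->ndevs++] = dev;
	return 0;
}

/*
 * wbinvd is meaningful only if some non-coherent IOASID has an
 * assumed non-coherent device attached; otherwise it is a no-op
 * (or a known errno).
 */
bool sketch_wbinvd_allowed(const struct sketch_ioasid *ioasids, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (ioasids[i].coherent)
			continue;
		for (size_t j = 0; j < ioasids[i].ndevs; j++) {
			if (!ioasids[i].devs[j]->nosnoop_hardwired_zero)
				return true;
		}
	}
	return false;
}
```

Under this sketch a coherent-only device attached to a non-coherent
IOASID does not by itself unlock wbinvd; only the combination of a
non-coherent IOASID and an assumed non-coherent device does.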

>     vfio_pci may want to take this from an admin configuration knob
>     someplace. It allows the admin to customize if they want.
> 
>     If we can figure out a way to autodetect 2 from vfio_pci, all the
>     better
> 
>  2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace
>     to access wbinvd so it can make use of the no snoop optimization.
> 
>     wbinvd is allowed when:
>       - A device is joined with mode #0
>       - A device is joined with mode #1 and the IOMMU cannot block
>         no-snoop (today)
> 
>  3) The IOASID's don't care about this at all. If IOMMU_EXECUTE_WBINVD
>     is blocked and userspace doesn't request to block no-snoop in the
>     IOASID then it is a userspace error.

In my model above, the IOASID is central to this.

>  4) The KVM interface is the very simple enable/disable WBINVD.
>     Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
>     to enable WBINVD at KVM.
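
For comparison, the enable/automatic/force-disable rule quoted in points
1) and 2) can be sketched as below.  The enum, struct and function names
are hypothetical.  The sketch also assumes that a single qualifying
device is enough to allow wbinvd and that a force-disable device does
not veto the others; the quoted proposal does not say how mixed modes
combine.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encoding of the three no-snoop modes in point 1). */
enum sketch_nosnoop_mode {
	SKETCH_NOSNOOP_ENABLE = 0,		/* allow WBINVD */
	SKETCH_NOSNOOP_AUTO_DISABLE = 1,	/* block if the IOMMU can police it */
	SKETCH_NOSNOOP_FORCE_DISABLE = 2,	/* never allow WBINVD */
};

/* Hypothetical record of one device joined to an IOASID. */
struct sketch_attachment {
	enum sketch_nosnoop_mode mode;
	bool iommu_can_block_nosnoop;
};

/*
 * Point 2): wbinvd is allowed when a device is joined with mode 0, or
 * with mode 1 while the IOMMU cannot block no-snoop.
 */
bool sketch_mode_wbinvd_allowed(const struct sketch_attachment *att,
				size_t n)
{
	for (size_t i = 0; i < n; i++) {
		switch (att[i].mode) {
		case SKETCH_NOSNOOP_ENABLE:
			return true;
		case SKETCH_NOSNOOP_AUTO_DISABLE:
			if (!att[i].iommu_can_block_nosnoop)
				return true;
			break;
		case SKETCH_NOSNOOP_FORCE_DISABLE:
			break;
		}
	}
	return false;
}
```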

Right, and in the new world order, vfio is only a device driver, the
IOASID manages the device's DMA.  wbinvd is only necessary relative to
non-coherent DMA, so it seems QEMU needs to bump KVM with an
ioasidfd.

> It is pretty simple from a /dev/ioasid perspective, covers today's
> compat requirement, gives some future option to allow the no-snoop
> optimization, and gives a new option for qemu to totally block wbinvd
> no matter what.

What do you imagine is the use case for totally blocking wbinvd?  In
the model I describe, wbinvd would always be a no-op/known-errno when
the IOASIDs are all allocated as coherent or a non-coherent IOASID has
only coherent-only devices attached.  Does userspace need a way to
prevent itself from scenarios where wbinvd is not a no-op?

In general I'm having trouble wrapping my brain around the semantics of
the enable/automatic/force-disable wbinvd specific proposal, sorry.
Thanks,

Alex