Re: [RFC] /dev/ioasid uAPI proposal

Jason Gunthorpe <jgg@xxxxxxxxxx> · Fri, 4 Jun 2021 20:01:08 -0300

On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote:
> On Fri, 4 Jun 2021 14:22:07 -0300
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> 
> > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > > On 04/06/21 18:03, Jason Gunthorpe wrote:  
> > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:  
> > > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > > 
> > > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > > to gain extra privilege.  
> > > > 
> > > > Okay, fine, let's turn the question on its head then.
> > > > 
> > > > VFIO should provide an IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > in a normal process context.
> > > > 
> > > > So, under what conditions do we want to allow VFIO to give a process
> > > > elevated access to the CPU:  
> > > 
> > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > which then would be on VFIO and not on KVM.    
> > 
> > At the end of the day we need an ioctl with two arguments:
> >  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> >  - The KVM FD to control wbinvd support on
> > 
> > Philosophically it doesn't matter too much which subsystem that ioctl
> > lives, but we have these obnoxious cross module dependencies to
> > consider.. 
> > 
> > Framing the question, as you have, to be about the process, I think
> > explains why KVM doesn't really care what is decided, so long as the
> > process and the VM have equivalent rights.
> > 
> > Alex, how about a more fleshed out suggestion:
> > 
> >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> >     it communicates its no-snoop configuration:
> 
> Communicates to whom?

To the /dev/iommu FD which will have to maintain a list of devices
attached to it internally.

> >      - 0 enable, allow WBINVD
> >      - 1 automatic disable, block WBINVD if the platform
> >        IOMMU can police it (what we do today)
> >      - 2 force disable, do not allow WBINVD ever
> 
> The only thing we know about the device is whether or not Enable
> No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> TLPs ("coherent-only") or it might ("assumed non-coherent").  

Here I am outlining the choices and also imagining we might want an
admin knob to select among the three.
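
As a rough sketch of how those three settings could map to a decision,
assuming a hypothetical enum and helper (none of these names exist in
the kernel or in this proposal's uAPI, and the "platform IOMMU can
police it" input is simply passed in as a flag):

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; not existing kernel uAPI. */
enum nosnoop_mode {
	NOSNOOP_ENABLE = 0,        /* enable, allow WBINVD */
	NOSNOOP_AUTO_DISABLE = 1,  /* block WBINVD if the platform IOMMU can police no-snoop */
	NOSNOOP_FORCE_DISABLE = 2, /* force disable, never allow WBINVD */
};

/*
 * Decide whether WBINVD may be allowed for a given mode.
 * iommu_can_block_nosnoop is an assumed input describing whether the
 * platform IOMMU is able to police no-snoop traffic.
 */
static bool wbinvd_allowed(enum nosnoop_mode mode, bool iommu_can_block_nosnoop)
{
	switch (mode) {
	case NOSNOOP_ENABLE:
		return true;
	case NOSNOOP_AUTO_DISABLE:
		/* Today's behavior: only block when the IOMMU can enforce it. */
		return !iommu_can_block_nosnoop;
	case NOSNOOP_FORCE_DISABLE:
		return false;
	}
	return false; /* unknown mode: fail closed */
}
```

Unknown values fail closed here; that is a choice made for the sketch,
not something the proposal specifies.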

> If we're putting the policy decision in the hands of userspace they
> should have access to wbinvd if they own a device that is assumed
> non-coherent AND it's attached to an IOMMU (page table) that is not
> blocking no-snoop (a "non-coherent IOASID").

There are two parts here, as Paolo was leading to: first, whether the
process has access to WBINVD, and second, whether such an allowed
process tells KVM to turn on WBINVD in the guest.

If the process has a device and it has a way to create a non-coherent
IOASID, then that process has access to WBINVD.

For security it doesn't matter if the process actually creates the
non-coherent IOASID or not. An attacker will simply do the steps that
give access to WBINVD.

The important detail is that access to WBINVD does not compel the
process to tell KVM to turn on WBINVD. So a qemu with access to WBINVD
can still choose to create a secure guest by always using IOMMU_CACHE
in its page tables and not asking KVM to enable WBINVD.

This proposal shifts this policy decision from the kernel to userspace.
qemu is responsible for determining whether KVM should enable wbinvd,
based on whether it was able to create its IOASIDs with IOMMU_CACHE.
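
A minimal sketch of what that userspace-side decision might look like,
assuming a hypothetical bookkeeping struct in the VMM (this is not qemu
code, and the struct and function names are invented for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical VMM-side record of one IOASID; not qemu or kernel code. */
struct ioasid_desc {
	bool uses_iommu_cache; /* true if its mappings were created with IOMMU_CACHE */
};

/*
 * Ask KVM to enable WBINVD in the guest only if at least one IOASID
 * could not be created with IOMMU_CACHE. If every IOASID is coherent,
 * the VMM keeps WBINVD off even though the process may have access.
 */
static bool should_enable_guest_wbinvd(const struct ioasid_desc *ioasids, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!ioasids[i].uses_iommu_cache)
			return true;
	}
	return false;
}
```

With no IOASIDs at all the helper returns false, which matches the idea
that having access to WBINVD does not by itself oblige the process to
turn it on in the guest.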

> Conversely, a user could create a non-coherent IOASID and attach any
> device to it, regardless of IOMMU backing capabilities.  Only if an
> assumed non-coherent device is attached would the wbinvd be allowed.

Right, this is exactly the point. Since the user gets to pick if the
IOASID is coherent or not then an attacker can always reach WBINVD
using only the device FD. Additional checks don't add to the security
of the process.

The additional checks you are describing add to the security of the
guest; however, qemu is capable of doing them without more help from
the kernel.

It is the strength of Paolo's model that KVM should be able to do
optionally less, but never more, than the process itself can do.

> > It is pretty simple from a /dev/ioasid perspective, covers today's
> > compat requirement, gives some future option to allow the no-snoop
> > optimization, and gives a new option for qemu to totally block wbinvd
> > no matter what.
> 
> What do you imagine is the use case for totally blocking wbinvd? 

If wbinvd is really important for security then an operator should
endeavor to turn it off. It can be safely turned off if the operator
understands the SRIOV devices they are using. ie if you are only using
mlx5 or an nvme then force it off and be secure, regardless of the
platform capability.

Jason