On Fri, 4 Jun 2021 14:22:07 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > >
> > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > to gain extra privilege.
> > >
> > > Okay, fine, let's turn the question on its head then.
> > >
> > > VFIO should provide an IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > application can make use of no-snoop optimizations. The ability of KVM
> > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > in a normal process context.
> > >
> > > So, under what conditions do we want to allow VFIO to give a process
> > > elevated access to the CPU:
> >
> > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > which then would be on VFIO and not on KVM.
>
> At the end of the day we need an ioctl with two arguments:
>  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
>  - The KVM FD to control wbinvd support on
>
> Philosophically it doesn't matter too much which subsystem that ioctl
> lives, but we have these obnoxious cross module dependencies to
> consider..
>
> Framing the question, as you have, to be about the process, I think
> explains why KVM doesn't really care what is decided, so long as the
> process and the VM have equivalent rights.
>
> Alex, how about a more fleshed out suggestion:
>
>  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
>     it communicates its no-snoop configuration:

Communicates to whom?

>      - 0 enable, allow WBINVD
>      - 1 automatic disable, block WBINVD if the platform
>        IOMMU can police it (what we do today)
>      - 2 force disable, do not allow WBINVD ever

The only thing we know about the device is whether or not Enable
No-snoop is hard wired to zero, ie. it either can't generate no-snoop
TLPs ("coherent-only") or it might ("assumed non-coherent").  If
we're putting the policy decision in the hands of userspace they should
have access to wbinvd if they own a device that is assumed non-coherent
AND it's attached to an IOMMU (page table) that is not blocking no-snoop
(a "non-coherent IOASID").

I think that means that the IOASID needs to be created (IOASID_ALLOC)
with a flag that specifies whether this address space is coherent
(IOASID_GET_INFO probably needs a flag/cap to expose if the system
supports this).  All mappings in this IOASID would use IOMMU_CACHE and
devices attached to it would be required to be backed by an IOMMU
capable of IOMMU_CAP_CACHE_COHERENCY (attach fails otherwise).  If only
these IOASIDs exist, access to wbinvd would not be provided.  (How does
a user provided page table work? - reserved bit set, user error?)

Conversely, a user could create a non-coherent IOASID and attach any
device to it, regardless of IOMMU backing capabilities.  Only if an
assumed non-coherent device is attached would the wbinvd be allowed.

I think that means that an EXECUTE_WBINVD ioctl lives on the IOASIDFD
and the IOASID world needs to understand the device's ability to
generate non-coherent DMA.  This wbinvd ioctl would be a no-op (or
some known errno) unless a non-coherent IOASID exists with a potentially
non-coherent device attached.

>     vfio_pci may want to take this from an admin configuration knob
>     someplace. It allows the admin to customize if they want.
>
>     If we can figure out a way to autodetect 2 from vfio_pci, all the
>     better
>
>  2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace
>     to access wbinvd so it can make use of the no snoop optimization.
>
>     wbinvd is allowed when:
>       - A device is joined with mode #0
>       - A device is joined with mode #1 and the IOMMU cannot block
>         no-snoop (today)
>
>  3) The IOASID's don't care about this at all. If IOMMU_EXECUTE_WBINVD
>     is blocked and userspace doesn't request to block no-snoop in the
>     IOASID then it is a userspace error.

In my model above, the IOASID is central to this.

>  4) The KVM interface is the very simple enable/disable WBINVD.
>     Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
>     to enable WBINVD at KVM.

Right, and in the new world order, vfio is only a device driver, the
IOASID manages the device's DMA.  wbinvd is only necessary relative to
non-coherent DMA, which seems to mean QEMU needs to bump KVM with an
ioasidfd.

> It is pretty simple from a /dev/ioasid perspective, covers today's
> compat requirement, gives some future option to allow the no-snoop
> optimization, and gives a new option for qemu to totally block wbinvd
> no matter what.

What do you imagine is the use case for totally blocking wbinvd?  In
the model I describe, wbinvd would always be a no-op/known-errno when
the IOASIDs are all allocated as coherent or a non-coherent IOASID has
only coherent-only devices attached.  Does userspace need a way to
prevent itself from scenarios where wbinvd is not a no-op?

In general I'm having trouble wrapping my brain around the semantics of
the enable/automatic/force-disable wbinvd specific proposal, sorry.
Thanks,

Alex
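
For reference, the coherent/non-coherent IOASID model described in the reply above can be sketched as a pair of predicates. This is only an illustrative sketch in plain C, not kernel code: the struct and function names (sketch_device, sketch_ioasid, sketch_attach_allowed, sketch_wbinvd_allowed) are hypothetical, and the "coherent" flag stands in for whatever flag IOASID_ALLOC might eventually take.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not kernel structures. */
struct sketch_device {
	/* Enable No-snoop hard wired to zero: "coherent-only" device. */
	bool nosnoop_hardwired_zero;
};

struct sketch_ioasid {
	bool coherent;                    /* assumed allocation-time flag */
	const struct sketch_device *devs; /* devices attached to this IOASID */
	size_t ndevs;
};

/*
 * Attach rule: a coherent IOASID requires an IOMMU that can enforce cache
 * coherency; a non-coherent IOASID accepts any device.
 */
static bool sketch_attach_allowed(const struct sketch_ioasid *ioasid,
				  bool iommu_cache_coherent)
{
	return !ioasid->coherent || iommu_cache_coherent;
}

/*
 * One IOASID justifies wbinvd only if it is non-coherent AND has at least
 * one "assumed non-coherent" device attached.
 */
static bool sketch_ioasid_permits_wbinvd(const struct sketch_ioasid *ioasid)
{
	assert(ioasid->devs != NULL || ioasid->ndevs == 0);

	if (ioasid->coherent)
		return false;

	for (size_t i = 0; i < ioasid->ndevs; i++)
		if (!ioasid->devs[i].nosnoop_hardwired_zero)
			return true;

	return false;
}

/*
 * wbinvd is a no-op (or known errno) unless some IOASID on the fd
 * satisfies the rule above.
 */
static bool sketch_wbinvd_allowed(const struct sketch_ioasid *ioasids,
				  size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (sketch_ioasid_permits_wbinvd(&ioasids[i]))
			return true;

	return false;
}
```

Under these assumptions, an fd holding only coherent IOASIDs, or non-coherent IOASIDs with only coherent-only devices, never gets wbinvd; attaching a single assumed non-coherent device to a non-coherent IOASID flips the result.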