> From: Alex Williamson <alex.williamson@xxxxxxxxxx> > Sent: Thursday, June 3, 2021 10:51 AM > > On Wed, 2 Jun 2021 19:45:36 -0300 > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote: > > > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP > > > from the guest page table... what page table? > > > > I see my confusion now, the phrasing in your earlier remark led me > > think this was about allowing the no-snoop performance enhancement in > > some restricted way. > > > > It is really about blocking no-snoop 100% of the time and then > > disabling the dangerous wbinvd when the block is successful. > > > > Didn't closely read the kvm code :\ > > > > If it was about allowing the optimization then I'd expect the guest to > > enable no-snoopable regions via it's vIOMMU and realize them to the > > hypervisor and plumb the whole thing through. Hence my remark about > > the guest page tables.. > > > > So really the test is just 'were we able to block it' ? > > Yup. Do we really still consider that there's some performance benefit > to be had by enabling a device to use no-snoop? This seems largely a > legacy thing. Yes, there is indeed performance benefit for device to use no-snoop, e.g. 8K display and some imaging processing path, etc. The problem is that the IOMMU for such devices is typically a different one from the default IOMMU for most devices. This special IOMMU may not have the ability of enforcing snoop on no-snoop PCI traffic then this fact must be understood by KVM to do proper mtrr/pat/wbinvd virtualization for such devices to work correctly. > > > > This support existed before mdev, IIRC we needed it for direct > > > assignment of NVIDIA GPUs. > > > > Probably because they ignored the disable no-snoop bits in the control > > block, or reset them in some insane way to "fix" broken bioses and > > kept using it even though by all rights qemu would have tried hard to > > turn it off via the config space. Processing no-snoop without a > > working wbinvd would be fatal. Yeesh > > > > But Ok, back the /dev/ioasid. This answers a few lingering questions I > > had.. > > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY > and !IOMMU_CAP_CACHE_COHERENCY > > domains. > > > > This doesn't actually matter. If you mix them together then kvm > > will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP > > anywhere in this VM. > > > > This if two IOMMU's are joined together into a single /dev/ioasid > > then we can just make them both pretend to be > > !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE. > > Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then > we > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's > available based on the per domain support available. That gives us the > most consistent behavior, ie. we don't have VMs emulating wbinvd > because they used to have a device attached where the domain required > it and we can't atomically remap with new flags to perform the same as > a VM that never had that device attached in the first place. > > > 2) How to fit this part of kvm in some new /dev/ioasid world > > > > What we want to do here is iterate over every ioasid associated > > with the group fd that is passed into kvm. > > Yeah, we need some better names, binding a device to an ioasid (fd) but > then attaching a device to an allocated ioasid (non-fd)... I assume > you're talking about the latter ioasid. > > > Today the group fd has a single container which specifies the > > single ioasid so this is being done trivially. > > > > To reorg we want to get the ioasid from the device not the > > group (see my note to David about the groups vs device rational) > > > > This is just iterating over each vfio_device in the group and > > querying the ioasid it is using. > > The IOMMU API group interfaces is largely iommu_group_for_each_dev() > anyway, we still need to account for all the RIDs and aliases of a > group. > > > Or perhaps more directly: an op attaching the vfio_device to the > > kvm and having some simple helper > > '(un)register ioasid with kvm (kvm, ioasid)' > > that the vfio_device driver can call that just sorts this out. > > We could almost eliminate the device notion altogether here, use an > ioasidfd_for_each_ioasid() but we really want a way to trigger on each > change to the composition of the device set for the ioasid, which is > why we currently do it on addition or removal of a group, where the > group has a consistent set of IOMMU properties. Register a notifier > callback via the ioasidfd? Thanks, > When discussing I/O page fault support in another thread, the consensus is that an device handle will be registered (by user) or allocated (return to user) in /dev/ioasid when binding the device to ioasid fd. From this angle we can register {ioasid_fd, device_handle} to KVM and then call something like ioasidfd_device_is_coherent() to get the property. Anyway the coherency is a per-device property which is not changed by how many I/O page tables are attached to it. Thanks Kevin