On Wed, 2 Jun 2021 16:54:04 -0300 Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:
>
> > Right, the device can generate the no-snoop transactions, but it's the
> > IOMMU that essentially determines whether those transactions are
> > actually still cache coherent, AIUI.
>
> Wow, this is really confusing stuff in the code.
>
> At the PCI level there is a TLP bit called no-snoop that is platform
> specific.  The general intention is to allow devices to selectively
> bypass the CPU caching for DMAs.  GPUs like to use this feature for
> performance.

Yes

> I assume there are some exciting security issues here.  Looks like
> allowing cache bypass does something bad inside VMs?  Looks like
> allowing the VM to use the cache clear instruction that is mandatory
> with cache bypass DMA causes some QOS issues?  OK.

IIRC, largely a DoS issue if userspace gets to choose when to emulate
wbinvd rather than it being demanded for correct operation.

> So how does it work?
>
> What I see in intel/iommu.c is that some domains support "snoop
> control" or not, based on some HW flag.  This indicates if the
> DMA_PTE_SNP bit is supported on a page by page basis or not.
>
> Since x86 always leans toward "DMA cache coherent" I'm reading some
> tea leaves here:
>
> 	IOMMU_CAP_CACHE_COHERENCY,	/* IOMMU can enforce cache coherent DMA
> 					   transactions */
>
> And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
> snoop bit in TLPs for IOVAs that have DMA_PTE_SNP set?

That's my understanding as well.

> Further, I guess IOMMUs that don't support PTE_SNP, or have
> DMA_PTE_SNP clear, will always honour the snoop bit.  (backwards
> compat and all)

Yes.

> So, IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
> incoherent with the CPU caches, it just means that the snoop bit in
> the TLP cannot be enforced, ie. the device *could* do no-snoop DMA
> if it wants.  Devices that never do no-snoop remain DMA coherent on
> x86, as they always have been.

Yes, IOMMU_CAP_CACHE_COHERENCY=false means we cannot force the device
DMA to be coherent via the IOMMU.

> IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
> the PCI device is blocked from using no-snoop in its TLPs.
>
> I wonder if ARM implemented this consistently?  I see VDPA is
> confused..

I was confused.  What a terrible set of names.

> In VFIO generic code I see it always sets IOMMU_CACHE:
>
> 	if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> 		domain->prot |= IOMMU_CACHE;
>
> And thus also always provides IOMMU_CACHE to iommu_map:
>
> 	ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
> 			npage << PAGE_SHIFT, prot | d->prot);
>
> So when the IOMMU supports the no-snoop blocking security feature VFIO
> turns it on and blocks no-snoop to all pages?  Ok..

Yep, I'd forgotten this nuance that we need to enable it via the
mapping flags.
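Fleshing that out a bit, since the KVM angle comes up below: the chain
as I remember it, written as a standalone toy model rather than the
actual vfio/kvm-vfio code (every function name below is made up for
illustration, only the flag/extension names are real):

/* toy model of the decision chain; not kernel code */
#include <stdbool.h>
#include <stdio.h>

/*
 * vfio type1: a domain gets IOMMU_CACHE in its mapping prot iff the
 * IOMMU reports IOMMU_CAP_CACHE_COHERENCY, i.e. iff it can override
 * no-snoop in its page tables.
 */
static bool domain_enforces_snoop(bool cap_cache_coherency)
{
	return cap_cache_coherency;
}

/*
 * vfio: the container advertises VFIO_DMA_CC_IOMMU only when every
 * domain in it enforces snoop.
 */
static bool container_is_cache_coherent(const bool *domains, int n)
{
	for (int i = 0; i < n; i++)
		if (!domain_enforces_snoop(domains[i]))
			return false;
	return true;
}

/*
 * kvm-vfio device: non-coherent DMA is registered with KVM, and guest
 * wbinvd is honored rather than ignored, only when the container
 * can't guarantee coherency.
 */
static bool kvm_emulates_wbinvd(const bool *domains, int n)
{
	return !container_is_cache_coherent(domains, n);
}

int main(void)
{
	bool domains[] = { true, true };	/* both IOMMUs have snoop control */

	printf("wbinvd emulated: %s\n",
	       kvm_emulates_wbinvd(domains, 2) ? "yes" : "no");
	return 0;
}

IOW, the only question KVM ever gets asked is "can every IOMMU in the
container override no-snoop", never whether any device actually uses
it.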
> But I must be missing something big because *something* in the IOVA
> map should work with no-snoopable DMA, right?  Otherwise what is the
> point of exposing the invalidate instruction to the guest?
>
> I would think userspace should be relaying the DMA_PTE_SNP bit from
> the guest's page tables up to here??
>
> The KVM hookup is driven by IOMMU_CACHE which is driven by
> IOMMU_CAP_CACHE_COHERENCY.  So we turn on the special KVM support
> only if the IOMMU can block the SNP bit?  And then we map all the
> pages to block the snoop bit?  Huh?

Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
from the guest page table... what page table?  We don't necessarily
have a vIOMMU to expose such things, I don't think it even existed
when this was added.

Essentially if we can ignore no-snoop at the IOMMU, then KVM doesn't
need to worry about emulating wbinvd because of an assigned device,
whether that device uses it or not.  Win-win.

> Your explanation makes perfect sense: Block guests from using the
> dangerous cache invalidate instruction unless a device that uses
> no-snoop is plugged in.  Block devices from using no-snoop because
> something about it is insecure.  Ok.

No-snoop itself is not insecure, but to support no-snoop in a VM KVM
can't ignore wbinvd, which has overhead and abuse implications.

> But the conditions I'm looking for in "device that uses no-snoop"
> are:
>  - The device will issue no-snoop TLPs at all

We can't really know this generically.  We can try to set the enable
bit to see if the device is capable of no-snoop, but that doesn't mean
it will use no-snoop.

>  - The IOMMU will let no-snoop through
>  - The platform will honor no-snoop
>
> Only if all three are met should we allow the dangerous instruction
> in KVM, right?

We test at the IOMMU and assume that the IOMMU knowledge encompasses
whether the platform honors no-snoop (note for example how amd and arm
report true for IOMMU_CAP_CACHE_COHERENCY but seem to ignore the
IOMMU_CACHE flag).  We could probably use an iommu_group_for_each_dev
to test if any devices within the group are capable of no-snoop if the
IOMMU can't protect us (rough sketch in the P.S. below), but at the
time it didn't seem worthwhile.  I'm still not sure if it is.

> Which brings me back to my original point - this is at least
> partially a device specific behavior.  It depends on the content of
> the IOMMU page table, it depends if the device even supports no-snoop
> at all.
>
> My guess is this works correctly for the mdev Intel kvmgt which
> probably somehow allows no-snoop DMA through the mdev SW iommu
> mappings.  (assuming I didn't miss a tricky iommu_map without
> IOMMU_CACHE set in the type1 code?)

This support existed before mdev, IIRC we needed it for direct
assignment of NVIDIA GPUs.

> But why is vfio-pci using it?  Hmm?

Use the IOMMU to reduce hypervisor overhead, let the hypervisor learn
about it, ignore the subtleties of whether the device actually uses
no-snoop as imprecise and poor ROI given the apparent direction of
hardware.  ¯\_(ツ)_/¯,

Alex
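P.S. For completeness, the iommu_group_for_each_dev() idea above would
look something like the untested sketch below.  The helper names and
the policy it encodes (treat any PCIe device with Enable No Snoop set
in DEVCTL as a potential no-snoop user) are mine, not anything in the
tree, and since that bit is set nearly everywhere by default it's also
part of why this never looked like good ROI:

/* untested sketch, helper names are made up */
#include <linux/iommu.h>
#include <linux/pci.h>

static int vfio_dev_may_nosnoop(struct device *dev, void *data)
{
	struct pci_dev *pdev;
	u16 devctl = 0;

	if (!dev_is_pci(dev))
		return 1;	/* unknown bus, assume the worst */

	pdev = to_pci_dev(dev);
	if (!pci_is_pcie(pdev))
		return 0;	/* no no-snoop TLP attribute on conventional PCI */

	pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &devctl);
	return !!(devctl & PCI_EXP_DEVCTL_NOSNOOP_EN);
}

/* Non-zero if any device in the group might generate no-snoop TLPs. */
static int vfio_group_may_nosnoop(struct iommu_group *group)
{
	return iommu_group_for_each_dev(group, NULL, vfio_dev_may_nosnoop);
}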