> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Friday, October 1, 2021 6:24 AM
>
> On Thu, Sep 30, 2021 at 09:35:45AM +0000, Tian, Kevin wrote:
>
> > > The Intel functional issue is that Intel blocks the cache maintaince
> > > ops from the VM and the VM has no way to self-discover that the cache
> > > maintaince ops don't work.
> >
> > the VM doesn't need to know whether the maintenance ops
> > actually works.
>
> Which is the whole problem.
>
> Intel has a design where the device driver tells the device to issue
> non-cachable TLPs.
>
> The driver is supposed to know if it can issue the cache maintaince
> instructions - if it can then it should ask the device to issue
> no-snoop TLPs.
>
> For instance the same PCI driver on non-x86 should never ask the
> device to issue no-snoop TLPs because it has no idea how to restore
> cache coherence on eg ARM.
>
> Do you see the issue? This configuration where the hypervisor silently
> make wbsync a NOP breaks the x86 architecture because the guest has no
> idea it can no longer use no-snoop features.

Thanks for the explanation. But I am still puzzled about the 'break'
part.

If the hypervisor makes wbinvd a NOP, then it will also set the
enforce_snoop bit in the PTE to convert non-snoop packets to snoop.
No function in the guest is broken; only the performance may lag.

If performance matters, then the hypervisor configures the IOMMU to
allow non-snoop packets and emulates wbinvd properly.

The contract between vfio and kvm is to convey the above requirement
on how wbinvd is handled.

But in both cases the cache maintenance instructions are available from
the guest's p.o.v. and no coherency semantics are violated.

>
> Using the IOMMU to forcibly prevent the device from issuing no-snoop
> makes this whole issue of the broken wbsync moot.

It does not prevent issuing. Instead, the IOMMU converts non-snoop
requests to snoop.
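
To make the two cases above concrete, here is a minimal sketch in C.
All names in it (dma_policy, choose_policy, policy_is_coherent and the
two parameters) are made up for illustration and are not vfio, kvm or
iommu interfaces. It also assumes that when the IOMMU cannot enforce
snoop at all, the only safe choice is to emulate wbinvd.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration only; not kernel APIs. */
enum wbinvd_mode {
	WBINVD_NOP,		/* guest wbinvd is ignored by the hypervisor */
	WBINVD_EMULATED,	/* guest wbinvd is really carried out */
};

struct dma_policy {
	bool enforce_snoop;	/* IOMMU PTE converts non-snoop to snoop */
	enum wbinvd_mode wbinvd;
};

/*
 * The invariant argued for above: wbinvd may be a NOP only when the
 * IOMMU turns every non-snoop request into a snooped one.
 */
static bool policy_is_coherent(struct dma_policy p)
{
	return p.wbinvd == WBINVD_EMULATED || p.enforce_snoop;
}

/*
 * iommu_can_enforce_snoop: the IOMMU supports snoop enforcement in PTEs.
 * want_nosnoop_perf: the hypervisor prefers to let non-snoop traffic
 *                    through for performance.
 */
static struct dma_policy choose_policy(bool iommu_can_enforce_snoop,
				       bool want_nosnoop_perf)
{
	struct dma_policy p;

	if (iommu_can_enforce_snoop && !want_nosnoop_perf) {
		/* Case 1: force snoop, so wbinvd can safely be a NOP. */
		p.enforce_snoop = true;
		p.wbinvd = WBINVD_NOP;
	} else {
		/* Case 2: allow non-snoop, so wbinvd must be emulated. */
		p.enforce_snoop = false;
		p.wbinvd = WBINVD_EMULATED;
	}

	assert(policy_is_coherent(p));
	return p;
}
```

The combination that would break the guest (wbinvd as a NOP while
non-snoop traffic is allowed through) is never produced by this sketch.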
>
> It is important to be really clear on what this is about - this is not
> some idealized nice iommu feature - it is working around alot of
> backwards compatability baggage that is probably completely unique to
> x86.
>
> > > Other arches don't seem to have this specific problem...
> >
> > I think the key is whether other archs allow driver to decide DMA
> > coherency and indirectly the underlying I/O page table format.
> > If yes, then I don't see a reason why such decision should not be
> > given to userspace for passthrough case.
>
> The choice all comes down to if the other arches have cache
> maintenance instructions in the VM that *don't work*
>

It looks like vfio always sets IOMMU_CACHE on all platforms as long as
the iommu supports it (true on all platforms except the intel iommu
which is dedicated for GPU):

vfio_iommu_type1_attach_group()
{
	...
	if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
		domain->prot |= IOMMU_CACHE;
	...
}

Should the above be set according to whether a device is coherent?

Thanks
Kevin