> BTW: update some findings regarding to the previous bug with Nvidia GPU
> assignment:
> I found that after v5.19-rc1+, even with nx_huge_pages=N, the bug is not
> reproducible when only leaf entries of memslot are zapped.
> (no more detailed info due to limited time to debug).

+Alex, Weijiang, and Kevin

Some updates on the Nvidia GPU assignment issue.

The good news is that I may have identified the root cause of this issue.
However, given the root cause, I'm not 100% sure that the issue I observed
is the same one reported by Alex. So it still needs Alex's confirmation and
help to verify it in the original environment.

== My Environment ==
With help from Weijiang, I was able to reproduce the issue using a GeForce
GT 640 on a KBL desktop. Besides the GeForce GT 640 assigned to the guest,
this KBL desktop has an Intel IGD device, which is used by the host OS.
The guest OS is win10. Guest workloads: a video player + furmark + passmark.

I can observe error patterns that are very similar to those described by
Alex at [1] on kernel tags before v5.19-rc1.

- I can observe the error patterns on kernel tag v5.3-rc4.
  (It uses the zap-only-memslot logic, and Alex reported at [2] that this
  version had the issue.)

- From tag v5.3-rc6 to v5.19-rc1, zap-only-memslot was reverted.
  From tag v5.4-rc8, commit b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT
  mitigation") was introduced (though if I directly check out this commit,
  the kernel version is 5.4.0-rc6).
  I can reproduce the issue on those kernel versions by adding back and
  forcing the zap-only-memslot logic, and setting kvm.nx_huge_pages=N.
  (Previously Weijiang found that with kvm.nx_huge_pages=Y, the issue was
  not reproducible [3].)

- If I switched back to zap-all in all those versions, the error patterns
  were not observable.

== Root Cause ==
I found that with commit fc0051cb9590 ("iommu/vt-d: Check domain
force_snooping against attached devices"), the issue was not reproducible.
(I only bisected kernel tags.
This commit first appeared in tag v5.19-rc1.)

Further analysis (with Kevin's help) shows that after commit fc0051cb9590
("iommu/vt-d: Check domain force_snooping against attached devices"), VFIO
always detected the Nvidia GPU as a coherent DMA device.
Prior to that commit, VFIO detected the Nvidia GPU as a non-coherent DMA
device by querying cache coherency from the Intel IOMMU driver, which,
however, incorrectly reported failure if any IOMMU lacked snoop control
support.
As a result, if the machine had an Intel IGD device,
- on the Intel IOMMU driver side, snooping was not enforced for the
  assigned Nvidia GPU in the IOMMU SLPT;
- on the KVM side, kvm_arch_has_noncoherent_dma() returned true, so KVM
  would emulate guest WBINVD.

In KVM's vmx_get_mt_mask(), with non-coherent DMA devices attached (using
the code in tag v5.3-rc4 as an example):
- when guest CD=1 && kvm_check_has_quirk(vcpu->kvm,
  KVM_X86_QUIRK_CD_NW_CLEARED), the EPT memtype is
  MTRR_TYPE_WRBACK | VMX_EPT_IPAT_BIT;
- when CD=0, the EPT memtype is the guest MTRR type (without
  VMX_EPT_IPAT_BIT).

static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
	u8 cache;
	u64 ipat = 0;

	if (is_mmio) {
		cache = MTRR_TYPE_UNCACHABLE;
		goto exit;
	}

	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
		ipat = VMX_EPT_IPAT_BIT;
		cache = MTRR_TYPE_WRBACK;
		goto exit;
	}

	if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
		ipat = VMX_EPT_IPAT_BIT;
		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
			cache = MTRR_TYPE_WRBACK;
		else
			cache = MTRR_TYPE_UNCACHABLE;
		goto exit;
	}

	cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);

exit:
	return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
}

However, with this vmx_get_mt_mask() implementation, KVM did not zap EPT
on CD toggling.

Once I applied patch [4], the error patterns previously observed were
immediately gone and the guest OS appeared quite stable.
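To make the stale-memtype scenario concrete, here is a minimal userspace
model of it. This is only a sketch, not KVM code: struct model_vcpu,
struct model_epte and all model_*() helpers are hypothetical names made up
for illustration, and the single cached leaf stands in for an EPT entry
that is reused until it is zapped. The constants carry the values from the
kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in the Linux kernel headers. */
#define MTRR_TYPE_UNCACHABLE	0
#define MTRR_TYPE_WRBACK	6
#define VMX_EPT_MT_EPTE_SHIFT	3
#define VMX_EPT_IPAT_BIT	(1ULL << 6)

/* Hypothetical stand-in for the state vmx_get_mt_mask() consults. */
struct model_vcpu {
	bool noncoherent_dma;		/* kvm_arch_has_noncoherent_dma() */
	bool cr0_cd;			/* guest CR0.CD */
	bool quirk_cd_nw_cleared;	/* KVM_X86_QUIRK_CD_NW_CLEARED set */
	uint8_t guest_mtrr_type;	/* guest MTRR type of the GFN */
};

/* Hypothetical single EPT leaf: memtype bits are fixed at fault time. */
struct model_epte {
	bool present;
	uint64_t mt_mask;
};

/* Same decision order as the v5.3-rc4 vmx_get_mt_mask() quoted above. */
static uint64_t model_get_mt_mask(const struct model_vcpu *v, bool is_mmio)
{
	uint64_t ipat = 0;
	uint8_t cache;

	if (is_mmio) {
		cache = MTRR_TYPE_UNCACHABLE;
	} else if (!v->noncoherent_dma) {
		ipat = VMX_EPT_IPAT_BIT;
		cache = MTRR_TYPE_WRBACK;
	} else if (v->cr0_cd) {
		ipat = VMX_EPT_IPAT_BIT;
		cache = v->quirk_cd_nw_cleared ? MTRR_TYPE_WRBACK
					       : MTRR_TYPE_UNCACHABLE;
	} else {
		cache = v->guest_mtrr_type;
	}

	return ((uint64_t)cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
}

/* A fault installs the leaf only if none exists; an old leaf is reused. */
static void model_fault(struct model_epte *e, const struct model_vcpu *v)
{
	if (!e->present) {
		e->mt_mask = model_get_mt_mask(v, false);
		e->present = true;
	}
}

static void model_zap(struct model_epte *e)
{
	e->present = false;
	e->mt_mask = 0;
}

/*
 * Touch a GFN while CD=1, clear CD, touch it again. Returns true if the
 * leaf still carries a memtype that no longer matches the current state.
 */
bool model_entry_is_stale_after_cd_clear(uint8_t guest_mtrr_type,
					 bool zap_on_toggle)
{
	struct model_vcpu v = {
		.noncoherent_dma = true,
		.cr0_cd = true,
		.quirk_cd_nw_cleared = true,
		.guest_mtrr_type = guest_mtrr_type,
	};
	struct model_epte e = { 0 };

	model_fault(&e, &v);	/* created as WB + IPAT while CD=1 */
	assert(e.mt_mask == (((uint64_t)MTRR_TYPE_WRBACK <<
			      VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT));

	v.cr0_cd = false;	/* guest clears CD */
	if (zap_on_toggle)
		model_zap(&e);
	model_fault(&e, &v);	/* re-access after the toggle */

	return e.mt_mask != model_get_mt_mask(&v, false);
}
```

In this model, model_entry_is_stale_after_cd_clear(MTRR_TYPE_WRBACK, false)
returns true: the leaf keeps WB + IPAT although WB without IPAT is expected
once CD=0. With zap_on_toggle set to true it returns false, because the
leaf is rebuilt with the current memtype.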
Besides applying patch [4], changing vmx_get_mt_mask() as shown below also
made the issue non-reproducible, even though KVM still did not zap EPT for
CD toggling and update_mtrr().

static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
	if (is_mmio)
		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
}

So, my conclusion is that the Nvidia GPU assignment issue was caused by the
lack of EPT zapping when the guest toggles CD. (The CD toggling occurs
per-vCPU during guest bootup for enabling guest MTRRs.)

The lack of EPT zapping was previously masked by the zap-all operations for
memslot deletions during guest bootup. However, the error became apparent
when only the memslot's EPT entries were zapped. (The guest may have
accessed a GPA while CD=1, creating an EPT entry whose memtype is no longer
correct after CD=0.)

The ITLB_MULTIHIT mitigation [3] splits non-executable huge pages in EPT to
create executable 4K pages.
For example, I observed that GFNs 0xa00 and 0xc00 were initially mapped as
2M with EPT memtype=WB. They were then mapped as 2M with EPT
memtype=WB+IPAT when guest CD=1. Some seconds later during guest boot, they
were split into 4K pages with EPT memtype=WB.

The split may also mitigate the lack of zapping for CD toggling to a great
extent. In my environment, the guest appeared quite stable with
"zap-only-memslot + kvm.nx_huge_pages=Y". However, the benchmarks sometimes
still showed around 10 errors in that case, compared to 1000+ errors with
"zap-only-memslot + kvm.nx_huge_pages=N".

== Request Help ==
So, Alex, do you recall if there was an IGD device in your original
environment? If so, and if that environment is still available, could you
please help verify whether patch [4] resolves the issue?

Thank you, your help is greatly appreciated!

[1] https://lore.kernel.org/kvm/20200713190649.GE29725@xxxxxxxxxxxxxxx/T/#mc45b9f909731d70551b4e10cff5a58d34a155e71
[2] https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@xxxxxxxxx/
[3] https://lore.kernel.org/kvm/20200713190649.GE29725@xxxxxxxxxxxxxxx/T/#m1839c85392a7a022df9e507876bb241c022c4f06
[4]