Re: [PATCH v2 1/4] KVM: x86/mmu: Introduce a quirk to control memslot zap behavior

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Tue, 26 Nov 2024 17:15:57 +0800

> BTW: update some findings regarding to the previous bug with Nvidia GPU
> assignment:
> I found that after v5.19-rc1+, even with nx_huge_pages=N, the bug is not
> reproducible when only leaf entries of memslot are zapped.
> (no more detailed info due to limited time to debug).
+Alex, Weijiang, and Kevin

Some updates on the Nvidia GPU assignment issue.
The good news is that I may have identified the root cause of this issue.
However, given the root cause, I'm not 100% sure that the issue I observed is
the same one reported by Alex. So it still needs Alex's confirmation and help to
verify it in the original environment.

== My Environment ==
With help from Weijiang, I was able to reproduce the issue using a
GeForce GT 640 on a KBL desktop.
Besides the GeForce GT 640 assigned to the guest, this KBL desktop has an Intel
IGD device, which is used by the host OS.
The guest OS is Windows 10. Guest workloads: a video player + furmark + passmark.

I can observe error patterns that are very similar to those described by Alex
at [1] on kernel tags before v5.19-rc1.
- I can observe the error patterns on kernel tag v5.3-rc4.
  (It uses the zap-only-memslot logic, and Alex reported at [2] that this
   version had the issue.)
- From tag v5.3-rc6 to v5.19-rc1, zap-only-memslot was reverted.
  From tag v5.4-rc8, commit b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
  was introduced (though if I check out this commit directly, the kernel
  version is 5.4.0-rc6).
  I can reproduce the issue on those kernel versions by adding back and forcing
  the zap-only-memslot logic, and setting kvm.nx_huge_pages=N.
  (Previously Weijiang found out that with kvm.nx_huge_pages=Y, the issue was
   not reproducible [3]).
- When I switched back to zap-all in all of those versions, the error patterns
  were not observable.

== Root Cause ==
I found that with commit fc0051cb9590 ("iommu/vt-d: Check domain
force_snooping against attached devices") applied, the issue was not
reproducible. (I only bisected kernel tags; this commit first appeared in tag
v5.19-rc1.)

Further analysis (with Kevin's help) shows that after commit fc0051cb9590
("iommu/vt-d: Check domain force_snooping against attached devices"), VFIO
always detected the Nvidia GPU device as a coherent DMA device. Prior to that
commit, VFIO detected the Nvidia GPU device as a non-coherent DMA device
because it queried cache coherency from the Intel IOMMU driver, which
incorrectly reported failure if any IOMMU lacked snoop control support.

As a result, if the machine had an Intel IGD device:
- on the Intel IOMMU driver side, snooping was not enforced for the assigned
  Nvidia GPU device in the IOMMU SLPT;
- on the KVM side, kvm_arch_has_noncoherent_dma() was true, so KVM emulated
  guest WBINVD.
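
To illustrate the difference between the two coherency checks described above,
here is a minimal standalone sketch. It is not the Intel IOMMU driver code: the
struct, its fields, and both function names are hypothetical, and the example
assumes that the IOMMU serving the IGD is the unit lacking snoop control.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one IOMMU unit; not the kernel's struct intel_iommu. */
struct iommu_unit {
        bool snoop_control;     /* unit is able to force snooping of DMA */
        bool serves_domain;     /* unit has a device attached to the domain */
};

/*
 * Old behavior as described above: the query fails as soon as any unit in
 * the machine lacks snoop control, even a unit unrelated to the domain.
 */
static bool coherent_any_unit(const struct iommu_unit *units, size_t n)
{
        for (size_t i = 0; i < n; i++)
                if (!units[i].snoop_control)
                        return false;
        return true;
}

/* Behavior after the fix: only units with devices in the domain count. */
static bool coherent_attached_only(const struct iommu_unit *units, size_t n)
{
        for (size_t i = 0; i < n; i++)
                if (units[i].serves_domain && !units[i].snoop_control)
                        return false;
        return true;
}
```

With one unit for the IGD (no snoop control, not in the domain) and one unit for
the assigned GPU (snoop control, in the domain), the first check reports
non-coherent DMA and the second reports coherent DMA.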

In KVM's vmx_get_mt_mask(), with non-coherent DMA devices attached,
(using the code in tag v5.3-rc4 as an example):
- when guest CD=1 && kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED),
  the EPT memtype is MTRR_TYPE_WRBACK | VMX_EPT_IPAT_BIT;
- when CD=0, the EPT memtype is guest MTRR type (without VMX_EPT_IPAT_BIT).

static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
        u8 cache;
        u64 ipat = 0;

        if (is_mmio) {
                cache = MTRR_TYPE_UNCACHABLE;
                goto exit;
        }

        if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
                ipat = VMX_EPT_IPAT_BIT;
                cache = MTRR_TYPE_WRBACK;
                goto exit;
        }

        if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
                ipat = VMX_EPT_IPAT_BIT;
                if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
                        cache = MTRR_TYPE_WRBACK;
                else
                        cache = MTRR_TYPE_UNCACHABLE;
                goto exit;
        }

        cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);

exit:
        return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
}

However, with this vmx_get_mt_mask() implementation, KVM did not zap EPT when
the guest toggled CD.
So with patch [4] applied, the previously observed error patterns disappeared
immediately and the guest OS appeared quite stable.

Alternatively, when I changed vmx_get_mt_mask() as shown below, the issue was
not reproducible even though KVM did not zap EPT for CD toggling and
update_mtrr().

static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
        if (is_mmio)
                return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

        if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

        return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
}

So, my conclusion is that the Nvidia GPU assignment issue was caused by the lack
of EPT zapping when the guest toggles CD. (The CD toggling occurs per-vCPU
during guest bootup for enabling guest MTRRs.)
The lack of EPT zapping was previously masked by the zap-all operations for
memslot deletions during guest bootup. However, the error became apparent
when only memslot EPT entries were zapped. (The guest may have accessed a GPA
while CD=1, creating an EPT entry whose memtype was no longer correct after
CD=0.)
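
The stale-memtype scenario can be sketched with a small standalone model. This
is only an illustration, not KVM code: the toy_* names, the constants, and the
flat per-GFN array are all hypothetical, and the model assumes non-coherent DMA
with KVM_X86_QUIRK_CD_NW_CLEARED enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy constants for this model only. */
#define TOY_MT_UC       0u
#define TOY_MT_WB       6u
#define TOY_MT_SHIFT    3
#define TOY_IPAT        (1u << 6)
#define TOY_NR_GFNS     4
#define TOY_EPTE_NONE   UINT32_MAX

struct toy_vm {
        bool cr0_cd;                    /* guest CR0.CD */
        uint8_t guest_mtrr_type;        /* one guest MTRR type for all GFNs */
        uint32_t epte[TOY_NR_GFNS];     /* cached memtype bits per GFN */
};

/* Memtype bits a fresh mapping would get in the current CD state. */
static uint32_t toy_mt_mask(const struct toy_vm *vm)
{
        if (vm->cr0_cd)
                return (TOY_MT_WB << TOY_MT_SHIFT) | TOY_IPAT;
        return (uint32_t)vm->guest_mtrr_type << TOY_MT_SHIFT;
}

static void toy_zap_all(struct toy_vm *vm)
{
        for (unsigned int i = 0; i < TOY_NR_GFNS; i++)
                vm->epte[i] = TOY_EPTE_NONE;
}

static void toy_init(struct toy_vm *vm, uint8_t guest_mtrr_type)
{
        vm->cr0_cd = false;
        vm->guest_mtrr_type = guest_mtrr_type;
        toy_zap_all(vm);
}

/* Guest access: fault in the entry if missing, otherwise reuse it as is. */
static uint32_t toy_access(struct toy_vm *vm, unsigned int gfn)
{
        assert(gfn < TOY_NR_GFNS);
        if (vm->epte[gfn] == TOY_EPTE_NONE)
                vm->epte[gfn] = toy_mt_mask(vm);
        return vm->epte[gfn];
}

/* Toggle CD; existing entries are dropped only when zap is true. */
static void toy_set_cd(struct toy_vm *vm, bool cd, bool zap)
{
        vm->cr0_cd = cd;
        if (zap)
                toy_zap_all(vm);
}
```

In this model, an entry faulted in while CD=1 keeps its WB+IPAT bits after
toy_set_cd(vm, false, false), so later accesses see a memtype that differs from
toy_mt_mask(). Passing zap=true on the toggle makes the next access refault
with the guest MTRR type.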

The ITLB_MULTIHIT mitigation [3] splits non-executable huge pages in EPT to
create executable 4K pages. For example, I observed that GFNs 0xa00 and 0xc00
were initially mapped as 2M with EPT memtype=WB. They were then mapped as 2M
with EPT memtype=WB+IPAT when guest CD=1. A few seconds into guest boot, they
were split to 4K with EPT memtype=WB. The split may therefore also mitigate the
lack of zapping for CD toggling to a great extent.
In my environment, the guest appeared quite stable with
"zap-only-memslot + kvm.nx_huge_pages=Y". However, the benchmarks sometimes
still showed around 10 errors in that case, compared to 1000+ errors with
"zap-only-memslot + kvm.nx_huge_pages=N".

== Request Help ==
So, Alex, do you recall if there was an IGD device in your original environment?
If so and if that environment is still available, could you please help verify
if patch [4] resolves the issue?

Thank you. Your help is greatly appreciated!

[1] https://lore.kernel.org/kvm/20200713190649.GE29725@xxxxxxxxxxxxxxx/T/#mc45b9f909731d70551b4e10cff5a58d34a155e71
[2] https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@xxxxxxxxx/
[3] https://lore.kernel.org/kvm/20200713190649.GE29725@xxxxxxxxxxxxxxx/T/#m1839c85392a7a022df9e507876bb241c022c4f06
[4]