On Fri, Mar 25, 2022, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
> > Actually, since this is apparently specific to kvm_zap_gfn_range(), can you add
> > printk "tracing" in update_mtrr(), kvm_post_set_cr0(), and __kvm_request_apicv_update()
> > to see what is actually triggering zaps? Capturing the start and end GFNs would be very
> > helpful for the MTRR case.
> >
> > The APICv update seems unlikely to affect only Hyper-V guests, though there is the auto
> > EOI crud. And the other two only come into play with non-coherent DMA. In other words,
> > figuring out exactly what sequence leads to failure should be straightforward.
>
> The tricky part here is that Hyper-V doesn't crash immediately, the
> crash is always different (if you look at the BSOD) and happens at
> different times. Crashes mention various stuff like trying to execute
> non-executable memory, ...
>
> I've added the tracing you suggested:
>
> - __kvm_request_apicv_update() happens only once in the very beginning.

And thinking through this again, APICv changes should never result in a shadow
page being zapped, as they'll only zap a 4k range.

> - update_mtrr() never actually reaches kvm_zap_gfn_range()
>
> - kvm_post_set_cr0() happens in early boot but the crash happens much,
>   much later. E.g.:

Ah rats, I got the sequencing of the revert messed up. mmu_notifier is also in
play, via kvm_tdp_mmu_unmap_gfn_range().

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4fa4d8269e5b..db7c5a05e574 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -870,6 +870,8 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>
>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>  {
> +	trace_printk("vCPU %d %lx %lx\n", vcpu->vcpu_id, old_cr0, cr0);

This doesn't guarantee kvm_zap_gfn_range() will be reached. The guest has to have
non-coherent DMA and be running with the CD/NW memtype quirk disabled. Moving the
print inside the if statement would show whether KVM is actually zapping in those
cases:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3a9ce07a565..25c7d8fc3287 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -887,8 +887,10 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
 	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
-	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
+	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
+		trace_printk("vCPU %d %lx %lx\n", vcpu->vcpu_id, old_cr0, cr0);
 		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr0);

> The kvm_hv_set_msr_pw() call is when Hyper-V writes to HV_X64_MSR_CRASH_CTL
> (the 'hv-crash' QEMU flag is needed to enable the feature). The debug patch
> is:
>
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index a32f54ab84a2..59a72f6ced99 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1391,6 +1391,7 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
>
>  			/* Send notification about crash to user space */
>  			kvm_make_request(KVM_REQ_HV_CRASH, vcpu);
> +			trace_printk("%d\n", vcpu->vcpu_id);
>  		}
>  		break;
>  	case HV_X64_MSR_RESET:
>
> So it's 20 seconds (!) between the last kvm_post_set_cr0() call and the
> crash. My (disappointing) conclusion is: the problem can be anywhere, and
> Hyper-V detects it much, much later.

And reproduced... 'twas indeed the mmu_notifier. Hyper-V 2019 booted just fine,
until I turned on KSM and cranked up the scanning.
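For reference, "turning on KSM and cranking up the scanning" boils down to poking
the knobs under /sys/kernel/mm/ksm/. A minimal sketch of that kind of setup (the
helper and the specific values are illustrative, not the exact settings I used):

#include <stdio.h>

/* Write @val to the KSM sysfs knob @knob, e.g. /sys/kernel/mm/ksm/run. */
static void ksm_set(const char *knob, const char *val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", knob);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return;
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	ksm_set("pages_to_scan", "10000");	/* scan many pages per batch */
	ksm_set("sleep_millisecs", "0");	/* don't sleep between batches */
	ksm_set("run", "1");			/* start the ksmd scanner */
	return 0;
}

Note that ksmd only scans VMAs marked MADV_MERGEABLE; IIRC QEMU marks guest
memory mergeable by default ("-machine mem-merge=on").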
The bug has nothing to do with zapping only leafs, it's a simple goof where the
TLB flush gets lost. Not sure why only Hyper-V detects the issue; maybe because
it maintains a pool of zeroed pages that are KSM-friendly? I'll send a patch to
reintroduce the reverted code.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c71debdbc732..a641737725d1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -885,7 +885,7 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
 	struct kvm_mmu_page *root;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
-		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
 
 	return flush;
 }
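To spell out the goof for anyone skimming: the last parameter of
tdp_mmu_zap_leafs() is how the caller accumulates "a TLB flush is needed"
across roots, and hardcoding false throws away the result from every root but
the last one the loop visits. A standalone sketch of the pattern (hypothetical
names, not KVM code):

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for tdp_mmu_zap_leafs(): returns true if zapping this root
 * dropped a present mapping (i.e. a TLB flush is now required), OR'd with
 * whatever the caller has accumulated so far.
 */
static bool zap_root_leafs(bool needs_flush, bool flush)
{
	return flush || needs_flush;
}

int main(void)
{
	/* First root has live mappings to zap, second root is already empty. */
	bool roots_need_flush[] = { true, false };
	bool flush = false;
	int i;

	for (i = 0; i < 2; i++) {
		/*
		 * Buggy version: flush = zap_root_leafs(roots_need_flush[i], false);
		 * The second iteration overwrites the first one's "true", the
		 * caller never flushes, and stale TLB entries survive.
		 */
		flush = zap_root_leafs(roots_need_flush[i], flush);
	}

	printf("flush = %d\n", flush);	/* 1 with the fix, 0 with the bug */
	return 0;
}

The lost flush only bites when an earlier root needed a flush and a later one
didn't, which is presumably why it took KSM hammering the mmu_notifier to make
it readily reproducible.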