On Fri, Mar 25, 2022, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
> > Actually, since this is apparently specific to kvm_zap_gfn_range(), can you add
> > printk "tracing" in update_mtrr(), kvm_post_set_cr0(), and __kvm_request_apicv_update()
> > to see what is actually triggering zaps? Capturing the start and end GFNs would be very
> > helpful for the MTRR case.
> >
> > The APICv update seems unlikely to affect only Hyper-V guests, though there is the auto
> > EOI crud. And the other two only come into play with non-coherent DMA. In other words,
> > figuring out exactly what sequence leads to failure should be straightforward.
>
> The tricky part here is that Hyper-V doesn't crash immediately, the
> crash is always different (if you look at the BSOD) and happens at
> different times. Crashes mention various stuff like trying to execute
> non-executable memory, ...
>
> I've added the tracing you suggested:
>
> - __kvm_request_apicv_update() happens only once in the very beginning.

And thinking through this again, APICv changes should never result in a shadow
page being zapped, as they'll only zap a 4k range.

> - update_mtrr() never actually reaches kvm_zap_gfn_range()
>
> - kvm_post_set_cr0() happens in early boot but the crash happens much,
>   much later. E.g.:

Ah rats, I got the sequencing of the revert messed up. mmu_notifier is also in
play, via kvm_tdp_mmu_unmap_gfn_range().

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4fa4d8269e5b..db7c5a05e574 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -870,6 +870,8 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>
>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>  {
> +	trace_printk("vCPU %d %lx %lx\n", vcpu->vcpu_id, old_cr0, cr0);

This doesn't guarantee kvm_zap_gfn_range() will be reached. The guest has to have
non-coherent DMA and be running with the CD/NW memtype quirk disabled. Moving the
print inside the if statement would show whether KVM is actually zapping in those
cases:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3a9ce07a565..25c7d8fc3287 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -887,8 +887,10 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
 	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
-	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
+	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
+		trace_printk("vCPU %d %lx %lx\n", vcpu->vcpu_id, old_cr0, cr0);
 		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr0);

> The kvm_hv_set_msr_pw() call is when Hyper-V writes to HV_X64_MSR_CRASH_CTL
> (the 'hv-crash' QEMU flag is needed to enable the feature). The debug patch
> is:
>
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index a32f54ab84a2..59a72f6ced99 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1391,6 +1391,7 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
>
>  			/* Send notification about crash to user space */
>  			kvm_make_request(KVM_REQ_HV_CRASH, vcpu);
> +			trace_printk("%d\n", vcpu->vcpu_id);
>  		}
>  		break;
>  	case HV_X64_MSR_RESET:
>
> So it's 20 seconds (!) between the last kvm_post_set_cr0() call and the
> crash. My (disappointing) conclusion is: the problem can be anywhere, and
> Hyper-V detects it much, much later.

And reproduced... 'twas indeed the mmu_notifier. Hyper-V 2019 booted just fine,
until I turned on KSM and cranked up the scanning.
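For reference, "turning on KSM and cranking up the scanning" boils down to poking
the knobs under /sys/kernel/mm/ksm/. A minimal sketch of that kind of setup (the
helper and the specific values are illustrative, not the exact settings I used):

#include <stdio.h>

/* Write @val to the KSM sysfs knob @knob, e.g. /sys/kernel/mm/ksm/run. */
static void ksm_set(const char *knob, const char *val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", knob);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return;
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	ksm_set("pages_to_scan", "10000");	/* scan many pages per batch */
	ksm_set("sleep_millisecs", "0");	/* don't sleep between batches */
	ksm_set("run", "1");			/* start the ksmd scanner */
	return 0;
}

Note that ksmd only scans VMAs marked MADV_MERGEABLE; IIRC QEMU marks guest
memory mergeable by default ("-machine mem-merge=on").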
The bug has nothing to do with zapping only leafs, it's a simple goof where the
TLB flush gets lost. Not sure why only Hyper-V detects the issue; maybe because
it maintains a pool of zeroed pages that are KSM-friendly? I'll send a patch to
reintroduce the reverted code.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c71debdbc732..a641737725d1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -885,7 +885,7 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
 	struct kvm_mmu_page *root;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
-		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
 
 	return flush;
 }
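To spell out the goof for anyone skimming: the last parameter of
tdp_mmu_zap_leafs() is how the caller accumulates "a TLB flush is needed"
across roots, and hardcoding false throws away the result from every root but
the last one the loop visits. A standalone sketch of the pattern (hypothetical
names, not KVM code):

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for tdp_mmu_zap_leafs(): returns true if zapping this root
 * dropped a present mapping (i.e. a TLB flush is now required), OR'd with
 * whatever the caller has accumulated so far.
 */
static bool zap_root_leafs(bool needs_flush, bool flush)
{
	return flush || needs_flush;
}

int main(void)
{
	/* First root has live mappings to zap, second root is already empty. */
	bool roots_need_flush[] = { true, false };
	bool flush = false;
	int i;

	for (i = 0; i < 2; i++) {
		/*
		 * Buggy version: flush = zap_root_leafs(roots_need_flush[i], false);
		 * The second iteration overwrites the first one's "true", the
		 * caller never flushes, and stale TLB entries survive.
		 */
		flush = zap_root_leafs(roots_need_flush[i], flush);
	}

	printf("flush = %d\n", flush);	/* 1 with the fix, 0 with the bug */
	return 0;
}

The lost flush only bites when an earlier root needed a flush and a later one
didn't, which is presumably why it took KSM hammering the mmu_notifier to make
it readily reproducible.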