Re: Deadlock due to EPT_VIOLATION

On Tue, 8 Aug 2023, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Amaan Cheval wrote:
> > Hey Sean,
> > 
> > > If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting
> > > mmu_notifier events could theoretically stall a vCPU indefinitely.  The reason I
> > > dislike NUMA balancing is that it's all too easy to end up with subtle bugs
> > > and/or misconfigured setups where the NUMA balancing logic zaps PTEs/SPTEs without
> > > actually being able to move the page in the end, i.e. it's (IMO) too easy for
> > > NUMA balancing to get false positives when determining whether or not to try and
> > > migrate a page.
> > 
> > What are some situations where it might not be able to move the page in the end?
> 
> There's a pretty big list, see the "failure" paths of do_numa_page() and
> migrate_misplaced_page().
> 
> > > That said, it's definitely very unexpected that NUMA balancing would be zapping
> > > SPTEs to the point where a vCPU can't make forward progress.   It's theoretically
> > > possible that that's what's happening, but quite unlikely, especially since it
> > > sounds like you're seeing issues even with NUMA balancing disabled.
> > 
> > Yep, we're definitely seeing the issue occur even with numa_balancing
> > disabled, though the likelihood of it occurring has dropped significantly
> > since we disabled numa_balancing.
> 
> So the fact that this occurs with NUMA balancing disabled means the problem likely
> isn't with NUMA balancing itself.  NUMA balancing probably exposes some underlying
> issue due to it generating a near-constant stream of mmu_notifier invalidation.
> 
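[For readers following along: KVM's page-fault path snapshots a sequence count and rechecks it, together with an in-progress counter, before installing a SPTE; if an invalidation raced with the fault, the fault is retried. The toy Python model below is not KVM code, just a sketch of that protocol, showing why an in-progress count that is never decremented (e.g. a leaked refcount, as discussed later in the thread) would make a vCPU retry the same fault forever.]

```python
class MmuNotifierState:
    """Toy model of kvm->mmu_invalidate_seq / mmu_invalidate_in_progress."""

    def __init__(self):
        self.seq = 0           # bumped each time an invalidation completes
        self.in_progress = 0   # nonzero while an invalidation is in flight

    def invalidate_begin(self):
        self.in_progress += 1

    def invalidate_end(self):
        self.in_progress -= 1
        self.seq += 1


def fault_must_retry(state, snapshot_seq):
    # Sketch of the retry check: back off if an invalidation is in flight,
    # or if one completed since the fault snapshotted the sequence count.
    return state.in_progress > 0 or state.seq != snapshot_seq


state = MmuNotifierState()
snap = state.seq
assert not fault_must_retry(state, snap)    # no race: fault can be fixed

state.invalidate_begin()                    # an unbalanced "begin"...
assert fault_must_retry(state, state.seq)   # ...forces a retry, forever
```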
> > > More likely is that there is a bug somewhere that results in the mmu_notifier
> > > event refcount staying incorrectly elevated, but that type of bug shouldn't follow
> > > the VM across a live migration...
> > 
> > Good news! We managed to live migrate a guest and that did "fix it".

Does the VM make progress even if it is migrated to a kernel that exhibits 
the bug?

What were the kernel versions being migrated from and to?

For example, was it from a >5.19 kernel to something earlier than 5.19?

For example, if the hung VM remains stuck after migrating to a >5.19 
kernel but _not_ to a <5.19 kernel, then maybe bisect is an option.


--
Eric Wheeler




> ...
> 
> > A colleague also modified a host kernel with KFI (Kernel Function
> > Instrumentation) and wrote a kernel module that intercepts the vmexit
> > handler, handle_ept_violation, and does an EPT walk for each pfn, comparing
> > the results against /proc/iomem.
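[The comparison described above can be approximated in a few lines: parse the "System RAM" ranges out of /proc/iomem and test whether a given pfn falls inside one. A minimal sketch; the sample text below is illustrative, not from the affected host, and /proc/iomem itself is only readable with the full addresses as root.]

```python
def parse_system_ram(iomem_text):
    """Return (start_pfn, end_pfn) ranges for 'System RAM' lines of /proc/iomem."""
    ranges = []
    for line in iomem_text.splitlines():
        parts = line.split(":", 1)
        if len(parts) == 2 and parts[1].strip() == "System RAM":
            start, end = parts[0].strip().split("-")
            # /proc/iomem is byte-addressed; shift to 4 KiB page-frame numbers
            ranges.append((int(start, 16) >> 12, int(end, 16) >> 12))
    return ranges


def pfn_is_ram(pfn, ranges):
    return any(lo <= pfn <= hi for lo, hi in ranges)


sample = """\
00001000-0009ffff : System RAM
000a0000-000fffff : Reserved
00100000-bffdffff : System RAM
"""

ram = parse_system_ram(sample)
assert pfn_is_ram(0x100, ram)      # 0x100000 >> 12 lands in the second range
assert not pfn_is_ram(0xa0, ram)   # falls in the Reserved hole
```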
> > 
> > Assuming the EPT walking code is correct, we see this surprising result of a
> > PDPTE's pfn=0:
> 
> Not surprising.  The entire EPTE is zero, i.e. has been zapped by KVM.  This is
> exactly what is expected.
> 
> > Does this seem to indicate an mmu_notifier refcount issue to you, given that
> > migration did fix it? Any way to verify?
> 
> It doesn't move the needle either way, it just confirms what we already know: the
> vCPU is repeatedly taking !PRESENT faults.  The unexpected part is that KVM never
> "fixes" the fault and never outright fails.
> 
> > We haven't found any guests with `softlockup_panic=1` yet, and since we can't
> > reproduce the issue on command ourselves yet, we might have to wait a bit - but
> > I imagine that the fact that live migration "fixed" the locked up guest confirms
> > that the other guests that didn't get "fixed" were likely softlocked from the
> > CPU stalling?
> 
> Yes.
> 
> > If you have any suggestions on how modifying the host kernel (and then migrating
> > a locked up guest to it) or eBPF programs that might help illuminate the issue
> > further, let me know!
> > 
> > Thanks for all your help so far!
> 
> Since it sounds like you can test with a custom kernel, try running with this
> patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
> below expands said tracepoint to capture information about mmu_notifiers and
> memslots generation.  With luck, it will reveal a smoking gun.
> 
> ---
>  arch/x86/kvm/mmu/mmu.c          | 10 ----------
>  arch/x86/kvm/mmu/mmu_internal.h |  2 ++
>  arch/x86/kvm/mmu/tdp_mmu.h      | 10 ++++++++++
>  arch/x86/kvm/trace.h            | 28 ++++++++++++++++++++++++++--
>  4 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9e4cd8b4a202..122bfc884293 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2006,16 +2006,6 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
>  	return true;
>  }
>  
> -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> -{
> -	if (sp->role.invalid)
> -		return true;
> -
> -	/* TDP MMU pages do not use the MMU generation. */
> -	return !is_tdp_mmu_page(sp) &&
> -	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> -}
> -
>  struct mmu_page_path {
>  	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
>  	unsigned int idx[PT64_ROOT_MAX_LEVEL];
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index f1ef670058e5..cf7ba0abaa8f 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -6,6 +6,8 @@
>  #include <linux/kvm_host.h>
>  #include <asm/kvm_host.h>
>  
> +#include "mmu.h"
> +
>  #ifdef CONFIG_KVM_PROVE_MMU
>  #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
>  #else
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 0a63b1afabd3..a0d7c8acf78f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -76,4 +76,14 @@ static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu
>  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
>  #endif
>  
> +static inline bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> +	if (sp->role.invalid)
> +		return true;
> +
> +	/* TDP MMU pages do not use the MMU generation. */
> +	return !is_tdp_mmu_page(sp) &&
> +	       unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> +}
> +
>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 83843379813e..ff4a384ab03a 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -8,6 +8,8 @@
>  #include <asm/clocksource.h>
>  #include <asm/pvclock-abi.h>
>  
> +#include "mmu/tdp_mmu.h"
> +
>  #undef TRACE_SYSTEM
>  #define TRACE_SYSTEM kvm
>  
> @@ -405,6 +407,13 @@ TRACE_EVENT(kvm_page_fault,
>  		__field(	unsigned long,	guest_rip	)
>  		__field(	u64,		fault_address	)
>  		__field(	u64,		error_code	)
> +		__field(	unsigned long,  mmu_invalidate_seq)
> +		__field(	long,  mmu_invalidate_in_progress)
> +		__field(	unsigned long,  mmu_invalidate_range_start)
> +		__field(	unsigned long,  mmu_invalidate_range_end)
> +		__field(	bool,		root_is_valid)
> +		__field(	bool,		root_has_sp)
> +		__field(	bool,		root_is_obsolete)
>  	),
>  
>  	TP_fast_assign(
> @@ -412,11 +421,26 @@ TRACE_EVENT(kvm_page_fault,
>  		__entry->guest_rip	= kvm_rip_read(vcpu);
>  		__entry->fault_address	= fault_address;
>  		__entry->error_code	= error_code;
> +		__entry->mmu_invalidate_seq		= vcpu->kvm->mmu_invalidate_seq;
> +		__entry->mmu_invalidate_in_progress	= vcpu->kvm->mmu_invalidate_in_progress;
> +		__entry->mmu_invalidate_range_start	= vcpu->kvm->mmu_invalidate_range_start;
> +		__entry->mmu_invalidate_range_end	= vcpu->kvm->mmu_invalidate_range_end;
> +		__entry->root_is_valid			= VALID_PAGE(vcpu->arch.mmu->root.hpa);
> +		__entry->root_has_sp			= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
> +							  to_shadow_page(vcpu->arch.mmu->root.hpa);
> +		__entry->root_is_obsolete		= VALID_PAGE(vcpu->arch.mmu->root.hpa) &&
> +							  to_shadow_page(vcpu->arch.mmu->root.hpa) &&
> +							  is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root.hpa));
>  	),
>  
> -	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx",
> +	TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx, seq = 0x%lx, in_prog = 0x%lx, start = 0x%lx, end = 0x%lx, root = %s",
>  		  __entry->vcpu_id, __entry->guest_rip,
> -		  __entry->fault_address, __entry->error_code)
> +		  __entry->fault_address, __entry->error_code,
> +		  __entry->mmu_invalidate_seq, __entry->mmu_invalidate_in_progress,
> +		  __entry->mmu_invalidate_range_start, __entry->mmu_invalidate_range_end,
> +		  !__entry->root_is_valid ? "invalid" :
> +		  !__entry->root_has_sp ? "no shadow page" :
> +		  __entry->root_is_obsolete ? "obsolete" : "fresh")
>  );
>  
>  /*
> 
> base-commit: 240f736891887939571854bd6d734b6c9291f22e
> -- 
> 
> 


