Re: Deadlock due to EPT_VIOLATION

On Wed, Aug 09, 2023, Eric Wheeler wrote:
> On Wed, 9 Aug 2023, Eric Wheeler wrote:
> > On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > > On Tue, Aug 08, 2023, Amaan Cheval wrote:
> > > > Hey Sean,
> > > > 
> > > > > If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting
> > > > > mmu_notifier events could theoretically stall a vCPU indefinitely.  The reason I
> > > > > dislike NUMA balancing is that it's all too easy to end up with subtle bugs
> > > > > and/or misconfigured setups where the NUMA balancing logic zaps PTEs/SPTEs without
> > > > > actually being able to move the page in the end, i.e. it's (IMO) too easy for
> > > > > NUMA balancing to get false positives when determining whether or not to try and
> > > > > migrate a page.
> > > > 
> > > > What are some situations where it might not be able to move the page in the end?
> > > 
> > > There's a pretty big list, see the "failure" paths of do_numa_page() and
> > > migrate_misplaced_page().
> > > 
> > > > > That said, it's definitely very unexpected that NUMA balancing would be zapping
> > > > > SPTEs to the point where a vCPU can't make forward progress.  It's theoretically
> > > > > possible that that's what's happening, but quite unlikely, especially since it
> > > > > sounds like you're seeing issues even with NUMA balancing disabled.
> 
> Brak indicated that they've seen this as early as v5.19.  IIRC, Hunter
> said that v5.15 is working fine, so I went through the >v5.15 and <v5.19
> commit logs for KVM that appear to be related to EPT. Of course if the
> problem is outside of KVM, then this is moot, but maybe these are worth
> a second look.
> 
> Sean, could any of these commits cause or hint at the problem?

No, it's extremely unlikely any of these are related.  FWIW, my money is on this
being a bug in generic KVM or even outside of KVM, not a bug in KVM x86's MMU.
But I'm not confident enough to bet real money ;-)

>   54275f74c KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
> 	- this mentions !PRESENT related to faulting out of mmu_lock.
> 
>   ec283cb1d KVM: x86/mmu: remove ept_ad field
> 	- looks like a simple patch, but could there be a reason that
> 	  this is somehow invalid in corner cases?  Here is the relevant 
> 	  diff snippet:
> 
> 	+++ b/arch/x86/kvm/mmu/mmu.c
> 	@@ -5007,7 +5007,6 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> 	 
> 			context->shadow_root_level = level;
> 	 
> 	-               context->ept_ad = accessed_dirty;
> 
> 	+++ b/arch/x86/kvm/mmu/paging_tmpl.h
> 	-       #define PT_HAVE_ACCESSED_DIRTY(mmu) ((mmu)->ept_ad)
> 	+       #define PT_HAVE_ACCESSED_DIRTY(mmu) (!(mmu)->cpu_role.base.ad_disabled)
> 
>   ca2a7c22a KVM: x86/mmu: Derive EPT violation RWX bits from EPTE RWX bits
> 	- "No functional change intended" but it mentions EPT
> 	  violations.  Could something unintentional have happened here?
> 
>   4f4aa80e3 KVM: X86: Handle implicit supervisor access with SMAP
> 	- This is a small change, but maybe it would be worth a quick review
> 	
>   5b22bbe71 KVM: X86: Change the type of access u32 to u64
> 	- This is just a datatype change in 5.17-rc3, probably not it.


