Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly

On Wed, 2025-02-26 at 16:51 -0800, Sean Christopherson wrote:
> On Wed, Feb 26, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-02-25 at 16:50 -0800, Sean Christopherson wrote:
> > > On Tue, Feb 25, 2025, Maxim Levitsky wrote:
> > > What if we make the assertion user controllable?  I.e. let the user opt-out (or
> > > off-by-default and opt-in) via command line?  We did something similar for the
> > > rseq test, because the test would run far fewer iterations than expected if the
> > > vCPU task was migrated to CPU(s) in deep sleep states.
> > > 
> > > 	TEST_ASSERT(skip_sanity_check || i > (NR_TASK_MIGRATIONS / 2),
> > > 		    "Only performed %d KVM_RUNs, task stalled too much?\n\n"
> > > 		    "  Try disabling deep sleep states to reduce CPU wakeup latency,\n"
> > > 		    "  e.g. via cpuidle.off=1 or setting /dev/cpu_dma_latency to '0',\n"
> > > 		    "  or run with -u to disable this sanity check.", i);
> > > 
> > > This is quite similar, because as you say, it's impractical for the test to account
> > > for every possible environmental quirk.
> > 
> > No objections in principle, especially if the sanity check is skipped by default,
> > although this does sort of defeat the purpose of the check.
> > I guess the check might still be useful for developers.
> 
> A middle ground would be to enable the check by default if NUMA balancing is off.
> We can always revisit the default setting if it turns out there are other problematic
> "features".

That works for me.
I can send a patch for this then.
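
Roughly what I have in mind (just a sketch, not the actual patch; the helper name is
made up for illustration): read /proc/sys/kernel/numa_balancing and only enforce the
strict assertion when automatic NUMA balancing is off, unless the user explicitly
opts back in.

#include <stdbool.h>
#include <stdio.h>

/* Return true if /proc/sys/kernel/numa_balancing reports a non-zero mode. */
static bool numa_balancing_enabled(void)
{
	FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
	int val = 0;

	if (!f)
		return false;	/* Knob not present, balancing can't interfere. */

	if (fscanf(f, "%d", &val) != 1)
		val = 0;
	fclose(f);

	return val != 0;
}

The existing assertion would then be skipped (or downgraded to a warning) whenever
numa_balancing_enabled() returns true, with a command line flag to force the check
on regardless.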

> 
> > > > > Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> > > > > different node, and that causes NUMA balancing to go crazy and zap pretty much
> > > > > all of guest memory.  If that's what's happening, then a better solution for the
> > > > > NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> > > > > pin it to a single pCPU?).
> > > > 
> > > > Nope. I pinned the main thread to CPU 0 and the VM thread to CPU 1 and the
> > > > problem persists.  On 6.13, the only way to make the test consistently work
> > > > is to disable NUMA balancing.
> > > 
> > > Well that's odd.  While I'm quite curious as to what's happening,
> 
> Gah, chatting about this offline jogged my memory.  NUMA balancing doesn't zap
> (mark PROT_NONE/PROT_NUMA) PTEs for pages the kernel thinks are being accessed
> remotely, it zaps PTEs to see if they're being accessed remotely.  So yeah,
> whenever NUMA balancing kicks in, the guest will see a large amount of its memory
> get re-faulted.
> 
> Which is why it's such a terrible feature to pair with KVM, at least as-is.  NUMA
> balancing is predicated on inducing and resolving the #PF being relatively cheap,
> but that doesn't hold true for secondary MMUs due to the coarse nature of mmu_notifiers.
> 

I think so too.  VM exits aren't that cheap either, and the general direction is to
avoid them as much as possible, yet this feature pretty much yanks the memory out
from under the guest every two seconds or so, causing lots of VM exits.

Best regards,
	Maxim Levitsky




