Re: [PATCH] KVM: x86: Fast forward the iterator when zapping the TDP MMU

Bernhard Kauer <bk@xxxxxxxxx> · Thu, 24 Oct 2024 12:25:08 +0200

On Wed, Oct 23, 2024 at 04:27:50PM -0700, Sean Christopherson wrote:
> On Wed, Oct 23, 2024, Bernhard Kauer wrote:
> > Zapping a root means scanning for present entries in a page-table
> > hierarchy. This process is relatively slow since it needs to be
> > preemptible as millions of entries might be processed.
> > 
> > Furthermore the root-page is traversed multiple times as zapping
> > is done with increasing page-sizes.
> > 
> > Optimizing for the not-present case speeds up the hello microbenchmark
> > by 115 microseconds.
> 
> What is the "hello" microbenchmark?  Do we actually care if it's faster?

Hello is a tiny kernel that just outputs "Hello world!" over a virtual
serial port and then shuts the VM down.  It is a minimal test case that
reveals performance bottlenecks that are hard to see in the noise of a
big system.

Does it matter?  The case I optimized might only be relevant for
short-running virtual machines.  However, you found more users of
the iterator that might benefit from it.

> Are you able to determine exactly what makes iteration slow? 

I've counted the loop iterations and the number of entries removed per
zap root(root, level) pass:

	[24661.896626] zap root(0, 1) loops 3584 entries 2
	[24661.896655] zap root(0, 2) loops 2048 entries 3
	[24661.896709] zap root(0, 3) loops 1024 entries 2
	[24661.896750] zap root(0, 4) loops 512 entries 1
	[24661.896812] zap root(1, 1) loops 512 entries 0
	[24661.896856] zap root(1, 2) loops 512 entries 0
	[24661.896895] zap root(1, 3) loops 512 entries 0
	[24661.896938] zap root(1, 4) loops 512 entries 0

So for this simple case one needs 9216 iterations to go through 18 page
tables with 512 entries each.  My patch reduces this to 303 iterations:

	[24110.032368] zap root(0, 1) loops 118 entries 2
	[24110.032374] zap root(0, 2) loops 69 entries 3
	[24110.032419] zap root(0, 3) loops 35 entries 2
	[24110.032421] zap root(0, 4) loops 17 entries 1
	[24110.032434] zap root(1, 1) loops 16 entries 0
	[24110.032435] zap root(1, 2) loops 16 entries 0
	[24110.032437] zap root(1, 3) loops 16 entries 0
	[24110.032438] zap root(1, 4) loops 16 entries 0

The patch saves 115 microseconds by skipping 9216 - 303 = 8913 iterations,
so one loop iteration costs roughly 13 nanoseconds.  With the updates to
the iterator and the various checks this sounds reasonable to me.
Simplifying the inner loop should help here.
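
For anyone who wants to re-check the arithmetic, here is a small
userspace C sketch.  The loop counts are copied from the two logs above.
The helper names (sum_loops, ns_per_iteration) are made up for this mail,
and the result is only an average that assumes the whole 115
microseconds is due to the skipped iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Loop counts per zap pass, copied from the logs (before and after the patch). */
static const unsigned int loops_before[] = { 3584, 2048, 1024, 512, 512, 512, 512, 512 };
static const unsigned int loops_after[]  = {  118,   69,   35,  17,  16,  16,  16,  16 };

/* Sum the iterations of all passes. */
static unsigned int sum_loops(const unsigned int *loops, size_t n)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += loops[i];
	return total;
}

/*
 * Average cost of one skipped iteration in nanoseconds.
 * Assumption: all of saved_us is attributable to the skipped iterations.
 */
static double ns_per_iteration(double saved_us, unsigned int before,
			       unsigned int after)
{
	assert(before > after);
	return saved_us * 1000.0 / (double)(before - after);
}
```

Calling sum_loops() on the two arrays gives the totals quoted above, and
ns_per_iteration(115.0, 9216, 303) lands just below 13.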

> partly because maybe there's a more elegant solution.

Scanning can be avoided if one keeps track of the used entries.
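
To illustrate what I mean, here is a minimal userspace sketch, not KVM
code.  It assumes a per-table bitmap of used entries that is updated on
every write, so zapping visits only the used slots.  struct tracked_pt
and the pt_*() helpers are hypothetical names, "non-zero" stands in for
"present", and __builtin_ctzll is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stdint.h>

#define PT_ENTRIES 512
#define PT_WORDS   (PT_ENTRIES / 64)

/* Hypothetical model of one page table; not a KVM data structure. */
struct tracked_pt {
	uint64_t entry[PT_ENTRIES];	/* 0 means "not present" in this model */
	uint64_t used[PT_WORDS];	/* bit i set <=> entry[i] is non-zero */
	unsigned int nr_used;		/* number of bits set in used[] */
};

/* Write one entry and keep the bitmap and counter in sync. */
static void pt_set(struct tracked_pt *pt, unsigned int idx, uint64_t val)
{
	uint64_t bit;
	int was_used;

	assert(idx < PT_ENTRIES);
	bit = 1ULL << (idx % 64);
	was_used = (pt->used[idx / 64] & bit) != 0;
	pt->entry[idx] = val;
	if (val && !was_used) {
		pt->used[idx / 64] |= bit;
		pt->nr_used++;
	} else if (!val && was_used) {
		pt->used[idx / 64] &= ~bit;
		pt->nr_used--;
	}
}

/* Index of the first used entry >= start, or PT_ENTRIES if there is none. */
static unsigned int pt_next_used(const struct tracked_pt *pt, unsigned int start)
{
	unsigned int word;
	uint64_t bits;

	if (start >= PT_ENTRIES)
		return PT_ENTRIES;
	word = start / 64;
	bits = pt->used[word] & (~0ULL << (start % 64));
	for (;;) {
		if (bits)
			return word * 64 + (unsigned int)__builtin_ctzll(bits);
		if (++word == PT_WORDS)
			return PT_ENTRIES;
		bits = pt->used[word];
	}
}

/* Clear every used entry; returns the number of loop iterations needed. */
static unsigned int pt_zap_all(struct tracked_pt *pt)
{
	unsigned int loops = 0;
	unsigned int idx;

	for (idx = pt_next_used(pt, 0); idx < PT_ENTRIES;
	     idx = pt_next_used(pt, idx + 1)) {
		pt_set(pt, idx, 0);
		loops++;
	}
	return loops;
}
```

In this model a table with three used entries is zapped in three
iterations instead of 512.  The price is extra bookkeeping on every
entry update, and a real implementation would also have to cope with
concurrent updates, which the sketch ignores.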

> Regardless of why iteration is slow, I would much prefer to solve this for all
> users of the iterator.  E.g. very lightly tested, and not 100% optimized (though
> should be on par with the below).

Makes sense. I tried it out and it is a bit slower. One can optimize
the while loop in try_side_step() a bit further.