Re: [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP

Peter Xu <peterx@xxxxxxxxxx> · Tue, 12 Jul 2022 20:24:39 -0400

On Tue, Jul 12, 2022 at 10:53:48PM +0000, Sean Christopherson wrote:
> On Tue, Jul 12, 2022, Peter Xu wrote:
> > On Fri, Jun 24, 2022 at 11:27:34PM +0000, Sean Christopherson wrote:
> > > Dynamically size struct pte_list_desc's array of sptes based on whether
> > > or not KVM is using TDP.  Commit dc1cff969101 ("KVM: X86: MMU: Tune
> > > PTE_LIST_EXT to be bigger") bumped the number of entries in order to
> > > improve performance when using shadow paging, but its analysis that the
> > > larger size would not affect TDP was wrong.  Consuming pte_list_desc
> > > objects for nested TDP is indeed rare, but _allocating_ objects is not,
> > > as KVM allocates 40 objects for each per-vCPU cache.  Reducing the size
> > > from 128 bytes to 32 bytes reduces that per-vCPU cost from 5120 bytes to
> > > 1280, and also provides similar savings when eager page splitting for
> > > nested MMUs kicks in.
> > > 
> > > The per-vCPU overhead could be further reduced by using a custom, smaller
> > > capacity for the per-vCPU caches, but that's more of an "and" than
> > > an "or" change, e.g. it wouldn't help the eager page split use case.
> > > 
> > > Set the list size to the bare minimum without completely defeating the
> > > purpose of an array (and because pte_list_add() assumes the array is at
> > > least two entries deep).  A larger size, e.g. 4, would reduce the number
> > > of "allocations", but those "allocations" only become allocations in
> > > truth if a single vCPU depletes its cache to where a topup is needed,
> > > i.e. if a single vCPU "allocates" 30+ lists.  Conversely, those 2 extra
> > > entries consume 16 bytes * 40 * nr_vcpus in the caches the instant nested
> > > TDP is used.
> > > 
> > > In the unlikely event that performance of aliased gfns for nested TDP
> > > really is (or becomes) a priority for oddball workloads, KVM could add a
> > > knob to let the admin tune the array size for their environment.
> > > 
> > > Note, KVM also unnecessarily tops up the per-vCPU caches even when not
> > > using rmaps; this can also be addressed separately.
> > 
> > The only possible way of using pte_list_desc when tdp=1 is when the
> > hypervisor tries to map the same host pages with different GPAs?
> 
> Yes, if by "host pages" you mean L1 GPAs.  It happens if the L1 VMM maps multiple
> L2 GFNs to a single L1 GFN, in which case KVM's nTDP shadow MMU needs to rmap
> that single L1 GFN to multiple L2 GFNs.
> 
> > And we don't really have a real use case of that, or.. do we?
> 
> QEMU does it during boot/pre-boot when BIOS remaps the flash region into the lower
> 1mb, i.e. aliases high GPAs to low GPAs.
> 
> > Sorry to start with asking questions, it's just that if we know that
> > pte_list_desc is probably not gonna be used then could we simply skip the
> > cache layer as a whole?  IOW, we don't make the "array size of pte list
> > desc" dynamic, instead we make the whole "pte list desc cache layer"
> > dynamic.  Is it possible?
> 
> Not really?  It's theoretically possible, but it'd require pre-checking that aren't
> aliases, and to do that race free we'd have to do it under mmu_lock, which means
> having to support bailing from the page fault to topup the cache.  The memory
> overhead for the cache isn't so significant that it's worth that level of complexity.

Ah, okay..

So the other question is I'm curious how fundamentally this extra
complexity could help us to save spaces.

The thing is IIUC slub works in page sizes, so at least one slub cache eats
one page which is 4096 anyway.  In our case if there was 40 objects
allocated for 14 entries array, are you sure it'll still be 40 objects but
only smaller?  I'd thought after the change each obj is smaller but slub
could have cached more objects since min slub size is 4k for x86.

I don't remember the details of the eager split work on having per-vcpu
caches, but I'm also wondering if we cannot drop the whole cache layer
whether we can selectively use slub in this case, then we can cache much
less assuming we will use just less too.

Currently:

	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);

We could have the pte list desc cache layer to be managed manually
(e.g. using kmalloc()?) for tdp=1, then we'll at least in control of how
many objects we cache?  Then with a limited number of objects, the wasted
memory is much reduced too.

I think I'm fine with current approach too, but only if it really helps
reduce memory footprint as we expected.

Thanks,

-- 
Peter Xu