On Tue, Jul 12, 2022, Peter Xu wrote:
> On Tue, Jul 12, 2022 at 10:53:48PM +0000, Sean Christopherson wrote:
> > On Tue, Jul 12, 2022, Peter Xu wrote:
> > > On Fri, Jun 24, 2022 at 11:27:34PM +0000, Sean Christopherson wrote:
> > > Sorry to start with asking questions, it's just that if we know that
> > > pte_list_desc is probably not gonna be used then could we simply skip
> > > the cache layer as a whole?  IOW, we don't make the "array size of pte
> > > list desc" dynamic, instead we make the whole "pte list desc cache
> > > layer" dynamic.  Is it possible?
> >
> > Not really?  It's theoretically possible, but it'd require pre-checking
> > that there aren't aliases, and to do that race free we'd have to do it
> > under mmu_lock, which means having to support bailing from the page
> > fault to top up the cache.  The memory overhead for the cache isn't so
> > significant that it's worth that level of complexity.
>
> Ah, okay.
>
> So the other question is I'm curious how fundamentally this extra
> complexity could help us save space.
>
> The thing is, IIUC slub works in page sizes, so at least one slub cache
> eats one page, which is 4096 bytes anyway.  In our case, if there were
> 40 objects allocated for the 14-entry array, are you sure it'll still be
> 40 objects, only smaller?

Definitely not 100% positive.

> I'd thought after the change each object is smaller, but slub could have
> cached more objects, since the minimum slub size is 4k for x86.
>
> I don't remember the details of the eager split work on having per-vcpu

The eager split logic uses a single per-VM cache, but it's large (513
entries).

> caches, but I'm also wondering, if we cannot drop the whole cache layer,
> whether we can selectively use slub in this case; then we can cache much
> less, assuming we will use just less too.
>
> Currently:
>
> 	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> 				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
>
> We could have the pte list desc cache layer be managed manually
> (e.g. using kmalloc()?) for tdp=1, then we'll at least be in control of
> how many objects we cache.  Then, with a limited number of objects, the
> wasted memory is much reduced too.

I suspect that, without implementing something that looks an awful lot
like the kmem caches, manually handling allocations would degrade
performance for shadow paging and nested MMUs.

> I think I'm fine with the current approach too, but only if it really
> helps reduce memory footprint as we expected.

Yeah, I'll get numbers before sending v2 (which will be quite some time at
this point).
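
To make the slab-packing question above concrete, here is a minimal
userspace sketch (not KVM code) of the arithmetic.  It assumes the struct
layout roughly mirrors KVM's pte_list_desc at the time, i.e. a 'more'
pointer and a u32 count (padded to 8 bytes on x86-64) followed by the
sptes[] array, and it ignores slab alignment, rounding, and per-slab
metadata.  The desc_size() helper and the PAGE_SIZE macro are local
stand-ins for illustration, not kernel APIs.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096

    /*
     * Approximate bytes per pte_list_desc for an sptes[] array of 'ext'
     * entries: a 'more' pointer plus a u32 count (modeled as 8 bytes to
     * account for padding on x86-64), followed by 'ext' spte pointers.
     * Slab alignment and per-slab metadata are deliberately ignored.
     */
    static size_t desc_size(size_t ext)
    {
            return sizeof(void *) + sizeof(uint64_t) +
                   ext * sizeof(uint64_t *);
    }

    int main(void)
    {
            size_t ext;

            for (ext = 1; ext <= 14; ext++) {
                    size_t sz = desc_size(ext);

                    printf("PTE_LIST_EXT=%2zu: %3zu bytes/obj, "
                           "%3zu objs per 4KiB slab page\n",
                           ext, sz, (size_t)PAGE_SIZE / sz);
            }
            return 0;
    }

Under those assumptions, a 14-entry descriptor is 128 bytes (32 objects
per 4 KiB slab page) while a 1-entry descriptor is 24 bytes (170 per
page), so e.g. 40 live descriptors would drop from two slab pages to one:
the object count stays the same, but the page footprint shrinks.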