Re: [PATCH] KVM: x86/mmu: Make page tables for eager page splitting NUMA aware

David Matlack <dmatlack@xxxxxxxxxx> · Tue, 9 Aug 2022 09:52:10 -0700

On Fri, Aug 5, 2022 at 4:30 PM Vipin Sharma <vipinsh@xxxxxxxxxx> wrote:
[...]
>
> Here are the two approaches, please provide feedback on which one
> looks more appropriate before I start spamming your inbox with my
> patches
>
> Approach A:
> Have per numa node cache for page table pages allocation
>
> Instead of having only one mmu_shadow_page_cache per vcpu, we provide
> multiple caches for each node
>
> either:
> mmu_shadow_page_cache[MAX_NUMNODES]
> or
> mmu_shadow_page_cache->objects[MAX_NUMNODES * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE]

I think the former approach will work better. The objects[] array is
allocated dynamically during top-up now, so if a vCPU never allocates
a page table to map memory on a given node, KVM will never have to
allocate an objects[] array for that node. Whereas with the latter
approach KVM would have to allocate the entire objects[] array
up-front.

>
> We can decrease KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE to some lower value
> instead of 40 to control memory consumption.

I'm not sure we are getting any performance benefit from the cache
size being so high. It doesn't fundamentally change the number of
times a vCPU thread will have to call __get_free_page(), it just
batches more of those calls together. Assuming reducing the cache size
doesn't impact performance, I think it's a good idea to reduce it as
part of this feature.

KVM needs at most PT64_ROOT_MAX_LEVEL (5) page tables to handle a
fault. So we could decrease the mmu_shadow_page_cache.objects[]
capacity to PT64_ROOT_MAX_LEVEL (5) and support up to 8 NUMA nodes
without increasing memory usage. If a user wants to run a VM on an
even larger machine, I think it's safe to consume a few extra bytes
for the vCPU shadow page caches at that point (the machine probably
has 10s of TiB of RAM).

>
> When a fault happens, use the pfn to find which node the page should
> belong to and use the corresponding cache to get page table pages.
>
> struct *page = kvm_pfn_to_refcounted_page(pfn);
> int nid;
> if(page) {
>       nid = page_to_nid(page);
> } else {
>      nid = numa_node_id();
> }
>
> ...
> tdp_mmu_alloc_sp(nid, vcpu);
> ...
>
> static struct kvm_mmu_page *tdp_mmu_alloc_sp(int nid, struct kvm_vcpu *vcpu) {
> ...
>       sp->spt = kvm_mmu_memory_cache_alloc(nid,
> &vcpu->arch.mmu_shadow_page_cache);
> ...
> }
>
>
> Since we are changing cache allocation for page table pages, should we
> also make similar changes for other caches like mmu_page_header_cache,
> mmu_gfn_array_cache, and mmu_pte_list_desc_cache? I am not sure how
> good this idea is.

We don't currently have a reason to make these objects NUMA-aware, so
I would only recommend it if it somehow makes the code a lot simpler.

>
> Approach B:
> Ask page from the specific node on fault path with option to fallback
> to the original cache and default task policy.
>
> This is what Sean's rough patch looks like.

This would definitely be a simpler approach but could increase the
amount of time a vCPU thread holds the MMU lock when handling a fault,
since KVM would start performing GFP_NOWAIT allocations under the
lock. So my preference would be to try the cache approach first and
see how complex it turns out to be.