On Fri, Aug 5, 2022 at 4:30 PM Vipin Sharma <vipinsh@xxxxxxxxxx> wrote: [...] > > Here are the two approaches, please provide feedback on which one > looks more appropriate before I start spamming your inbox with my > patches > > Approach A: > Have per numa node cache for page table pages allocation > > Instead of having only one mmu_shadow_page_cache per vcpu, we provide > multiple caches for each node > > either: > mmu_shadow_page_cache[MAX_NUMNODES] > or > mmu_shadow_page_cache->objects[MAX_NUMNODES * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE] I think the former approach will work better. The objects[] array is allocated dynamically during top-up now, so if a vCPU never allocates a page table to map memory on a given node, KVM will never have to allocate an objects[] array for that node. Whereas with the latter approach KVM would have to allocate the entire objects[] array up-front. > > We can decrease KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE to some lower value > instead of 40 to control memory consumption. I'm not sure we are getting any performance benefit from the cache size being so high. It doesn't fundamentally change the number of times a vCPU thread will have to call __get_free_page(), it just batches more of those calls together. Assuming reducing the cache size doesn't impact performance, I think it's a good idea to reduce it as part of this feature. KVM needs at most PT64_ROOT_MAX_LEVEL (5) page tables to handle a fault. So we could decrease the mmu_shadow_page_cache.objects[] capacity to PT64_ROOT_MAX_LEVEL (5) and support up to 8 NUMA nodes without increasing memory usage. If a user wants to run a VM on an even larger machine, I think it's safe to consume a few extra bytes for the vCPU shadow page caches at that point (the machine probably has 10s of TiB of RAM). > > When a fault happens, use the pfn to find which node the page should > belong to and use the corresponding cache to get page table pages. > > struct *page = kvm_pfn_to_refcounted_page(pfn); > int nid; > if(page) { > nid = page_to_nid(page); > } else { > nid = numa_node_id(); > } > > ... > tdp_mmu_alloc_sp(nid, vcpu); > ... > > static struct kvm_mmu_page *tdp_mmu_alloc_sp(int nid, struct kvm_vcpu *vcpu) { > ... > sp->spt = kvm_mmu_memory_cache_alloc(nid, > &vcpu->arch.mmu_shadow_page_cache); > ... > } > > > Since we are changing cache allocation for page table pages, should we > also make similar changes for other caches like mmu_page_header_cache, > mmu_gfn_array_cache, and mmu_pte_list_desc_cache? I am not sure how > good this idea is. We don't currently have a reason to make these objects NUMA-aware, so I would only recommend it if it somehow makes the code a lot simpler. > > Approach B: > Ask page from the specific node on fault path with option to fallback > to the original cache and default task policy. > > This is what Sean's rough patch looks like. This would definitely be a simpler approach but could increase the amount of time a vCPU thread holds the MMU lock when handling a fault, since KVM would start performing GFP_NOWAIT allocations under the lock. So my preference would be to try the cache approach first and see how complex it turns out to be.