On Thu, Dec 29, 2022 at 2:30 PM David Matlack <dmatlack@xxxxxxxxxx> wrote:
>
> On Wed, Dec 21, 2022 at 06:34:53PM -0800, Vipin Sharma wrote:
> > When dirty log is enabled, huge pages are split. Page table's pages
> > during the split are allocated based on the current thread NUMA node or
> > mempolicy. This causes inefficient page table accesses if the underlying
> > page is on a different NUMA node.
> >
> > Allocate page table's pages on the same NUMA node as the underlying huge
> > page when dirty log is enabled and huge pages are split.
> >
> > The performance gain during the pre-copy phase of live migrations of a
> > 416 vCPU and 11 TiB memory VM on an 8 node host was seen in the range
> > of 130% to 150%.
>
> Can you be more specific about this? "The performance" is vague. I know
> it's an internal workload and fully explaining it would be difficult,
> but you can give readers a slightly more specific idea of what improved,
> e.g.:
>
>   When testing with a synthetic write-heavy workload in a 416 vCPU VM on
>   an 8 NUMA node host, the throughput increased by 150% from X to Y
>   operations per second.
>
> It's also necessary to characterize the improvement relative to the
> performance when dirty logging is not enabled. Without that information
> it would be hard for an unfamiliar reader to understand how useful this
> change really is.
>
> For example, let's say the throughput of your workload is 100,000
> operations per second before dirty logging is enabled, and that drops
> down to 1,000 operations per second after dirty logging is enabled. This
> commit could increase that by 150% to 2,500 operations per second, but
> that's actually not a very meaningful improvement since, either way,
> guest performance is degraded by 95+% during dirty logging.
>
> On the other hand, if performance goes from 100,000 to 30,000 normally,
> and this commit increases that 30,000 to 75,000 (150%), that's a much
> more meaningful improvement.
>

Yeah, I will provide more insight in the next version.
> >
> > Suggested-by: David Matlack <dmatlack@xxxxxxxxxx>
> > Signed-off-by: Vipin Sharma <vipinsh@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
> >  include/linux/kvm_host.h   | 18 ++++++++++++++++++
> >  2 files changed, 26 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 4974fa96deff..376b8dceb3f9 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1403,7 +1403,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> >          return spte_set;
> >  }
> >
> > -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> > +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
> >  {
> >          struct kvm_mmu_page *sp;
> >
> > @@ -1413,7 +1413,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> >          if (!sp)
> >                  return NULL;
> >
> > -        sp->spt = (void *)__get_free_page(gfp);
> > +        sp->spt = kvm_mmu_get_free_page(nid, gfp);
> > +
> >          if (!sp->spt) {
> >                  kmem_cache_free(mmu_page_header_cache, sp);
> >                  return NULL;
> > @@ -1427,6 +1428,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> >                                                   bool shared)
> >  {
> >          struct kvm_mmu_page *sp;
> > +        int nid;
> > +
> > +        nid = kvm_pfn_to_page_table_nid(spte_to_pfn(iter->old_spte));
> >
> >          /*
> >           * Since we are allocating while under the MMU lock we have to be
> > @@ -1437,7 +1441,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> >           * If this allocation fails we drop the lock and retry with reclaim
> >           * allowed.
> >           */
> > -        sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> > +        sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
> >          if (sp)
> >                  return sp;
> >
> > @@ -1449,7 +1453,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> >          write_unlock(&kvm->mmu_lock);
> >
> >          iter->yielded = true;
> > -        sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
> > +        sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
> >
> >          if (shared)
> >                  read_lock(&kvm->mmu_lock);
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index d48064503b88..a262e15ebd19 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1583,6 +1583,24 @@ void kvm_arch_sync_events(struct kvm *kvm);
> >  int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
> >
> >  struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
> > +
> > +/*
> > + * Tells the appropriate NUMA node location of the page table's page based on
> > + * pfn it will point to.
>
> I know what you are trying to say but the wording is a bit awkward. e.g.
> "Tells" instead of "Returns", "location" is redundant, "page table's
> page", etc. Suggest this:
>
>   /*
>    * Returns an appropriate NUMA node on which to allocate a page table that
>    * maps @pfn.
>    */
>
> > + *
> > + * Return the nid of the page if pfn is valid and backed by a refcounted page,
> > + * otherwise, return the nearest memory node for the current CPU.
>
> I would just drop this as it's just restating the code, which is already
> very readable.

Okay.

> > + */
> > +static inline int kvm_pfn_to_page_table_nid(kvm_pfn_t pfn)
> > +{
> > +        struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > +
> > +        if (page)
> > +                return page_to_nid(page);
> > +        else
> > +                return numa_mem_id();
> > +}
> > +
> >  bool kvm_is_zone_device_page(struct page *page);
> >
> >  struct kvm_irq_ack_notifier {
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
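
For context, the tdp_mmu.c hunk above calls kvm_mmu_get_free_page(), which is
introduced elsewhere in this series and is not shown in the quoted patch. A
rough sketch of what a NUMA-node-targeted allocation helper along those lines
could look like (the name, placement, and details here are assumptions, not
the exact code from the series):

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical sketch: allocate a single page table page on the requested
 * NUMA node, falling back to the regular allocator when NUMA support is not
 * compiled in. The actual helper in the series may differ.
 */
void *kvm_mmu_get_free_page(int nid, gfp_t gfp)
{
#ifdef CONFIG_NUMA
        struct page *page;

        /* Ask the buddy allocator for an order-0 page on node @nid. */
        page = alloc_pages_node(nid, gfp, 0);
        if (!page)
                return NULL;

        return page_address(page);
#else
        return (void *)__get_free_page(gfp);
#endif
}

With a helper like this, tdp_mmu_alloc_sp_for_split() can pass the node ID
derived from kvm_pfn_to_page_table_nid() so that the page tables created by
the split land on the same node as the huge page they map.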