On Wed, Dec 28, 2022 at 2:08 PM Vipin Sharma <vipinsh@xxxxxxxxxx> wrote:
>
> On Tue, Dec 27, 2022 at 11:34 AM Ben Gardon <bgardon@xxxxxxxxxx> wrote:
> >
> > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@xxxxxxxxxx> wrote:
> > >
> > > Page table pages of a VM are currently allocated based on the current
> > > task's NUMA node or its mempolicy. This can cause suboptimal remote
> > > accesses by the vCPU if it is accessing physical pages local to its NUMA
> > > node but the page table pages mapping those physical pages were created
> > > by some other vCPU which was on a different NUMA node or had a different
> > > policy.
> > >
> > > Allocate page table pages on the same NUMA node where the underlying
> > > physical page exists. Page tables at levels 5, 4, and 3 might not end up
> > > on the same NUMA node as they can span multiple NUMA nodes.
> >
> > A page table at any level could map memory spanning multiple NUMA
> > nodes, it just becomes more likely at higher levels.
> > We're only guaranteed that a page table maps memory all on the same
> > node if it's a split hugepage.
> > Even in this case, it is a best effort.
> > This change can only guarantee that the page table pages are allocated
> > on the same node as at least some of the memory they map.
> > Of course in practice, the above is absolutely correct since we'd
> > expect to have multi-GB contiguous ranges of GFNs allocated on the
> > same node via huge pages.
> >
> > And since the root pages are allocated based only on where the thread
> > allocating them is running, they're not actually guaranteed to be on
> > the same node as any of the memory they map. (Though they probably
> > will be.)
> >
>
> I will add more details in the commit in the next version.
>
> > >
> > > Signed-off-by: Vipin Sharma <vipinsh@xxxxxxxxxx>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  2 +-
> > >  arch/x86/kvm/mmu/mmu.c          | 63 ++++++++++++++++++++++-----------
> > >  arch/x86/kvm/mmu/paging_tmpl.h  |  4 +--
> > >  arch/x86/kvm/mmu/tdp_mmu.c      | 11 +++---
> > >  virt/kvm/kvm_main.c             |  2 +-
> > >  5 files changed, 53 insertions(+), 29 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 293994fabae3..b1f319ad6f89 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -782,7 +782,7 @@ struct kvm_vcpu_arch {
> > >         struct kvm_mmu *walk_mmu;
> > >
> > >         struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > > -       struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > > +       struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
> > >         struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > >         struct kvm_mmu_memory_cache mmu_page_header_cache;
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 23a3b82b2384..511c6ef265ee 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -677,24 +677,29 @@ static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > >
> > >  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > >  {
> > > -       int r;
> > > +       int r, nid;
> > >
> > >         /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> > >         r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > >                                        1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > >         if (r)
> > >                 return r;
> > > -       r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > -                                     &vcpu->arch.mmu_shadow_page_cache_lock,
> > > -                                     PT64_ROOT_MAX_LEVEL);
> > > -       if (r)
> > > -               return r;
> > > +
> > > +       for_each_online_node(nid) {
> > > +               r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> > > +                                             &vcpu->arch.mmu_shadow_page_cache_lock,
> > > +                                             PT64_ROOT_MAX_LEVEL);
> > > +               if (r)
> > > +                       return r;
> > > +       }
> > > +
> > >         if (maybe_indirect) {
> > >                 r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
> > >                                                PT64_ROOT_MAX_LEVEL);
> > >                 if (r)
> > >                         return r;
> > >         }
> > > +
> > >         return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> > >                                           PT64_ROOT_MAX_LEVEL);
> > >  }
> > > @@ -715,9 +720,14 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > >
> > >  static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > >  {
> > > +       int nid;
> > > +
> > >         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > > -       mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > -                                &vcpu->arch.mmu_shadow_page_cache_lock);
> > > +
> > > +       for_each_node(nid)
> > > +               mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> > > +                                        &vcpu->arch.mmu_shadow_page_cache_lock);
> > > +
> >
> > Was just trying to think if there could be any issue with memory
> > leakage if the online nodes changed, though IDK if any hardware does
> > that.
> > Still, it might be more robust to use ARRAY_SIZE and cover the whole array.
>
> for_each_node() goes through all of the possible nodes on the system,
> whereas for_each_online_node() goes through only online nodes.
> Current code seems right to me, let me know if I am overlooking
> something.

Ah okay, I didn't see the distinction. That sounds good to me.
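
(For anyone else reading along, the distinction, paraphrasing from
include/linux/nodemask.h -- double-check against your tree, the exact
form depends on the NUMA config:

    #define for_each_node(node)        for_each_node_state(node, N_POSSIBLE)
    #define for_each_online_node(node) for_each_node_state(node, N_ONLINE)

So topping up with for_each_online_node() and freeing with
for_each_node() means the free path still reaches a cache whose node
went offline after it was topped up.)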
> > >
> > >         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> > >         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > >  }
> > > @@ -2256,11 +2266,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
> > >
> > >  static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> > >                                                      gfn_t gfn,
> > > -                                                    union kvm_mmu_page_role role)
> > > +                                                    union kvm_mmu_page_role role,
> > > +                                                    int nid)
> > >  {
> > >         struct shadow_page_caches caches = {
> > >                 .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> > > -               .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> > > +               .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid],
> > >                 .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> > >                 .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> > >         };
> > > @@ -2316,15 +2327,19 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
> > >
> > >  static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
> > >                                                   u64 *sptep, gfn_t gfn,
> > > -                                                 bool direct, unsigned int access)
> > > +                                                 bool direct, unsigned int access,
> > > +                                                 kvm_pfn_t pfn)
> > >  {
> > >         union kvm_mmu_page_role role;
> > > +       int nid;
> > >
> > >         if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
> > >                 return ERR_PTR(-EEXIST);
> > >
> > >         role = kvm_mmu_child_role(sptep, direct, access);
> > > -       return kvm_mmu_get_shadow_page(vcpu, gfn, role);
> > > +       nid = kvm_pfn_to_page_table_nid(pfn);
> > > +
> > > +       return kvm_mmu_get_shadow_page(vcpu, gfn, role, nid);
> > >  }
> > >
> > >  static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
> > > @@ -3208,7 +3223,8 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >                 if (it.level == fault->goal_level)
> > >                         break;
> > >
> > > -               sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
> > > +               sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true,
> > > +                                         ACC_ALL, fault->pfn);
> > >                 if (sp == ERR_PTR(-EEXIST))
> > >                         continue;
> > >
> > > @@ -3636,7 +3652,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> > >         WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
> > >         WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
> > >
> > > -       sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
> > > +       sp = kvm_mmu_get_shadow_page(vcpu, gfn, role, numa_mem_id());
> > >         ++sp->root_count;
> > >
> > >         return __pa(sp->spt);
> > > @@ -5952,7 +5968,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > >
> > >  int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > >  {
> > > -       int ret;
> > > +       int ret, nid;
> > >
> > >         INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
> > >                                   pte_list_desc_cache, NUMA_NO_NODE);
> > > @@ -5960,8 +5976,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > >         INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
> > >                                   mmu_page_header_cache, NUMA_NO_NODE);
> > >
> > > -       INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
> > > -                                 NULL, NUMA_NO_NODE);
> > > +       for_each_node(nid)
> > > +               INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache[nid],
> > > +                                         NULL, nid);
> > >         spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > >         vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > > @@ -6692,13 +6709,17 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > >  }
> > >
> > >  static unsigned long mmu_shrink_cache(struct kvm_mmu_memory_cache *cache,
> > > +                                      int cache_count,
> > >                                        spinlock_t *cache_lock)
> > >  {
> > >         unsigned long freed = 0;
> > > +       int nid;
> > >
> > >         spin_lock(cache_lock);
> > > -       if (cache->nobjs)
> > > -               freed = kvm_mmu_empty_memory_cache(cache);
> > > +       for (nid = 0; nid < cache_count; nid++) {
> > > +               if (node_online(nid) && cache[nid].nobjs)
> >
> > Is there any reason to keep the cache if !node_online(nid)?
> > Actually, I'd also just drop the cache_count argument and always
> > iterate over the entire array, only checking nobjs. There's no
> > guarantee I'm aware of that the set of nodes has a sequential series
> > of IDs starting at 0 and you'd get a bug if that wasn't the case since
> > it only iterates to nid < cache_count here but some of the earlier
> > nids might not have been online.
> >
>
> This is just temporary and will be removed in the next patch in the series.
>
> mmu_shrink_cache() is used for both split_shadow_page_cache (single
> object) and mmu_shadow_page_cache[MAX_NUMNODES].
>
> In the next patch of this series, I used for_each_online_node(nid); I
> will change it to for_each_node() in the next version.
>
> > > +                       freed += kvm_mmu_empty_memory_cache(&cache[nid]);
> > > +       }
> > >         spin_unlock(cache_lock);
> > >         return freed;
> > >  }
> > > @@ -6721,13 +6742,15 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > >                 list_move_tail(&kvm->vm_list, &vm_list);
> > >
> > >                 freed += mmu_shrink_cache(&kvm->arch.split_shadow_page_cache,
> > > +                                         1,
> >
> > So lonely.
> > One.
> > All by itself,
> > with only a comma for company.
> >
> > NIT: This could be merged to the previous or subsequent lines.
>
> This is a strong and independent '1'.
>
> >
> > >                                           &kvm->arch.split_shadow_page_cache_lock);
> > >
> > >                 if (freed >= sc->nr_to_scan)
> > >                         break;
> > >
> > >                 kvm_for_each_vcpu(i, vcpu, kvm) {
> > > -                       freed += mmu_shrink_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > +                       freed += mmu_shrink_cache(vcpu->arch.mmu_shadow_page_cache,
> > > +                                                 MAX_NUMNODES,
> > >                                                   &vcpu->arch.mmu_shadow_page_cache_lock);
> > >                 }
> > >
> > > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > > index e5662dbd519c..1ceca62ec4cf 100644
> > > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > > @@ -652,7 +652,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >                 table_gfn = gw->table_gfn[it.level - 2];
> > >                 access = gw->pt_access[it.level - 2];
> > >                 sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
> > > -                                         false, access);
> > > +                                         false, access, fault->pfn);
> > >
> > >                 if (sp != ERR_PTR(-EEXIST)) {
> > >                         /*
> > > @@ -708,7 +708,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >                 validate_direct_spte(vcpu, it.sptep, direct_access);
> > >
> > >                 sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
> > > -                                         true, direct_access);
> > > +                                         true, direct_access, fault->pfn);
> > >                 if (sp == ERR_PTR(-EEXIST))
> > >                         continue;
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 376b8dceb3f9..b5abae2366dd 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -259,12 +259,12 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > >                  kvm_mmu_page_as_id(_root) != _as_id) {              \
> > >         } else
> > >
> > > -static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> > > +static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
> > >  {
> > >         struct kvm_mmu_page *sp;
> > >
> > >         sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > > -       sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> > > +       sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid],
> > >                                                 &vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > >         return sp;
> > > @@ -317,7 +317,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > >                 goto out;
> > >         }
> > >
> > > -       root = tdp_mmu_alloc_sp(vcpu);
> > > +       root = tdp_mmu_alloc_sp(vcpu, numa_mem_id());
> >
> > Might be worth calling out somewhere that the root page is just
> > allocated based on where the thread allocating it runs.
> >
>
> How about a comment just up here or do you prefer at tdp_mmu_roots in
> struct kvm_arch{}?

Here or just in the commit description or cover letter. Thanks!

>
> > >         tdp_mmu_init_sp(root, NULL, 0, role);
> > >
> > >         refcount_set(&root->tdp_mmu_root_count, 1);
> > > @@ -1149,7 +1149,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >         struct kvm *kvm = vcpu->kvm;
> > >         struct tdp_iter iter;
> > >         struct kvm_mmu_page *sp;
> > > -       int ret = RET_PF_RETRY;
> > > +       int ret = RET_PF_RETRY, nid;
> > >
> > >         kvm_mmu_hugepage_adjust(vcpu, fault);
> > >
> > > @@ -1178,11 +1178,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >                     !is_large_pte(iter.old_spte))
> > >                         continue;
> > >
> > > +               nid = kvm_pfn_to_page_table_nid(fault->pfn);
> > >                 /*
> > >                  * The SPTE is either non-present or points to a huge page that
> > >                  * needs to be split.
> > >                  */
> > > -               sp = tdp_mmu_alloc_sp(vcpu);
> > > +               sp = tdp_mmu_alloc_sp(vcpu, nid);
> > >                 tdp_mmu_init_child_sp(sp, &iter);
> > >
> > >                 sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index d96c8146e9ba..4f3db7ffeba8 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -415,7 +415,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> > >         if (mc->kmem_cache)
> > >                 return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> > >         else
> > > -               return (void *)__get_free_page(gfp_flags);
> > > +               return kvm_mmu_get_free_page(mc->node, gfp_flags);
> >
> > You could do part of this change in the commit that introduced
> > kvm_mmu_get_free_page too.
>
> Yeah, I can do it there as well. No strong opinions. I will update in
> the next version.
>
> > >  }
> > >
> > >  int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> > > --
> > > 2.39.0.314.g84b9a713c41-goog
> > >
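
P.S. For anyone reading this patch in isolation: it relies on two helpers
that are not visible in this diff and presumably come from earlier patches
in the series, kvm_pfn_to_page_table_nid() and kvm_mmu_get_free_page().
The sketches below are only my rough, untested mental model of them -- the
names come from the diff above, the bodies do not -- included just to make
the discussion easier to follow:

    /* Assumed shape: return the node of the page backing @pfn, falling
     * back to the local node when the pfn has no struct page (e.g. MMIO).
     */
    static inline int kvm_pfn_to_page_table_nid(kvm_pfn_t pfn)
    {
            if (pfn_valid(pfn))
                    return page_to_nid(pfn_to_page(pfn));

            return numa_mem_id();
    }

    /* Assumed shape: a NUMA-aware stand-in for __get_free_page(). */
    static inline void *kvm_mmu_get_free_page(int nid, gfp_t gfp_flags)
    {
            struct page *page = alloc_pages_node(nid, gfp_flags, 0);

            return page ? page_address(page) : NULL;
    }

With something like that in place, mmu_memory_cache_alloc_obj() above pulls
shadow-page memory from the node recorded in the cache (mc->node), which is
what the per-node mmu_shadow_page_cache[] array in kvm_vcpu_arch is set up
to track.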