On Mon, Aug 01, 2022, David Matlack wrote:
> On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote:
> That being said, KVM currently has a gap where a guest doing a lot of
> remote memory accesses when touching memory for the first time will
> cause KVM to allocate the TDP page tables on the arguably wrong node.

Userspace can solve this by setting the NUMA policy on a VMA or
shared-object basis.  E.g. create dedicated memslots for each NUMA node,
then bind each of the backing stores to the appropriate host node (rough
sketch at the end of this mail).

If there is a gap, e.g. a backing store we want to use doesn't properly
support mempolicy for shared mappings, then we should enhance the backing
store.

> > We can improve TDP MMU eager page splitting by making
> > tdp_mmu_alloc_sp_for_split() NUMA-aware.  Specifically, when splitting
> > a huge page, allocate the new lower level page tables on the same node
> > as the huge page.
> >
> > __get_free_page() is replaced by alloc_pages_node().  This introduces
> > two functional changes.
> >
> > 1. __get_free_page() removes the gfp flag __GFP_HIGHMEM via its call to
> > __get_free_pages().  This should not be an issue as the __GFP_HIGHMEM
> > flag is not passed in tdp_mmu_alloc_sp_for_split() anyway.
> >
> > 2. __get_free_page() calls alloc_pages() and uses the thread's
> > mempolicy for the NUMA node allocation.  With this commit, the thread's
> > mempolicy will not be used; the first preference will be to allocate on
> > the node where the huge page is present.
>
> It would be worth noting that userspace could change the mempolicy of
> the thread doing eager splitting to prefer allocating from the target
> NUMA node, as an alternative approach.
>
> I don't prefer the alternative though since it bleeds details from KVM
> into userspace, such as the fact that enabling dirty logging does eager
> page splitting, which allocates page tables.

As above, if userspace cares about vNUMA, then it already needs to be
aware of some KVM/kernel details.  Separate memslots aren't strictly
necessary, e.g. userspace could stitch together contiguous VMAs to create
a single mega-memslot, but that seems like it'd be more work than just
creating separate memslots.

And because eager page splitting for dirty logging runs with mmu_lock held
for read, userspace might also benefit from per-node memslots as it can do
the splitting on multiple tasks/CPUs.

Regardless of what we do, the behavior needs to be documented, i.e. KVM
details will bleed into userspace either way.  E.g. if KVM is overriding
the per-task NUMA policy, then that should be documented.

> It's also unnecessary since KVM can infer an appropriate NUMA placement
> without the help of userspace, and I can't think of a reason for
> userspace to prefer a different policy.

I can't think of a reason why userspace would want to have a different
policy for the task that's enabling dirty logging, but I also can't think
of a reason why KVM should go out of its way to ignore that policy.

IMO this is a "bug" in dirty_log_perf_test, though it's probably a good
idea to document how to effectively configure vNUMA-aware memslots.
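
As a concrete illustration of the per-node memslot suggestion above, here
is a rough userspace sketch: one memslot per NUMA node, with each slot's
backing store bound to its host node via mbind().  The helper name,
slot/GPA layout, and use of anonymous memory are illustrative assumptions,
not taken from any existing VMM or selftest; link with -lnuma for the
mbind() wrapper.

  /*
   * Rough sketch only: create one memslot whose backing VMA is bound to a
   * specific host NUMA node.  Error handling is minimal and the slot/GPA
   * layout is entirely up to the VMM.
   */
  #include <string.h>
  #include <numaif.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <linux/kvm.h>

  static int add_node_memslot(int vm_fd, unsigned int slot, int node,
  			      __u64 gpa, size_t size)
  {
  	struct kvm_userspace_memory_region region;
  	unsigned long nodemask = 1UL << node;
  	void *mem;

  	mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
  		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  	if (mem == MAP_FAILED)
  		return -1;

  	/* Bind the VMA backing this slot to the desired host node. */
  	if (mbind(mem, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
  		  MPOL_MF_STRICT)) {
  		munmap(mem, size);
  		return -1;
  	}

  	memset(&region, 0, sizeof(region));
  	region.slot = slot;
  	region.guest_phys_addr = gpa;
  	region.memory_size = size;
  	region.userspace_addr = (unsigned long)mem;

  	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }

The same idea applies to memfd/hugetlbfs backing stores, which is where
the caveat above about backing stores that don't properly support
mempolicy for shared mappings comes in.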
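
And for reference, the kind of kernel-side change the quoted commit
message describes, again only as a hedged sketch (the helper below is made
up, it is not the real tdp_mmu_alloc_sp_for_split()): the node-agnostic
__get_free_page() call is swapped for an allocation explicitly targeted at
the node holding the huge page being split.

  #include <linux/gfp.h>
  #include <linux/mm.h>

  /*
   * Sketch of the allocation swap described above, not the actual patch.
   * @nid is assumed to be the node of the huge page being split.
   */
  static void *alloc_split_page_table(gfp_t gfp, int nid)
  {
  	struct page *page;

  	/*
  	 * Before: node-agnostic, follows the calling thread's mempolicy
  	 * (and __get_free_pages() strips __GFP_HIGHMEM internally):
  	 *
  	 *	return (void *)__get_free_page(gfp);
  	 */

  	/* After: allocate on the node where the huge page resides. */
  	page = alloc_pages_node(nid, gfp, 0);
  	if (!page)
  		return NULL;

  	return page_address(page);
  }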