On Mon, Aug 05, 2024, David Matlack wrote:
> On Thu, Aug 1, 2024 at 11:35 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > This applies on top of the massive "follow pfn" rework[*]. The gist is to
> > avoid losing accessed information, e.g. because NUMA balancing mucks with
> > PTEs,
>
> What do you mean by "NUMA balancing mucks with PTEs"?

When NUMA auto-balancing is enabled, for VMAs the current task has been
accessing, the kernel will periodically change PTEs (in the primary MMU) to
PROT_NONE, i.e. make them !PRESENT. That in turn results in mmu_notifier
invalidations (usually for the entire VMA, eventually) that cause KVM to
unmap SPTEs. If KVM doesn't mark folios accessed when SPTEs are zapped, the
NUMA-induced zapping effectively loses the accessed information.

For non-KVM setups, NUMA balancing works quite well because the cost of the
#PF to "fix" the NUMA-induced PROT_NONE is relatively cheap, especially
compared to the long-term costs of accessing remote memory.

For KVM, the cost vs. benefit is very different, as each mmu_notifier
invalidation forces KVM to emit a remote TLB flush, i.e. the cost is much
higher. And it's also much more feasible (in practice) to affine vCPUs to
single NUMA nodes, even if vCPUs are pinned 1:1 with pCPUs, than it is to
affine a random userspace task to a NUMA node. Which is why I'm not terribly
concerned about optimizing NUMA auto-balancing; it's already sub-optimal for
KVM.
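
To make the "mark folios accessed when SPTEs are zapped" point concrete,
below is a minimal sketch modeled on KVM x86's pre-rework
mmu_spte_clear_track_bits(). kvm_sketch_zap_spte() is a hypothetical
stand-in; is_shadow_present_pte(), is_accessed_spte(), is_dirty_spte() and
spte_to_pfn() are existing KVM x86 helpers, and kvm_set_pfn_accessed() /
kvm_set_pfn_dirty() are the pfn-based helpers that the follow-pfn rework
replaces with folio-based accounting.

/*
 * Illustrative sketch, not the actual series: when tearing down a SPTE,
 * e.g. in response to an mmu_notifier invalidation triggered by NUMA
 * balancing's PROT_NONE conversion, propagate the SPTE's Accessed/Dirty
 * bits back to the primary MMU so the information isn't silently lost.
 */
static void kvm_sketch_zap_spte(u64 *sptep)
{
	u64 old_spte = *sptep;

	if (!is_shadow_present_pte(old_spte))
		return;

	/* The real code clears the SPTE atomically and handles volatile bits. */
	WRITE_ONCE(*sptep, 0);

	if (is_accessed_spte(old_spte))
		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
	if (is_dirty_spte(old_spte))
		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
}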
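
On the cost side, a rough sketch (assuming x86's rwlock-based mmu_lock) of
why each invalidation is so much more expensive for KVM than a #PF is for a
plain userspace task; kvm_sketch_unmap_hva_range() is hypothetical, while
kvm_flush_remote_tlbs() is the real helper that forces a TLB flush on every
vCPU.

/*
 * Rough sketch of the invalidation cost: after zapping SPTEs for the
 * range, KVM must flush TLBs across all vCPUs (IPIs to remote pCPUs),
 * whereas a regular task just takes a cheap local #PF to re-fault the
 * PROT_NONE PTE.
 */
static void kvm_sketch_invalidate_range(struct kvm *kvm,
					unsigned long start, unsigned long end)
{
	write_lock(&kvm->mmu_lock);

	/* Hypothetical helper; returns true if any SPTEs were zapped. */
	if (kvm_sketch_unmap_hva_range(kvm, start, end))
		kvm_flush_remote_tlbs(kvm);

	write_unlock(&kvm->mmu_lock);
}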