TLDR
====
This patchset RCU-protects KVM page tables and compare-and-exchanges
KVM PTEs with the accessed bit set by hardware. It significantly
improves the performance of guests when the host is under heavy
memory pressure.

ChromeOS has been using a similar approach [1] since mid-2021, and it
has proven successful on tens of millions of devices.

[1] https://crrev.com/c/2987928

Overview
========
The goal of this patchset is to optimize the performance of guests
when the host memory is overcommitted. It focuses on the vast
majority of VMs that are not nested and run on hardware that sets the
accessed bit in KVM page tables.

Note that nested VMs and hardware that does not support the accessed
bit are both out of scope.

This patchset relies on two techniques, RCU and cmpxchg, to safely
test and clear the accessed bit without taking kvm->mmu_lock. The
former protects KVM page tables from being freed while the latter
clears the accessed bit atomically against both hardware and other
software page table walkers.

A new MMU notifier API, mmu_notifier_test_clear_young(), is
introduced. It follows two design patterns: fallback and batching.
For any unsupported cases, it can optionally fall back to
mmu_notifier_ops->clear_young(). For a range of KVM PTEs, it can test
or test and clear their accessed bits according to a bitmap provided
by the caller.

This patchset only applies mmu_notifier_test_clear_young() to MGLRU.
A follow-up patchset will apply it to /proc/PID/pagemap and
/proc/PID/clear_refs.

Evaluation
==========
An existing selftest can quickly demonstrate the effectiveness of
this patchset.
On a generic workstation equipped with 64 CPUs and 256GB DRAM:

  $ sudo max_guest_memory_test -c 64 -m 256 -s 256

  MGLRU      run2
  ---------------
  Before    ~600s
  After      ~50s
  Off       ~250s

  kswapd (MGLRU before)

      100.00%  balance_pgdat
      100.00%  shrink_node
      100.00%  shrink_one
       99.97%  try_to_shrink_lruvec
       99.06%  evict_folios
       97.41%  shrink_folio_list
       31.33%  folio_referenced
       31.06%  rmap_walk_file
       30.89%  folio_referenced_one
       20.83%  __mmu_notifier_clear_flush_young
       20.54%  kvm_mmu_notifier_clear_flush_young
  =>   19.34%  _raw_write_lock

  kswapd (MGLRU after)

      100.00%  balance_pgdat
      100.00%  shrink_node
      100.00%  shrink_one
       99.97%  try_to_shrink_lruvec
       99.51%  evict_folios
       71.70%  shrink_folio_list
        7.08%  folio_referenced
        6.78%  rmap_walk_file
        6.72%  folio_referenced_one
        5.60%  lru_gen_look_around
  =>    1.53%  __mmu_notifier_test_clear_young

  kswapd (MGLRU off)

      100.00%  balance_pgdat
      100.00%  shrink_node
       99.92%  shrink_lruvec
       69.95%  shrink_folio_list
       19.35%  folio_referenced
       18.37%  rmap_walk_file
       17.88%  folio_referenced_one
       13.20%  __mmu_notifier_clear_flush_young
       11.64%  kvm_mmu_notifier_clear_flush_young
  =>    9.93%  _raw_write_lock
       26.23%  shrink_active_list
       25.50%  folio_referenced
       25.35%  rmap_walk_file
       25.28%  folio_referenced_one
       23.87%  __mmu_notifier_clear_flush_young
       23.69%  kvm_mmu_notifier_clear_flush_young
  =>   18.98%  _raw_write_lock

Comprehensive benchmarks are coming soon.
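
For readers unfamiliar with the cmpxchg technique mentioned in the
overview, the following is a minimal userspace sketch of a lockless
test-and-clear of an accessed bit. It is not the code in this
patchset. It assumes C11 atomics as a stand-in for the kernel's
cmpxchg, an invented bit position (DEMO_PTE_ACCESSED), and a
hypothetical helper name (demo_test_clear_young). RCU protection of
the page tables themselves is not modeled here.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position; real PTE layouts are architecture-specific. */
#define DEMO_PTE_ACCESSED (UINT64_C(1) << 5)

/*
 * Test and, if 'clear' is true, clear the accessed bit of one PTE
 * without taking a lock. Returns true if the bit was observed set.
 */
bool demo_test_clear_young(_Atomic uint64_t *ptep, bool clear)
{
	uint64_t old = atomic_load(ptep);

	while (old & DEMO_PTE_ACCESSED) {
		if (!clear)
			return true;
		/*
		 * The exchange only succeeds if the PTE still equals 'old'.
		 * On failure, 'old' is refreshed with the current value and
		 * the loop re-checks the bit, so a concurrent update by
		 * another walker (or by hardware, in the real case) is
		 * never overwritten.
		 */
		if (atomic_compare_exchange_weak(ptep, &old,
						 old & ~DEMO_PTE_ACCESSED))
			return true;
	}

	return false;
}
```

If another walker clears the bit first, the loop exits and the helper
reports the PTE as not young, which seems the natural outcome for an
aging pass.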
Yu Zhao (5):
  mm/kvm: add mmu_notifier_test_clear_young()
  kvm/x86: add kvm_arch_test_clear_young()
  kvm/arm64: add kvm_arch_test_clear_young()
  kvm/powerpc: add kvm_arch_test_clear_young()
  mm: multi-gen LRU: use mmu_notifier_test_clear_young()

 arch/arm64/include/asm/kvm_host.h       |   7 ++
 arch/arm64/include/asm/kvm_pgtable.h    |   8 ++
 arch/arm64/include/asm/stage2_pgtable.h |  43 ++++++++
 arch/arm64/kvm/arm.c                    |   1 +
 arch/arm64/kvm/hyp/pgtable.c            |  51 ++--------
 arch/arm64/kvm/mmu.c                    |  77 +++++++++++++-
 arch/powerpc/include/asm/kvm_host.h     |  18 ++++
 arch/powerpc/include/asm/kvm_ppc.h      |  14 +--
 arch/powerpc/kvm/book3s.c               |   7 ++
 arch/powerpc/kvm/book3s.h               |   2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c  |  78 ++++++++++++++-
 arch/powerpc/kvm/book3s_hv.c            |  10 +-
 arch/x86/include/asm/kvm_host.h         |  27 +++++
 arch/x86/kvm/mmu/spte.h                 |  12 ---
 arch/x86/kvm/mmu/tdp_mmu.c              |  41 ++++++++
 include/linux/kvm_host.h                |  29 ++++++
 include/linux/mmu_notifier.h            |  40 ++++++++
 include/linux/mmzone.h                  |   6 +-
 mm/mmu_notifier.c                       |  26 +++++
 mm/rmap.c                               |   8 +-
 mm/vmscan.c                             | 127 +++++++++++++++++++++---
 virt/kvm/kvm_main.c                     |  58 +++++++++++
 22 files changed, 593 insertions(+), 97 deletions(-)

--
2.39.2.637.g21b0678d19-goog