TLDR
====
This patchset adds a fast path to clear the accessed bit without
taking kvm->mmu_lock. It can significantly improve the performance of
guests when the host is under heavy memory pressure.

ChromeOS has been using a similar approach [1] since mid 2021 and it
was proven successful on tens of millions of devices.

This v2 addresses previous requests [2] by refactoring code, removing
inaccurate/redundant text, etc.

[1] https://crrev.com/c/2987928
[2] https://lore.kernel.org/r/20230217041230.2417228-1-yuzhao@xxxxxxxxxx/

Overview
========
The goal of this patchset is to optimize the performance of guests
when the host memory is overcommitted. It focuses on a simple yet
common case where hardware sets the accessed bit in KVM PTEs and VMs
are not nested. Complex cases fall back to the existing slow path,
where kvm->mmu_lock is then taken.

The fast path relies on two techniques to safely clear the accessed
bit: RCU and CAS. The former protects KVM page tables from being
freed while the latter clears the accessed bit atomically against
both the hardware and other software page table walkers. A simplified
sketch of this fast path is included after the evaluation below.

A new mmu_notifier_ops member, test_clear_young(), supersedes the
existing clear_young() and test_young(). This extended callback can
operate on a range of KVM PTEs individually according to a bitmap, if
the caller provides it.

Evaluation
==========
An existing selftest can quickly demonstrate the effectiveness of
this patchset. On a generic workstation equipped with 128 CPUs and
256GB DRAM:

  $ sudo max_guest_memory_test -c 64 -m 250 -s 250

  MGLRU         run2
  ------------------
  Before [1]    ~64s
  After         ~51s

  kswapd (MGLRU before)
    100.00% balance_pgdat
      100.00% shrink_node
        100.00% shrink_one
          99.99% try_to_shrink_lruvec
            99.71% evict_folios
              97.29% shrink_folio_list
  ==>>          13.05% folio_referenced
                  12.83% rmap_walk_file
                    12.31% folio_referenced_one
                      7.90% __mmu_notifier_clear_young
                        7.72% kvm_mmu_notifier_clear_young
                          7.34% _raw_write_lock

  kswapd (MGLRU after)
    100.00% balance_pgdat
      100.00% shrink_node
        100.00% shrink_one
          99.99% try_to_shrink_lruvec
            99.59% evict_folios
              80.37% shrink_folio_list
  ==>>           3.74% folio_referenced
                   3.59% rmap_walk_file
                     3.19% folio_referenced_one
                       2.53% lru_gen_look_around
                       1.06% __mmu_notifier_test_clear_young

Comprehensive benchmarks are coming soon.

[1] "mm: rmap: Don't flush TLB after checking PTE young for page
    reference" was included so that the comparison is apples to
    apples.
    https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@xxxxxxxxx/
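Fast path sketch
================
For illustration only, the snippet below captures the idea behind the
fast path: read and clear the accessed bit inside an RCU read-side
section, using a CAS so the update is atomic against both the
hardware setting the bit and other software page table walkers. The
names used here (kvm_pte_t redefined as a plain u64, KVM_PTE_ACCESSED,
fast_test_clear_young) are placeholders for this sketch only and are
not the identifiers used in the patches.

  #include <linux/atomic.h>
  #include <linux/bits.h>
  #include <linux/rcupdate.h>
  #include <linux/types.h>

  typedef u64 kvm_pte_t;                   /* placeholder for this sketch */
  #define KVM_PTE_ACCESSED        BIT(10)  /* placeholder accessed bit */

  static bool fast_test_clear_young(kvm_pte_t *ptep)
  {
          kvm_pte_t old;
          bool young = false;

          /* RCU keeps the page table page from being freed under us. */
          rcu_read_lock();

          old = READ_ONCE(*ptep);
          while (old & KVM_PTE_ACCESSED) {
                  young = true;
                  /*
                   * CAS so the clearing is atomic against the hardware
                   * and other software walkers; on failure, old is
                   * refreshed and the loop retries.
                   */
                  if (try_cmpxchg64(ptep, &old, old & ~KVM_PTE_ACCESSED))
                          break;
          }

          rcu_read_unlock();

          return young;
  }

The actual patches hook logic of this shape into the new
test_clear_young() callback, so that the MGLRU reclaim path can test
and clear the accessed bit for a batch of KVM PTEs without ever
taking kvm->mmu_lock.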
Yu Zhao (10):
  mm/kvm: add mmu_notifier_ops->test_clear_young()
  mm/kvm: use mmu_notifier_ops->test_clear_young()
  kvm/arm64: export stage2_try_set_pte() and macros
  kvm/arm64: make stage2 page tables RCU safe
  kvm/arm64: add kvm_arch_test_clear_young()
  kvm/powerpc: make radix page tables RCU safe
  kvm/powerpc: add kvm_arch_test_clear_young()
  kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask
  kvm/x86: add kvm_arch_test_clear_young()
  mm: multi-gen LRU: use mmu_notifier_test_clear_young()

 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 arch/arm64/include/asm/kvm_host.h             |   6 +
 arch/arm64/include/asm/kvm_pgtable.h          |  55 +++++++
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/hyp/pgtable.c                  |  61 +-------
 arch/arm64/kvm/mmu.c                          |  53 ++++++-
 arch/powerpc/include/asm/kvm_host.h           |   8 +
 arch/powerpc/include/asm/kvm_ppc.h            |   1 +
 arch/powerpc/kvm/book3s.c                     |   6 +
 arch/powerpc/kvm/book3s.h                     |   1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c        |  65 +++++++-
 arch/powerpc/kvm/book3s_hv.c                  |   5 +
 arch/x86/include/asm/kvm_host.h               |  13 ++
 arch/x86/kvm/mmu.h                            |   6 -
 arch/x86/kvm/mmu/spte.h                       |   1 -
 arch/x86/kvm/mmu/tdp_mmu.c                    |  34 +++++
 include/linux/kvm_host.h                      |  22 +++
 include/linux/mmu_notifier.h                  |  79 ++++++----
 include/linux/mmzone.h                        |   6 +-
 include/trace/events/kvm.h                    |  15 --
 mm/mmu_notifier.c                             |  48 ++----
 mm/rmap.c                                     |   8 +-
 mm/vmscan.c                                   | 139 ++++++++++++++--
 virt/kvm/kvm_main.c                           | 114 ++++++------
 24 files changed, 546 insertions(+), 207 deletions(-)

-- 
2.41.0.rc0.172.g3f132b7071-goog