On Thu, Feb 16, 2023, Yu Zhao wrote:
> An existing selftest can quickly demonstrate the effectiveness of this
> patch. On a generic workstation equipped with 128 CPUs and 256GB DRAM:

Not my area of maintenance, but a non-existent changelog (for all intents
and purposes) for a change of this size and complexity is not acceptable.

>   $ sudo max_guest_memory_test -c 64 -m 250 -s 250
>
>   MGLRU      run2
>   ---------------
>   Before    ~600s
>   After      ~50s
>   Off       ~250s
>
>   kswapd (MGLRU before)
>     100.00%  balance_pgdat
>       100.00%  shrink_node
>         100.00%  shrink_one
>           99.97%  try_to_shrink_lruvec
>             99.06%  evict_folios
>               97.41%  shrink_folio_list
>                 31.33%  folio_referenced
>                   31.06%  rmap_walk_file
>                     30.89%  folio_referenced_one
>                       20.83%  __mmu_notifier_clear_flush_young
>                         20.54%  kvm_mmu_notifier_clear_flush_young
>  =>                      19.34%  _raw_write_lock
>
>   kswapd (MGLRU after)
>     100.00%  balance_pgdat
>       100.00%  shrink_node
>         100.00%  shrink_one
>           99.97%  try_to_shrink_lruvec
>             99.51%  evict_folios
>               71.70%  shrink_folio_list
>                 7.08%  folio_referenced
>                   6.78%  rmap_walk_file
>                     6.72%  folio_referenced_one
>                       5.60%  lru_gen_look_around
>  =>                    1.53%  __mmu_notifier_test_clear_young

Do you happen to know how much of the improvement is due to batching, and
how much is due to using a walkless walk?

> @@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
> 	if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
> 		caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
>
> +	if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
> +		caps |= BIT(LRU_GEN_SPTE_WALK);

As alluded to in patch 1, unless batching the walks even if KVM does _not_
support a lockless walk is somehow _worse_ than using the existing
mmu_notifier_clear_flush_young(), I think batching the calls should be
conditional only on LRU_GEN_SPTE_WALK.  Or if we want to avoid batching
when there are no mmu_notifier listeners, probe mmu_notifiers.  But don't
call into KVM directly.
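To make that concrete, here's a rough, untested sketch of what I have in
mind.  should_walk_secondary_mmu() is a name I just made up; get_cap() and
LRU_GEN_SPTE_WALK are from the diff above, and mm_has_notifiers() is the
existing helper from include/linux/mmu_notifier.h (which mm/vmscan.c
already includes):

	static bool should_walk_secondary_mmu(struct mm_struct *mm)
	{
		/* Batch the secondary MMU aging only when the capability
		 * is enabled...
		 */
		if (!get_cap(LRU_GEN_SPTE_WALK))
			return false;

		/*
		 * ...and only when someone has actually registered an
		 * mmu_notifier on this mm.  mm_has_notifiers() just checks
		 * mm->notifier_subscriptions, so no subscriber, KVM or
		 * otherwise, is called into directly.
		 */
		return mm_has_notifiers(mm);
	}

That keeps the policy decision entirely on the MGLRU side and leaves KVM
hidden behind the mmu_notifier boundary.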