On Wed, May 15, 2024, Maxim Levitsky wrote:
> Small note on why we started seeing this failure on RHEL 9 and only on
> some machines:
>
> - RHEL9 has MGLRU enabled, RHEL8 doesn't.

For a stopgap in KVM selftests, or possibly even a long term solution in
case the decision is that page_idle will simply have different behavior
for MGLRU, couldn't we tweak the test to not assert if MGLRU is enabled?

E.g. refactor get_module_param_integer() and/or get_module_param() to add
get_sysfs_value_integer() or so, and then do this?

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..1e759df36098 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -123,6 +123,11 @@ static void mark_page_idle(int page_idle_fd, uint64_t pfn)
 		    "Set page_idle bits for PFN 0x%" PRIx64, pfn);
 }
 
+static bool is_lru_gen_enabled(void)
+{
+	return !!get_sysfs_value_integer("/sys/kernel/mm/lru_gen/enabled");
+}
+
 static void mark_vcpu_memory_idle(struct kvm_vm *vm,
 				  struct memstress_vcpu_args *vcpu_args)
 {
@@ -185,7 +190,8 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
 	 */
 	if (still_idle >= pages / 10) {
 #ifdef __x86_64__
-		TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
+		TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR) ||
+			    is_lru_gen_enabled(),
 			    "vCPU%d: Too many pages still idle (%lu out of %lu)",
 			    vcpu_idx, still_idle, pages);
 #endif

> - machine needs to have more than one NUMA node because NUMA balancing
>   (enabled by default) apparently tries to write-protect the primary PTEs
>   of (all?) processes every few seconds, and that causes KVM to flush the
>   secondary PTEs (at least with the new TDP MMU):
>
>   access_tracking-3448  [091] ....1..  1380.244666: handle_changed_spte <-tdp_mmu_set_spte
>   access_tracking-3448  [091] ....1..  1380.244667: <stack trace>
>   => cdc_driver_init
>   => handle_changed_spte
>   => tdp_mmu_set_spte
>   => tdp_mmu_zap_leafs
>   => kvm_tdp_mmu_unmap_gfn_range
>   => kvm_unmap_gfn_range
>   => kvm_mmu_notifier_invalidate_range_start
>   => __mmu_notifier_invalidate_range_start
>   => change_p4d_range
>   => change_protection
>   => change_prot_numa
>   => task_numa_work
>   => task_work_run
>   => exit_to_user_mode_prepare
>   => syscall_exit_to_user_mode
>   => do_syscall_64
>   => entry_SYSCALL_64_after_hwframe
>
> It's a separate question if NUMA balancing should do this, or if NUMA
> balancing should be enabled by default,

FWIW, IMO, enabling NUMA balancing on a system whose primary purpose is to
run VMs is a bad idea.  NUMA balancing operates under the assumption that
a !PRESENT #PF is relatively cheap.  When secondary MMUs are involved,
that is simply not the case, e.g. to honor the mmu_notifier event, KVM
zaps _and_ does a remote TLB flush.  Even if we reworked KVM and/or the
mmu_notifiers so that KVM didn't need to do such a heavy operation, the
cost of a page fault VM-Exit is significantly higher than the cost of a
host #PF.

> because there are other reasons that can force KVM to invalidate the
> secondary mappings and trigger this issue.

Ya.
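For reference, the get_sysfs_value_integer() helper the diff assumes does
not exist yet; a minimal sketch of what it might look like (the name,
signature, and error handling are assumptions, loosely modeled on the
existing get_module_param() helpers rather than taken from any posted
patch):

```c
/*
 * Sketch only: get_sysfs_value_integer() is a hypothetical helper, not an
 * existing selftests API.  Reads a small sysfs file and parses it as an
 * integer.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static long get_sysfs_value_integer(const char *path)
{
	char buf[64];
	ssize_t r;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror(path);
		exit(1);
	}

	r = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (r < 0) {
		perror(path);
		exit(1);
	}
	buf[r] = '\0';

	/*
	 * Parse with base 0 so both plain decimal and the "0x..." hex
	 * bitmask that lru_gen/enabled reports are accepted.
	 */
	return strtol(buf, NULL, 0);
}
```

Note that /sys/kernel/mm/lru_gen/enabled reports a hex feature bitmask
(e.g. "0x0007"), so any nonzero value means MGLRU is at least partially
enabled, which is all the !! in is_lru_gen_enabled() cares about.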