On Wed, May 15, 2024, Maxim Levitsky wrote:
> Small note on why we started seeing this failure on RHEL 9 and only on
> some machines:
>
> - RHEL9 has MGLRU enabled, RHEL8 doesn't.

For a stopgap in KVM selftests, or possibly even a long term solution in
case the decision is that page_idle will simply have different behavior
for MGLRU, couldn't we tweak the test to not assert if MGLRU is enabled?

E.g. refactor get_module_param_integer() and/or get_module_param() to add
get_sysfs_value_integer() or so, and then do this?

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..1e759df36098 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -123,6 +123,11 @@ static void mark_page_idle(int page_idle_fd, uint64_t pfn)
 		    "Set page_idle bits for PFN 0x%" PRIx64, pfn);
 }
 
+static bool is_lru_gen_enabled(void)
+{
+	return !!get_sysfs_value_integer("/sys/kernel/mm/lru_gen/enabled");
+}
+
 static void mark_vcpu_memory_idle(struct kvm_vm *vm,
 				  struct memstress_vcpu_args *vcpu_args)
 {
@@ -185,7 +190,8 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
 	 */
 	if (still_idle >= pages / 10) {
 #ifdef __x86_64__
-		TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
+		TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR) ||
+			    is_lru_gen_enabled(),
 			    "vCPU%d: Too many pages still idle (%lu out of %lu)",
 			    vcpu_idx, still_idle, pages);
 #endif

> - machine needs to have more than one NUMA node because NUMA balancing
>   (enabled by default) apparently tries to write-protect the primary PTEs
>   of (all?) processes every few seconds, and that causes KVM to flush the
>   secondary PTEs (at least with the new TDP MMU):
>
>   access_tracking-3448  [091] ....1..  1380.244666: handle_changed_spte <-tdp_mmu_set_spte
>   access_tracking-3448  [091] ....1..  1380.244667: <stack trace>
>   => cdc_driver_init
>   => handle_changed_spte
>   => tdp_mmu_set_spte
>   => tdp_mmu_zap_leafs
>   => kvm_tdp_mmu_unmap_gfn_range
>   => kvm_unmap_gfn_range
>   => kvm_mmu_notifier_invalidate_range_start
>   => __mmu_notifier_invalidate_range_start
>   => change_p4d_range
>   => change_protection
>   => change_prot_numa
>   => task_numa_work
>   => task_work_run
>   => exit_to_user_mode_prepare
>   => syscall_exit_to_user_mode
>   => do_syscall_64
>   => entry_SYSCALL_64_after_hwframe
>
> It's a separate question if NUMA balancing should do this, or if NUMA
> balancing should be enabled by default,

FWIW, IMO, enabling NUMA balancing on a system whose primary purpose is to
run VMs is a bad idea.  NUMA balancing operates under the assumption that
a !PRESENT #PF is relatively cheap.  When secondary MMUs are involved,
that is simply not the case, e.g. to honor the mmu_notifier event, KVM
zaps _and_ does a remote TLB flush.  Even if we reworked KVM and/or the
mmu_notifiers so that KVM didn't need to do such a heavy operation, the
cost of a page fault VM-Exit is significantly higher than the cost of a
host #PF.

> because there are other reasons that can force KVM to invalidate the
> secondary mappings and trigger this issue.

Ya.
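For reference, the get_sysfs_value_integer() helper the diff assumes does
not exist yet; a minimal sketch of what it might look like (the name,
signature, and error handling are assumptions, loosely modeled on the
existing get_module_param() helpers rather than taken from any posted
patch):

```c
/*
 * Sketch only: get_sysfs_value_integer() is a hypothetical helper, not an
 * existing selftests API.  Reads a small sysfs file and parses it as an
 * integer.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static long get_sysfs_value_integer(const char *path)
{
	char buf[64];
	ssize_t r;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror(path);
		exit(1);
	}

	r = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (r < 0) {
		perror(path);
		exit(1);
	}
	buf[r] = '\0';

	/*
	 * Parse with base 0 so both plain decimal and the "0x..." hex
	 * bitmask that lru_gen/enabled reports are accepted.
	 */
	return strtol(buf, NULL, 0);
}
```

Note that /sys/kernel/mm/lru_gen/enabled reports a hex feature bitmask
(e.g. "0x0007"), so any nonzero value means MGLRU is at least partially
enabled, which is all the !! in is_lru_gen_enabled() cares about.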