access_tracking_perf_test kvm selftest doesn't work when Multi-Gen LRU is in use

Hi,

I would like to share a long rabbit-hole dive I did some time ago into why the access_tracking_perf_test
selftest sometimes fails, and why it fails only on some RHEL9 machines.

When it fails you see an error like this:

Populating memory : 0.693662489s
Writing to populated memory : 0.022868074s
Reading from populated memory : 0.009497503s
Mark memory idle : 2.206361533s
Writing to idle memory : 0.282340559s
==== Test Assertion Failure ====
access_tracking_perf_test.c:188: this_cpu_has(X86_FEATURE_HYPERVISOR)
pid=78914 tid=78918 errno=4 - Interrupted system call
1 0x0000000000402e99: mark_vcpu_memory_idle at access_tracking_perf_test.c:188
2 (inlined by) vcpu_thread_main at access_tracking_perf_test.c:240
3 0x000000000040745d: vcpu_thread_main at memstress.c:283
4 0x00007f68e66a1911: ?? ??:0
5 0x00007f68e663f44f: ?? ??:0
vCPU0: Too many pages still idle (123013 out of 262144)


access_tracking_perf_test uses the '/sys/kernel/mm/page_idle/bitmap' interface (a minimal
userspace sketch of which follows the list below) to:

	- run the guest once; the guest writes to its memory pages, which allocates and dirties them.

	- clear the A/D bits of the primary and secondary translations of the guest pages
	  (note that it clears the bits only in the actual PTEs).

	- set the so-called 'idle' page flag bit on these pages.

	  (this bit is private to page_idle and is not used by generic mm code; it exists so that
	   idle tracking does not disturb the kernel's own accessed-bit bookkeeping)

	- run the guest again, which dirties those memory pages once more.

	- use the same 'page_idle' interface to check that most (90%) of the guest pages are now
	  reported as accessed again.

	  In page_idle terms, a page is reported as not idle (= accessed) if either:
		- its idle bit is clear, or
		- A/D bits are set in the primary or secondary PTEs that map the page
		  (in this case page_idle also clears the idle bit,
		   so that subsequent queries won't need to check the PTEs again).
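For reference, here is a minimal sketch of how this bitmap can be driven from plain userspace.
This is not the selftest's code (the real test first walks /proc/<pid>/pagemap to find the pfns
backing the guest memory); it only illustrates the interface as documented in
Documentation/admin-guide/mm/idle_page_tracking.rst, needs root plus CONFIG_IDLE_PAGE_TRACKING,
and only takes effect for user pages that sit on an LRU list:

/* page_idle_sketch.c: mark one pfn idle, then check whether it is still idle.
 * Illustration only; error handling is minimal on purpose.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE_IDLE_BITMAP "/sys/kernel/mm/page_idle/bitmap"

/* The bitmap is an array of u64s; each u64 covers 64 consecutive pfns. */
static int mark_pfn_idle(int fd, uint64_t pfn)
{
	uint64_t word = 1ULL << (pfn % 64);

	/* Writing a set bit marks the page idle and clears the A bits in its PTEs. */
	return pwrite(fd, &word, sizeof(word), (pfn / 64) * 8) == sizeof(word) ? 0 : -1;
}

static int pfn_is_idle(int fd, uint64_t pfn)
{
	uint64_t word;

	/* Reading rechecks the PTEs (via rmap and the mmu notifiers) and returns the idle bits. */
	if (pread(fd, &word, sizeof(word), (pfn / 64) * 8) != sizeof(word))
		return -1;
	return !!(word & (1ULL << (pfn % 64)));
}

int main(int argc, char **argv)
{
	uint64_t pfn;
	int fd, idle;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pfn>\n", argv[0]);
		return 1;
	}
	pfn = strtoull(argv[1], NULL, 0);

	fd = open(PAGE_IDLE_BITMAP, O_RDWR);
	if (fd < 0) {
		perror("open " PAGE_IDLE_BITMAP);
		return 1;
	}

	if (mark_pfn_idle(fd, pfn)) {
		perror("pwrite");
		return 1;
	}
	/* ... let the workload touch (or not touch) the page here ... */
	idle = pfn_is_idle(fd, pfn);
	printf("pfn %#llx: %s\n", (unsigned long long)pfn,
	       idle == 1 ? "still idle (not accessed)" : "accessed (or read failed)");
	close(fd);
	return 0;
}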
	  

The problem is that sometimes the secondary translations (that is, the SPTEs) are destroyed/flushed by KVM,
which causes KVM to mark the guest pages that were mapped through these SPTEs as accessed:


KVM calls kvm_set_pfn_accessed(), and this call eventually leads to folio_mark_accessed().

This function used to clear the idle bit of the page
(note, however, that it would not set the accessed bits in the primary translation of this page!).

But with MGLRU enabled it no longer does this:

void folio_mark_accessed(struct folio *folio)
{
	if (lru_gen_enabled()) {
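		/* with MGLRU we take this early return, so folio_clear_idle() below is never reached */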
		folio_inc_refs(folio);
		return;
	}

	....

	if (folio_test_idle(folio))
		folio_clear_idle(folio);
}
EXPORT_SYMBOL(folio_mark_accessed);


Thus when the page_idle code checks the page, it sees no A/D bits in the primary translation,
no A/D bits in the secondary translation (because it no longer exists), and the idle bit still set,
so it considers the page idle, that is, not accessed.
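To spell out the failure condition, here is a toy model of that decision; this is not kernel code,
just the logic described above written as a compilable snippet:

#include <stdbool.h>
#include <stdio.h>

/* Whether the page_idle read reports a page as idle, per the description above. */
static bool reported_idle(bool idle_bit_set, bool primary_accessed, bool secondary_accessed)
{
	return idle_bit_set && !primary_accessed && !secondary_accessed;
}

int main(void)
{
	/* The failing case: the SPTEs were zapped (nothing to find in the secondary
	 * translation), the guest's write went through the secondary translation so
	 * the primary PTE's A bit stayed clear, and with MGLRU folio_mark_accessed()
	 * left the idle bit alone.
	 */
	printf("%d\n", reported_idle(true, false, false));	/* prints 1: wrongly reported idle */
	return 0;
}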

There is a patch series that seems to fix this, but it apparently was not accepted upstream;
I don't know the current status of this work.

https://patchew.org/linux/951fb7edab535cf522def4f5f2613947ed7b7d28.1701853894.git.henry.hj@xxxxxxxxxxxx/


Now the question is: what do you think we should do to fix this?
Should we at least disable the page_idle interface when MGLRU is enabled?


Best regards,
	Maxim Levitsky


PS:

A small note on why we started seeing this failure on RHEL9, and only on some machines:

	- RHEL9 has MGLRU enabled, RHEL8 doesn't.

	- the machine needs to have more than one NUMA node, because NUMA balancing
	  (enabled by default) apparently tries to write-protect the primary PTEs
	  of (all?) processes every few seconds, and that causes KVM to flush the secondary PTEs
	  (at least with the new TDP MMU):

access_tracking-3448    [091] ....1..  1380.244666: handle_changed_spte <-tdp_mmu_set_spte
 access_tracking-3448    [091] ....1..  1380.244667: <stack trace>
 => cdc_driver_init
 => handle_changed_spte
 => tdp_mmu_set_spte
 => tdp_mmu_zap_leafs
 => kvm_tdp_mmu_unmap_gfn_range
 => kvm_unmap_gfn_range
 => kvm_mmu_notifier_invalidate_range_start
 => __mmu_notifier_invalidate_range_start
 => change_p4d_range
 => change_protection
 => change_prot_numa
 => task_numa_work
 => task_work_run
 => exit_to_user_mode_prepare
 => syscall_exit_to_user_mode
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

Whether NUMA balancing should do this, or whether it should be enabled by default, is a separate question,
because there are other reasons that can force KVM to invalidate the secondary mappings and trigger this issue.
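For anyone trying to reproduce or rule this out, here is a small helper that prints the two host-side
knobs this report depends on. The paths are the standard ones (/sys/kernel/mm/lru_gen/enabled for MGLRU,
/proc/sys/kernel/numa_balancing for NUMA balancing); writing 0 to the latter is one way to take NUMA
balancing out of the picture while testing:

/* check_repro_knobs.c: print the MGLRU and NUMA-balancing settings involved here.
 * Read-only; adjust the knobs manually if needed.
 */
#include <stdio.h>

static void show(const char *path)
{
	char buf[128];
	FILE *f = fopen(path, "r");

	if (!f) {
		printf("%s: <not available>\n", path);
		return;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	fclose(f);
}

int main(void)
{
	/* A non-zero value here means MGLRU is enabled (the failing configuration). */
	show("/sys/kernel/mm/lru_gen/enabled");
	/* 1 means NUMA balancing is on; setting it to 0 should avoid the periodic
	 * write-protection that triggered the SPTE zapping shown in the trace above. */
	show("/proc/sys/kernel/numa_balancing");
	return 0;
}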









