On 23/09/2022 12:16, Maxim Levitsky wrote:
> Hi!
>
> Emanuele Giuseppe Esposito and I were working on understanding why the access_tracking_perf_test
> fails when run in a nested guest on Intel, and I was finally able to find the root cause.
>
> The access_tracking_perf_test tests the following:
>
> - It opens /sys/kernel/mm/page_idle/bitmap, a special root read/writable
>   file which allows a process to set/clear the accessed bit in its page tables.
>   The interface of this file is inverted: it is a bitmap of 'idle' bits,
>   where idle bit set == accessed bit clear.
>
> - It then runs a KVM guest and checks that when the guest accesses its memory
>   (through EPT/NPT), the accessed bits are still updated normally as seen from
>   /sys/kernel/mm/page_idle/bitmap.
>
>   In particular, it first clears the accessed bit using /sys/kernel/mm/page_idle/bitmap,
>   then runs a guest which reads/writes all its memory, and then checks that the
>   accessed bit is set again by reading /sys/kernel/mm/page_idle/bitmap.
>
> Now, since KVM uses its own paging (aka a secondary MMU), mmu notifiers are used, in particular:
>
> - kvm_mmu_notifier_clear_flush_young
> - kvm_mmu_notifier_clear_young
> - kvm_mmu_notifier_test_young
>
> The first two clear the accessed bit from NPT/EPT; the third only checks its value.
>
> The difference between the first two notifiers is that the first one flushes the EPT/NPT
> TLB and the second one doesn't, and apparently /sys/kernel/mm/page_idle/bitmap uses the
> second one.
>
> This means that on bare metal the TLB might still have the accessed bit set, and thus the
> CPU might not set it again in the PTE when a memory access goes through that TLB entry.
>
> There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy, so this seems to
> be done on purpose.
>
> I would like to hear your opinion on why it was done this way, and whether the original
> reasons for not doing the TLB flush are still valid.
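For context, the page_idle bitmap described above is an array of little-endian 64-bit words with one bit per page frame (per Documentation/admin-guide/mm/idle_page_tracking.rst). A minimal sketch of how a PFN maps into that file; the helper names here are illustrative, not from the kernel or the selftest:

```python
# Sketch of the /sys/kernel/mm/page_idle/bitmap layout: one bit per PFN,
# packed into 64-bit words. Helper names are illustrative only.

def idle_bitmap_offset(pfn):
    """Byte offset in the bitmap file of the 64-bit word holding this PFN's idle bit."""
    return (pfn // 64) * 8

def idle_bit_mask(pfn):
    """Bit mask for this PFN within its 64-bit word."""
    return 1 << (pfn % 64)

def is_idle(word, pfn):
    """Idle bit still set == the page was NOT accessed since the bit was set."""
    return bool(word & idle_bit_mask(pfn))
```

Marking a page idle means writing a word with that PFN's bit set at the computed offset; reading the word back later tells you whether the page was accessed in between (bit cleared by the kernel) or not (bit still set).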
>
> Now, why does the access_tracking_perf_test fail in a nested guest?
> Because KVM shadow paging is used to shadow the nested EPT, and it has a "TLB" which is
> not bounded in size, because it is stored in the unsync sptes in memory.
>
> Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM
> doesn't notice/intercept it and the corresponding EPT sptes remain the same; thus the
> later guest access to the memory is not intercepted, and therefore doesn't turn the
> accessed bit back on in the guest EPT tables.
>
> (If a TLB flush were to happen, we would 'sync' the unsync sptes by zapping them, because
> we don't keep sptes for gptes with the accessed bit clear.)

As suggested by Paolo, I also tried changing the page_idle.c implementation so that it
calls kvm_mmu_notifier_clear_flush_young instead of its non-flushing counterpart:

diff --git a/mm/page_idle.c b/mm/page_idle.c
index edead6a8a5f9..ffc1b0182534 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -62,10 +62,10 @@ static bool page_idle_clear_pte_refs_one(struct page *page,
 			 * For PTE-mapped THP, one sub page is referenced,
 			 * the whole THP is referenced.
 			 */
-			if (ptep_clear_young_notify(vma, addr, pvmw.pte))
+			if (ptep_clear_flush_young_notify(vma, addr, pvmw.pte))
 				referenced = true;
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-			if (pmdp_clear_young_notify(vma, addr, pvmw.pmd))
+			if (pmdp_clear_flush_young_notify(vma, addr, pvmw.pmd))
 				referenced = true;
 		} else {
 			/* unexpected pmd-mapped page? */

As expected, with the above patch the test no longer fails, proving Maxim's point.

As I understand it, an alternative would be to get rid of the test, or at least to move it
out of KVM?

Thank you,
Emanuele

>
> Any comments are welcome!
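As a side note, to index the idle bitmap by PFN in the first place, userspace has to translate a virtual address to a PFN via /proc/&lt;pid&gt;/pagemap. A hedged sketch of decoding one 64-bit pagemap entry, per Documentation/admin-guide/mm/pagemap.rst (the helper name is my own):

```python
# Decode a 64-bit /proc/<pid>/pagemap entry (format per pagemap.rst):
# bits 0-54 hold the PFN (only meaningful when the page is present),
# bit 63 is the "page present" flag. Helper name is illustrative only.

PFN_MASK = (1 << 55) - 1
PRESENT_BIT = 1 << 63

def pagemap_pfn(entry):
    """Return the PFN for a present page, or None if the page is not present."""
    if not (entry & PRESENT_BIT):
        return None
    return entry & PFN_MASK
```

With the PFN in hand, the bit for that page in /sys/kernel/mm/page_idle/bitmap can be located and read or written.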
>
> If you think that the lack of the EPT flush is still the right thing to do,
> I vote again for having at least some form of a blacklist of selftests which
> are expected to fail when run under KVM (fix_hypercall_test is the other test
> I already know of that fails in a KVM guest, also without a practical way to fix it).
>
> Best regards,
> 	Maxim Levitsky
>
> PS: the test doesn't fail on AMD because we sync the nested NPT on each nested VM entry,
> which means that L0 syncs all the page tables.
>
> Also, the test sometimes passes on Intel when an unrelated TLB flush syncs the nested EPT.
>
> Not using the new tdp_mmu also 'helps' by letting the test pass much more often, but it
> still fails once in a while, likely because of timing and/or a different implementation.