On 23/09/2022 12:16, Maxim Levitsky wrote:
> Hi!
>
> Emanuele Giuseppe Esposito and I were working on understanding why the access_tracking_perf_test
> fails when run in a nested guest on Intel, and I was finally able to find the root cause.
>
> The access_tracking_perf_test tests the following:
>
> - It opens /sys/kernel/mm/page_idle/bitmap, a special root read/writable
>   file which allows a process to set/clear the accessed bit in its page tables.
>   The interface of this file is inverted: it is a bitmap of 'idle' bits,
>   where idle bit set == accessed bit clear.
>
> - It then runs a KVM guest and checks that when the guest accesses its memory
>   (through EPT/NPT), the accessed bits are still updated normally as seen from
>   /sys/kernel/mm/page_idle/bitmap.
>
>   In particular, it first clears the accessed bit using /sys/kernel/mm/page_idle/bitmap,
>   then runs a guest which reads/writes all its memory, and then checks that the
>   accessed bit is set again by reading /sys/kernel/mm/page_idle/bitmap.
>
> Now, since KVM uses its own paging (aka a secondary MMU), mmu notifiers are used, in particular:
>
> - kvm_mmu_notifier_clear_flush_young
> - kvm_mmu_notifier_clear_young
> - kvm_mmu_notifier_test_young
>
> The first two clear the accessed bit from NPT/EPT; the third only checks its value.
>
> The difference between the first two notifiers is that the first one flushes the EPT/NPT
> TLB and the second one doesn't, and apparently /sys/kernel/mm/page_idle/bitmap uses the
> second one.
>
> This means that on bare metal the TLB might still have the accessed bit set, and thus the
> CPU might not set it again in the PTE when a memory access goes through that TLB entry.
>
> There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy, so this seems to
> be done on purpose.
>
> I would like to hear your opinion on why it was done this way, and whether the original
> reasons for not doing the TLB flush are still valid.
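For context, the page_idle bitmap described above is an array of little-endian 64-bit words with one bit per page frame (per Documentation/admin-guide/mm/idle_page_tracking.rst). A minimal sketch of how a PFN maps into that file; the helper names here are illustrative, not from the kernel or the selftest:

```python
# Sketch of the /sys/kernel/mm/page_idle/bitmap layout: one bit per PFN,
# packed into 64-bit words. Helper names are illustrative only.

def idle_bitmap_offset(pfn):
    """Byte offset in the bitmap file of the 64-bit word holding this PFN's idle bit."""
    return (pfn // 64) * 8

def idle_bit_mask(pfn):
    """Bit mask for this PFN within its 64-bit word."""
    return 1 << (pfn % 64)

def is_idle(word, pfn):
    """Idle bit still set == the page was NOT accessed since the bit was set."""
    return bool(word & idle_bit_mask(pfn))
```

Marking a page idle means writing a word with that PFN's bit set at the computed offset; reading the word back later tells you whether the page was accessed in between (bit cleared by the kernel) or not (bit still set).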
>
> Now, why does the access_tracking_perf_test fail in a nested guest?
> Because KVM shadow paging is used to shadow the nested EPT, and it has a "TLB" which is
> not bounded in size, because it is stored in the unsync sptes in memory.
>
> Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM
> doesn't notice/intercept it and the corresponding EPT sptes remain the same; thus the
> later guest access to the memory is not intercepted, and therefore doesn't turn the
> accessed bit back on in the guest EPT tables.
>
> (If a TLB flush were to happen, we would 'sync' the unsync sptes by zapping them, because
> we don't keep sptes for gptes with the accessed bit clear.)

As suggested by Paolo, I also tried changing the page_idle.c implementation so that it
calls kvm_mmu_notifier_clear_flush_young instead of its non-flushing counterpart:

diff --git a/mm/page_idle.c b/mm/page_idle.c
index edead6a8a5f9..ffc1b0182534 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -62,10 +62,10 @@ static bool page_idle_clear_pte_refs_one(struct page *page,
 			 * For PTE-mapped THP, one sub page is referenced,
 			 * the whole THP is referenced.
 			 */
-			if (ptep_clear_young_notify(vma, addr, pvmw.pte))
+			if (ptep_clear_flush_young_notify(vma, addr, pvmw.pte))
 				referenced = true;
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-			if (pmdp_clear_young_notify(vma, addr, pvmw.pmd))
+			if (pmdp_clear_flush_young_notify(vma, addr, pvmw.pmd))
 				referenced = true;
 		} else {
 			/* unexpected pmd-mapped page? */

As expected, with the above patch the test no longer fails, proving Maxim's point.

As I understand it, an alternative would be to get rid of the test, or at least to move it
out of KVM?

Thank you,
Emanuele

>
> Any comments are welcome!
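As a side note, to index the idle bitmap by PFN in the first place, userspace has to translate a virtual address to a PFN via /proc/&lt;pid&gt;/pagemap. A hedged sketch of decoding one 64-bit pagemap entry, per Documentation/admin-guide/mm/pagemap.rst (the helper name is my own):

```python
# Decode a 64-bit /proc/<pid>/pagemap entry (format per pagemap.rst):
# bits 0-54 hold the PFN (only meaningful when the page is present),
# bit 63 is the "page present" flag. Helper name is illustrative only.

PFN_MASK = (1 << 55) - 1
PRESENT_BIT = 1 << 63

def pagemap_pfn(entry):
    """Return the PFN for a present page, or None if the page is not present."""
    if not (entry & PRESENT_BIT):
        return None
    return entry & PFN_MASK
```

With the PFN in hand, the bit for that page in /sys/kernel/mm/page_idle/bitmap can be located and read or written.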
>
> If you think that the lack of the EPT flush is still the right thing to do,
> I vote again for having at least some form of a blacklist of selftests which
> are expected to fail when run under KVM (fix_hypercall_test is the other test
> I already know of that fails in a KVM guest, also without a practical way to fix it).
>
> Best regards,
> 	Maxim Levitsky
>
> PS: the test doesn't fail on AMD because we sync the nested NPT on each nested VM entry,
> which means that L0 syncs all the page tables.
>
> Also, the test sometimes passes on Intel when an unrelated TLB flush syncs the nested EPT.
>
> Not using the new tdp_mmu also 'helps' by letting the test pass much more often, but it
> still fails once in a while, likely because of timing and/or a different implementation.