On Fri, Sep 23, 2022 at 3:16 AM Maxim Levitsky <mlevitsk@xxxxxxxxxx> wrote:
>
> Hi!
>
> Emanuele Giuseppe Esposito and I were working on trying to understand why
> the access_tracking_perf_test fails when run in a nested guest on Intel,
> and I was finally able to find the root cause.
>
> So the access_tracking_perf_test tests the following:
>
> - It opens /sys/kernel/mm/page_idle/bitmap, which is a special root-only
>   read/writable file that allows a process to set/clear the accessed bit
>   in its page tables. The interface of this file is inverted: it is a
>   bitmap of 'idle' bits, and idle bit set === accessed bit clear.
>
> - It then runs a KVM guest and checks that when the guest accesses its
>   memory (through EPT/NPT), the accessed bits are still updated normally
>   as seen from /sys/kernel/mm/page_idle/bitmap.
>
>   In particular, it first clears the accessed bit using
>   /sys/kernel/mm/page_idle/bitmap, then runs a guest which reads/writes
>   all of its memory, and then checks that the accessed bit is set again
>   by reading /sys/kernel/mm/page_idle/bitmap.
>
>
> Now, since KVM uses its own paging (aka a secondary MMU), mmu notifiers
> are used, in particular:
> - kvm_mmu_notifier_clear_flush_young
> - kvm_mmu_notifier_clear_young
> - kvm_mmu_notifier_test_young
>
> The first two clear the accessed bit in the NPT/EPT, and the third only
> checks its value.
>
> The difference between the first two notifiers is that the first one
> flushes the EPT/NPT TLB and the second one doesn't, and apparently
> /sys/kernel/mm/page_idle/bitmap uses the second one.
>
> This means that on bare metal the TLB might still hold a translation with
> the accessed bit set, and thus the CPU might not set the bit again in the
> PTE when a memory access goes through that translation.
>
> There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy,
> so this seems to be done on purpose.
>
> I would like to hear your opinion on why it was done this way, and
> whether the original reasons for not doing the TLB flush are still valid.
>
> Now, why does the access_tracking_perf_test fail in a nested guest?
> It is because KVM uses shadow paging to shadow the nested EPT, and that
> shadow paging has a "TLB" which is not bounded in size, because it is
> stored in the unsync SPTEs in memory.
>
> Because of this, when the guest clears the accessed bit in its nested EPT
> entries, KVM doesn't notice/intercept it and the corresponding EPT SPTEs
> remain the same; thus the guest's later access to the memory is not
> intercepted and because of this does not turn the accessed bit back on in
> the guest EPT tables.

Does the guest execute an INVEPT after clearing the accessed bit?

From volume 3 of the SDM, section 28.3.5, "Accessed and Dirty Flags for EPT":

> A processor may cache information from the EPT paging-structure entries
> in TLBs and paging-structure caches (see Section 28.4). This fact implies
> that, if software changes an accessed flag or a dirty flag from 1 to 0,
> the processor might not set the corresponding bit in memory on a
> subsequent access using an affected guest-physical address.

> (If a TLB flush were to happen, we would 'sync' the unsync SPTEs by
> zapping them, because we don't keep SPTEs for guest PTEs that have the
> accessed bit clear.)
>
>
> Any comments are welcome!
>
> If you think that the lack of the EPT flush is still the right thing to
> do, I vote again for having at least some form of a blacklist of
> selftests which are expected to fail when run under KVM
> (fix_hypercall_test is the other test I already know of that fails in a
> KVM guest, also without a practical way to fix it).
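
To make sure we are talking about the same thing, the two aging paths above
boil down to roughly the following shape. This is a standalone, heavily
simplified sketch, not the actual virt/kvm/kvm_main.c code; age_secondary_mmu()
and flush_secondary_tlb() are made-up stand-ins for the real EPT/NPT aging and
remote-TLB-flush helpers:

/*
 * Standalone sketch only -- not the real virt/kvm/kvm_main.c code.
 * age_secondary_mmu() and flush_secondary_tlb() are stand-ins for
 * the real EPT/NPT aging and remote-TLB-flush machinery.
 */
#include <stdbool.h>

/* Pretend to clear the accessed bits for a range; report that one was set. */
static bool age_secondary_mmu(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	return true;
}

/* Placeholder for flushing the secondary-MMU TLB. */
static void flush_secondary_tlb(void)
{
}

/* Shape of kvm_mmu_notifier_clear_flush_young(): clear the bits, then flush. */
static bool clear_flush_young(unsigned long start, unsigned long end)
{
	bool young = age_secondary_mmu(start, end);

	if (young)
		flush_secondary_tlb();
	return young;
}

/*
 * Shape of kvm_mmu_notifier_clear_young(), the path page_idle uses:
 * no flush, so a cached A=1 translation can survive the clearing.
 */
static bool clear_young(unsigned long start, unsigned long end)
{
	return age_secondary_mmu(start, end);
}

page_idle deliberately takes the clear_young() path, so anything that caches
an A=1 translation -- the hardware TLB on bare metal, or the unsync shadow
pages backing the nested EPT -- can keep the accessed bit from being set again.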
>
> Best regards,
> Maxim Levitsky
>
> PS: the test doesn't fail on AMD because we sync the nested NPT on each
> nested VM entry, which means that L0 syncs all the page tables.
>
> Also, the test sometimes passes on Intel when an unrelated TLB flush
> syncs the nested EPT.
>
> Not using the new tdp_mmu also 'helps' by letting the test pass much more
> often, but it still fails once in a while, likely because of timing
> and/or a different implementation.
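
For completeness, the userspace flow the test builds on (mark a page idle,
touch it, re-read the bitmap) is roughly the following. This is a minimal
illustration rather than the selftest code; it assumes root privileges and a
4KiB page size, and it skips all error handling:

/*
 * Minimal illustration of the page_idle flow, not the selftest itself.
 * Assumes root privileges and a 4KiB page size; no error handling.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Translate a virtual address to a PFN via /proc/self/pagemap. */
static uint64_t vaddr_to_pfn(void *vaddr)
{
	uint64_t entry;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	pread(fd, &entry, sizeof(entry),
	      ((uint64_t)(uintptr_t)vaddr / 4096) * sizeof(entry));
	close(fd);
	return entry & ((1ULL << 55) - 1);	/* bits 0:54 hold the PFN */
}

int main(void)
{
	int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
	char *page = aligned_alloc(4096, 4096);
	uint64_t pfn, mask, word;

	/* Fault the page in; it must be resident and on the LRU to be tracked. */
	page[0] = 1;
	pfn = vaddr_to_pfn(page);
	mask = 1ULL << (pfn % 64);

	/*
	 * Mark the page idle: this clears the accessed bit in the PTEs
	 * (and, for guest-mapped memory, via the clear_young notifier).
	 */
	word = mask;
	pwrite(fd, &word, sizeof(word), (pfn / 64) * sizeof(word));

	page[0] = 2;			/* access the page again */

	/* If the access was observed, the idle bit reads back as clear. */
	pread(fd, &word, sizeof(word), (pfn / 64) * sizeof(word));
	printf("page is %s\n", (word & mask) ? "still idle" : "no longer idle");

	close(fd);
	return 0;
}

In the selftest, the "access the page again" step happens inside the guest,
through EPT/NPT, which is exactly where the missing flush (or the unsync
shadow EPT) starts to matter.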