On Fri, Sep 23, 2022 at 3:16 AM Maxim Levitsky <mlevitsk@xxxxxxxxxx> wrote:
>
> Hi!
>
> Emanuele Giuseppe Esposito and I were working on trying to understand why
> the access_tracking_perf_test fails when run in a nested guest on Intel,
> and I was finally able to find the root cause.
>
> So the access_tracking_perf_test tests the following:
>
> - It opens /sys/kernel/mm/page_idle/bitmap, which is a special root-only
>   read/writable file that allows a process to set/clear the accessed bit
>   in its page tables. The interface of this file is inverted: it is a
>   bitmap of 'idle' bits, and idle bit set === accessed bit clear.
>
> - It then runs a KVM guest and checks that when the guest accesses its
>   memory (through EPT/NPT), the accessed bits are still updated normally
>   as seen from /sys/kernel/mm/page_idle/bitmap.
>
>   In particular, it first clears the accessed bit using
>   /sys/kernel/mm/page_idle/bitmap, then runs a guest which reads/writes
>   all of its memory, and then checks that the accessed bit is set again
>   by reading /sys/kernel/mm/page_idle/bitmap.
>
>
> Now, since KVM uses its own paging (aka a secondary MMU), mmu notifiers
> are used, in particular:
> - kvm_mmu_notifier_clear_flush_young
> - kvm_mmu_notifier_clear_young
> - kvm_mmu_notifier_test_young
>
> The first two clear the accessed bit in the NPT/EPT, and the third only
> checks its value.
>
> The difference between the first two notifiers is that the first one
> flushes the EPT/NPT TLB and the second one doesn't, and apparently
> /sys/kernel/mm/page_idle/bitmap uses the second one.
>
> This means that on bare metal the TLB might still hold a translation with
> the accessed bit set, and thus the CPU might not set the bit again in the
> PTE when a memory access goes through that translation.
>
> There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy,
> so this seems to be done on purpose.
>
> I would like to hear your opinion on why it was done this way, and
> whether the original reasons for not doing the TLB flush are still valid.
>
> Now, why does the access_tracking_perf_test fail in a nested guest?
> It is because KVM uses shadow paging to shadow the nested EPT, and that
> shadow paging has a "TLB" which is not bounded in size, because it is
> stored in the unsync SPTEs in memory.
>
> Because of this, when the guest clears the accessed bit in its nested EPT
> entries, KVM doesn't notice/intercept it and the corresponding EPT SPTEs
> remain the same; thus the guest's later access to the memory is not
> intercepted and because of this does not turn the accessed bit back on in
> the guest EPT tables.

Does the guest execute an INVEPT after clearing the accessed bit?

From volume 3 of the SDM, section 28.3.5, "Accessed and Dirty Flags for EPT":

> A processor may cache information from the EPT paging-structure entries
> in TLBs and paging-structure caches (see Section 28.4). This fact implies
> that, if software changes an accessed flag or a dirty flag from 1 to 0,
> the processor might not set the corresponding bit in memory on a
> subsequent access using an affected guest-physical address.

> (If a TLB flush were to happen, we would 'sync' the unsync SPTEs by
> zapping them, because we don't keep SPTEs for guest PTEs that have the
> accessed bit clear.)
>
>
> Any comments are welcome!
>
> If you think that the lack of the EPT flush is still the right thing to
> do, I vote again for having at least some form of a blacklist of
> selftests which are expected to fail when run under KVM
> (fix_hypercall_test is the other test I already know of that fails in a
> KVM guest, also without a practical way to fix it).
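
To make sure we are talking about the same thing, the two aging paths above
boil down to roughly the following shape. This is a standalone, heavily
simplified sketch, not the actual virt/kvm/kvm_main.c code; age_secondary_mmu()
and flush_secondary_tlb() are made-up stand-ins for the real EPT/NPT aging and
remote-TLB-flush helpers:

/*
 * Standalone sketch only -- not the real virt/kvm/kvm_main.c code.
 * age_secondary_mmu() and flush_secondary_tlb() are stand-ins for
 * the real EPT/NPT aging and remote-TLB-flush machinery.
 */
#include <stdbool.h>

/* Pretend to clear the accessed bits for a range; report that one was set. */
static bool age_secondary_mmu(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	return true;
}

/* Placeholder for flushing the secondary-MMU TLB. */
static void flush_secondary_tlb(void)
{
}

/* Shape of kvm_mmu_notifier_clear_flush_young(): clear the bits, then flush. */
static bool clear_flush_young(unsigned long start, unsigned long end)
{
	bool young = age_secondary_mmu(start, end);

	if (young)
		flush_secondary_tlb();
	return young;
}

/*
 * Shape of kvm_mmu_notifier_clear_young(), the path page_idle uses:
 * no flush, so a cached A=1 translation can survive the clearing.
 */
static bool clear_young(unsigned long start, unsigned long end)
{
	return age_secondary_mmu(start, end);
}

page_idle deliberately takes the clear_young() path, so anything that caches
an A=1 translation -- the hardware TLB on bare metal, or the unsync shadow
pages backing the nested EPT -- can keep the accessed bit from being set again.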
>
> Best regards,
> Maxim Levitsky
>
> PS: the test doesn't fail on AMD because we sync the nested NPT on each
> nested VM entry, which means that L0 syncs all the page tables.
>
> Also, the test sometimes passes on Intel when an unrelated TLB flush
> syncs the nested EPT.
>
> Not using the new tdp_mmu also 'helps' by letting the test pass much more
> often, but it still fails once in a while, likely because of timing
> and/or a different implementation.
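
For completeness, the userspace flow the test builds on (mark a page idle,
touch it, re-read the bitmap) is roughly the following. This is a minimal
illustration rather than the selftest code; it assumes root privileges and a
4KiB page size, and it skips all error handling:

/*
 * Minimal illustration of the page_idle flow, not the selftest itself.
 * Assumes root privileges and a 4KiB page size; no error handling.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Translate a virtual address to a PFN via /proc/self/pagemap. */
static uint64_t vaddr_to_pfn(void *vaddr)
{
	uint64_t entry;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	pread(fd, &entry, sizeof(entry),
	      ((uint64_t)(uintptr_t)vaddr / 4096) * sizeof(entry));
	close(fd);
	return entry & ((1ULL << 55) - 1);	/* bits 0:54 hold the PFN */
}

int main(void)
{
	int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
	char *page = aligned_alloc(4096, 4096);
	uint64_t pfn, mask, word;

	/* Fault the page in; it must be resident and on the LRU to be tracked. */
	page[0] = 1;
	pfn = vaddr_to_pfn(page);
	mask = 1ULL << (pfn % 64);

	/*
	 * Mark the page idle: this clears the accessed bit in the PTEs
	 * (and, for guest-mapped memory, via the clear_young notifier).
	 */
	word = mask;
	pwrite(fd, &word, sizeof(word), (pfn / 64) * sizeof(word));

	page[0] = 2;			/* access the page again */

	/* If the access was observed, the idle bit reads back as clear. */
	pread(fd, &word, sizeof(word), (pfn / 64) * sizeof(word));
	printf("page is %s\n", (word & mask) ? "still idle" : "no longer idle");

	close(fd);
	return 0;
}

In the selftest, the "access the page again" step happens inside the guest,
through EPT/NPT, which is exactly where the missing flush (or the unsync
shadow EPT) starts to matter.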