On Fri, Sep 23, 2022 at 01:16:04PM +0300, Maxim Levitsky wrote:
> Hi!
>
> Emanuele Giuseppe Esposito and I were working on trying to understand why
> the access_tracking_perf_test fails when run in a nested guest on Intel,
> and I was finally able to find the root cause.
>
> So the access_tracking_perf_test tests the following:
>
> - It opens /sys/kernel/mm/page_idle/bitmap, which is a special root-only
>   read/writable file that allows a process to set/clear the accessed bit
>   in its page tables. The interface of this file is inverted: it is a
>   bitmap of 'idle' bits, where idle bit set == accessed bit clear.
>
> - It then runs a KVM guest and checks that when the guest accesses its
>   memory (through EPT/NPT), the accessed bits are still updated normally
>   as seen from /sys/kernel/mm/page_idle/bitmap.
>
>   In particular, it first clears the accessed bit using
>   /sys/kernel/mm/page_idle/bitmap, then runs a guest which reads/writes
>   all its memory, and then checks that the accessed bit is set again by
>   reading /sys/kernel/mm/page_idle/bitmap.
>
> Now, since KVM uses its own paging (aka a secondary MMU), mmu notifiers
> are used, in particular:
>
> - kvm_mmu_notifier_clear_flush_young
> - kvm_mmu_notifier_clear_young
> - kvm_mmu_notifier_test_young
>
> The first two clear the accessed bit from NPT/EPT, and the third only
> checks its value.
>
> The difference between the first two notifiers is that the first one
> flushes the EPT/NPT TLB and the second one doesn't, and apparently
> /sys/kernel/mm/page_idle/bitmap uses the second one.
>
> This means that on bare metal, the TLB might still have the accessed bit
> set, and thus a memory access done through it might not set the accessed
> bit again in the PTE.
>
> There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy,
> so this seems to be done on purpose.
>
> I would like to hear your opinion on why it was done this way, and
> whether the original reasons for not doing the TLB flush are still valid.
>
> Now why does the access_tracking_perf_test fail in a nested guest?
> It is because KVM shadow paging, which is used to shadow the nested EPT,
> has a "TLB" which is not bounded in size, because it is stored in the
> unsync sptes in memory.
>
> Because of this, when the guest clears the accessed bit in its nested
> EPT entries, KVM doesn't notice/intercept it and the corresponding EPT
> sptes remain the same; thus the guest's later access to the memory is
> not intercepted and therefore doesn't turn the accessed bit back on in
> the guest EPT tables.
>
> (If a TLB flush were to happen, we would 'sync' the unsync sptes by
> zapping them, because we don't keep sptes for gptes with no accessed
> bit.)
>
> Any comments are welcome!
>
> If you think that the lack of the EPT flush is still the right thing to
> do, I vote again to have at least some form of a blacklist of selftests
> which are expected to fail when run under KVM (fix_hypercall_test is the
> other test I already know of that fails in a KVM guest, also without a
> practical way to fix it).

Nice find. I don't recommend changing page_idle just for this test.

I added this test to evaluate the performance of KVM's access tracking
fault handling, e.g. for when eptad=N. page_idle just happens to be the
only userspace mechanism available today to exercise access tracking.
But it has serious downsides, as you discovered, and they are documented
at the top of the test:

/*
 ...
 * Note that a deterministic correctness test of access tracking is not possible
 * by using page_idle as it exists today. This is for a few reasons:
 *
 * 1. page_idle only issues clear_young notifiers, which lack a TLB flush. This
 *    means subsequent guest accesses are not guaranteed to see page table
 *    updates made by KVM until some time in the future.
 *
 * 2. page_idle only operates on LRU pages. Newly allocated pages are not
 *    immediately allocated to LRU lists. Instead they are held in a "pagevec",
 *    which is drained to LRU lists some time in the future.
 *    There is no userspace API to force this drain to occur.
 *
 * These limitations are worked around in this test by using a large enough
 * region of memory for each vCPU such that the number of translations cached in
 * the TLB and the number of pages held in pagevecs are a small fraction of the
 * overall workload. And if either of those conditions are not true this test
 * will fail rather than silently passing.
 ...
 */

When I wrote the test, I did not realize that nested effectively has an
unlimited TLB, since shadow pages can just be left unsync. So the comment
above does not hold for nested.

My recommendation for moving forward would be to get rid of this
TEST_ASSERT():

	TEST_ASSERT(still_idle < pages / 10,
		    "vCPU%d: Too many pages still idle (%"PRIu64
		    " out of %" PRIu64 ").\n",
		    vcpu_idx, still_idle, pages);

and instead just print a warning message telling the user that memory is
not being marked idle and that this will affect the performance results
(not as many accesses will actually go through access tracking). This
will stop the test from failing.

Long term, it would be great to switch to a more deterministic userspace
mechanism to trigger access tracking. My understanding is that the new
multi-gen LRU, which is slated for 6.1 or 6.2, might provide a better
option.