On Wed, 2024-02-21 at 17:26 -0800, Sean Christopherson wrote:
> Retry page faults without acquiring mmu_lock, and without even faulting
> the page into the primary MMU, if the resolved gfn is covered by an active
> invalidation.  Contending for mmu_lock is especially problematic on
> preemptible kernels as the mmu_notifier invalidation task will yield
> mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and
> ultimately increase the latency of resolving the page fault.  And in the
> worst case scenario, yielding will be accompanied by a remote TLB flush,
> e.g. if the invalidation covers a large range of memory and vCPUs are
> accessing addresses that were already zapped.
>
> Faulting the page into the primary MMU is similarly problematic, as doing
> so may acquire locks that need to be taken for the invalidation to
> complete (the primary MMU has finer grained locks than KVM's MMU), and/or
> may cause unnecessary churn (getting/putting pages, marking them accessed,
> etc).
>
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
>
> Add a dedicated lockless version of the range-based retry check to avoid
> false positives on the sanity check on start+end WARN, and so that it's
> super obvious that checking for a racing invalidation without holding
> mmu_lock is unsafe (though obviously useful).
>
> Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
> invalidation in a loop won't put KVM into an infinite loop, e.g. due to
> caching the in-progress flag and never seeing it go to '0'.
>
> Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> necessary to avoid an infinite loop, as doing so improves the probability
> that KVM will detect an invalidation that already completed before
> acquiring mmu_lock and bailing anyways.
>
> Do the pre-check even for non-preemptible kernels, as waiting to detect
> the invalidation until mmu_lock is held guarantees the vCPU will observe
> the worst case latency in terms of handling the fault, and can generate
> even more mmu_lock contention.  E.g. the vCPU will acquire mmu_lock,
> detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
> eventually re-acquire mmu_lock.  This behavior is also why there are no
> new starvation issues due to losing the fairness guarantees provided by
> rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
> on mmu_lock doesn't guarantee forward progress in the face of _another_
> mmu_notifier invalidation event.
>
> Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> may generate a load into a register instead of doing a direct comparison
> (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> is a few bytes of code and maaaaybe a cycle or three.
>
> Reported-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@xxxxxxxxxxxxxxxxxxxxxxxxx
> Reported-by: Friedrich Weber <f.weber@xxxxxxxxxxx>
> Cc: Kai Huang <kai.huang@xxxxxxxxx>
> Cc: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Cc: Yuan Yao <yuan.yao@xxxxxxxxxxxxxxx>
> Cc: Xu Yilun <yilun.xu@xxxxxxxxxxxxxxx>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---

Acked-by: Kai Huang <kai.huang@xxxxxxxxx>
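
For anyone skimming the thread, a minimal sketch of what the dedicated
lockless check could look like, built on the existing mmu_invalidate_*
fields in struct kvm (include/linux/kvm_host.h).  The helper name and
the caller snippet at the end are illustrative, not copied from the
patch:

/*
 * Lockless, range-based retry check.  Only usable as an optimistic
 * pre-check; the authoritative check still happens under mmu_lock.
 */
static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
						   unsigned long mmu_seq,
						   gfn_t gfn)
{
	/*
	 * Force a reload of the in-progress flag so that calling this in a
	 * loop can't spin forever on a stale, cached non-zero value.  Stale
	 * start/end values are tolerable; observing the 1=>0 transition of
	 * in-progress is what guarantees forward progress.
	 */
	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
	    gfn >= kvm->mmu_invalidate_range_start &&
	    gfn < kvm->mmu_invalidate_range_end)
		return true;

	/*
	 * Also force a load of the sequence count to improve the odds of
	 * detecting an invalidation that already completed, rather than
	 * taking mmu_lock only to immediately bail.
	 */
	return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
}

A fault handler would then call this before faulting the page into the
primary MMU and before contending for mmu_lock, e.g. roughly:

	if (fault->slot &&
	    mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
		return RET_PF_RETRY;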