Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock

Ben Gardon <bgardon@xxxxxxxxxx> · Tue, 26 Jan 2021 13:47:36 -0800

On Tue, Jan 26, 2021 at 6:28 AM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 21/01/21 22:32, Sean Christopherson wrote:
> > Coming back to this series, I wonder if the RCU approach is truly necessary to
> > get the desired scalability.  If both zap_collapsible_sptes() and NX huge page
> > recovery zap_only_  leaf SPTEs, then the only path that can actually unlink a
> > shadow page while holding the lock for read is the page fault path that installs
> > a huge page over an existing shadow page.
> >
> > Assuming the above analysis is correct, I think it's worth exploring alternatives
> > to using RCU to defer freeing the SP memory, e.g. promoting to a write lock in
> > the specific case of overwriting a SP (though that may not exist for rwlocks),
> > or maybe something entirely different?
>
> You can do the deferred freeing with a short write-side critical section
> to ensure all readers have terminated.
>
> If the bool argument to handle_disconnected_tdp_mmu_page is true(*), the
> pages would be added to an llist, instead of being freed immediately.
> At the end of a shared critical section you would do
>
>         if (!llist_empty(&kvm->arch.tdp_mmu_disconnected_pages)) {
>                 struct llist_node *first;
>                 kvm_mmu_lock(kvm);
>                 first = __list_del_all(&kvm->arch.tdp_mmu_disconnected_pages);
>                 kvm_mmu_unlock(kvm);
>
>                 /*
>                  * All vCPUs have already stopped using the pages when
>                  * their TLBs were flushed.  The exclusive critical
>                  * section above means that there can be no readers
>                  * either.
>                  */
>                 tdp_mmu_free_disconnected_pages(first);
>         }
>
> So this is still deferred reclamation, but it's done by one of the vCPUs
> rather than a worker RCU thread.  This would replace patches 11/12/13
> and probably would be implemented after patch 18.

While I agree that this would work, it could be a major performance
bottleneck as it could result in the MMU lock being acquired in read
mode by a VCPU thread handling a page fault. Even though the critical
section is very short it still has to serialize with the potentially
many overlapping page fault handlers which want the MMU read lock. In
order to perform well with hundreds of vCPUs, the vCPU threads really
cannot be acquiring the MMU lock in write mode. The MMU lock above
could be replaced with the TDP MMU pages lock, but that still adds
serialization where it's not really necessary.
The use of RCU also provides a nice separation of concerns, freeing
the various functions which need to remove pages from the paging
structure from having to follow up on freeing them later.

>
> Paolo
>
> (*) this idea is what prompted the comment about s/atomic/shared/
>