Re: [PATCH] KVM: x86/mmu: Process atomically-zapped SPTEs after replacing REMOVED_SPTE

Sean Christopherson <seanjc@xxxxxxxxxx> · Tue, 9 Apr 2024 16:31:17 -0700

On Thu, Mar 07, 2024, David Matlack wrote:
> Process SPTEs zapped under the read-lock after the TLB flush and
> replacement of REMOVED_SPTE with 0. This minimizes the contention on the
> child SPTEs (if zapping an SPTE that points to a page table) and
> minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE.

I think it's worth explicitly calling out the impact of each part of the change,
i.e. that flushing TLBs avoids child SPTE contention, and waiting until the SPTE
is back to '0' avoids blocking vCPUs.  That's kinda sorta implied by the ordering,
but I'm guessing it won't be super obvious to folks that aren't already familiar
with the TDP MMU.

> In VMs with a large (400+) vCPUs, it can take KVM multiple seconds to

large (400+) "number of" vCPUs

> process a 1GiB region mapped with 4KiB entries, e.g. when disabling
> dirty logging in a VM backed by 1GiB HugeTLB. During those seconds if a
> vCPU accesses the 1GiB region being zapped it will be stalled until KVM
> finishes processing the SPTE and replaces the REMOVED_SPTE with 0.

...

> KVM's processing of collapsible SPTEs is still extremely slow and can be
> improved. For example, a significant amount of time is spent calling
> kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, which is
> redundant when processing SPTEs that all map the folio.

"same folio"

I also massaged this to make it clear that this is a "hey, by the way" sorta
thing, and that _this_ patch is valuable even if/when we go clean up the A/D
performance mess.

> +	 * This should be safe because KVM does not depend on any of the

Don't hedge.  If we're wrong and it's not safe, then there will be revert with
a changelog explaining exactly why we're wrong.  But in the (hopefully) likely
case that we're correct, hedging only confuses future readers, e.g. unnecessarily
makes them wonder if maaaaybe this code isn't actually safe.

> +	 * processing completing before a new SPTE is installed to map a given
> +	 * GFN. Case in point, kvm_mmu_zap_all_fast() can result in KVM
> +	 * processing all SPTEs in a given root after vCPUs create mappings in
> +	 * a new root.

This belongs in the changelog, as it references a very specific point in time.
It's extremely unlikely, but if kvm_mmu_zap_all_fast()'s behavior changed then
we'd end up with a stale comment that doesn't actually provide much value to the
code it's commenting.

I'll fix up all of the above when applying (it's basically just me obsessing over
the changelog).