On Mon, Aug 05, 2024, David Matlack wrote:
> This reverts commit 85f44f8cc07b5f61bef30fe5343d629fd4263230.
>
> Bring back the logic that walks down to leafs when zapping collapsible
> SPTEs. Stepping down to leafs is technically unnecessary when zapping,
> but the leaf SPTE will be used in a subsequent commit to construct a
> huge SPTE and recover the huge mapping in place.

Please no, getting rid of the step-up code made me so happy. :-D

It's also suboptimal, e.g. in the worst case scenario (which is
comically unlikely, but theoretically possible), if the first present
leaf SPTE in a 1GiB region is the last SPTE in the last 2MiB range, and
all non-leaf SPTEs are somehow present (this is the super unlikely
part), then KVM will read 512*512 SPTEs before encountering a
shadow-present leaf SPTE.

The proposed approach will also ignore nx_huge_page_disallowed, and
just always create a huge NX page. On the plus side, "free" NX hugepage
recovery! The downside is that it means KVM is pretty much guaranteed
to force the guest to re-fault all of its code pages, and zap a
non-trivial number of huge pages that were just created. IIRC, we
deliberately did that for the zapping case, e.g. to use the opportunity
to recover NX huge pages, but zap+create+zap+create is a bit different
than zap+create (if the guest is still using the region for code).

So rather than looking for a present leaf SPTE, what about "stopping"
as soon as KVM finds a SP that can be replaced with a huge SPTE,
pre-checking nx_huge_page_disallowed, and invoking
kvm_mmu_do_page_fault() to install a new SPTE? Or maybe even use
kvm_tdp_map_page()? Though it might be more work to massage
kvm_tdp_map_page() into a usable form.

By virtue of there being a present non-leaf SPTE, KVM knows the guest
accessed the region at some point. And now that MTRR virtualization is
gone, the only time KVM zaps _only_ leafs is for mmu_notifiers and the
APIC access page, i.e.
the odds of stepping down and NOT finding a present SPTE somewhere in
the region are very small.

Lastly, kvm_mmu_max_mapping_level() has verified there is a valid
mapping in the primary MMU, else the max level would be PG_LEVEL_4K.
So the odds of getting a false positive and effectively pre-faulting
memory the guest isn't using are quite small.