Re: [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs"

On Fri, Aug 16, 2024 at 3:47 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Aug 05, 2024, David Matlack wrote:
> > This reverts commit 85f44f8cc07b5f61bef30fe5343d629fd4263230.
> >
> > Bring back the logic that walks down to leafs when zapping collapsible
> > SPTEs. Stepping down to leafs is technically unnecessary when zapping,
> > but the leaf SPTE will be used in a subsequent commit to construct a
> > huge SPTE and recover the huge mapping in place.
>
> Please no, getting rid of the step-up code made me so happy. :-D
>
> It's also suboptimal, e.g. in the worst case scenario (which is comically
> unlikely, but theoretically possible), if the first present leaf SPTE in a 1GiB
> region is the last SPTE in the last 2MiB range, and all non-leaf SPTEs are
> somehow present (this is the super unlikely part), then KVM will read 512*512
> SPTEs before encountering a shadow-present leaf SPTE.

Haha, I didn't consider that case. That seems extremely unlikely. And
even if it did happen, KVM is holding mmu_lock for read and is able to
drop the lock and yield. So I don't see a strong need to structure the
approach around that edge case.

>
> The proposed approach will also ignore nx_huge_page_disallowed, and just always
> create a huge NX page.  On the plus side, "free" NX hugepage recovery!  The
> downside is that it means KVM is pretty much guaranteed to force the guest to
> re-fault all of its code pages, and zap a non-trivial number of huge pages that
> were just created.  IIRC, we deliberately did that for the zapping case, e.g. to
> use the opportunity to recover NX huge pages, but zap+create+zap+create is a bit
> different than zap+create (if the guest is still using the region for code).

I'm ok with skipping nx_huge_page_disallowed pages during disable-dirty-log.

But why is recovering in-place worse/different than zapping? They
both incur the same TLB flushes. And recovering might even result in
fewer vCPU faults, since vCPU faults use tdp_mmu_split_huge_page() to
install a fully populated lower-level page table (vs. faulting on a
zapped mapping, which will just install a single 4K SPTE).

>
> So rather than looking for a present leaf SPTE, what about "stopping" as soon as
> KVM find a SP that can be replaced with a huge SPTE, pre-checking
> nx_huge_page_disallowed, and invoking kvm_mmu_do_page_fault() to install a new
> SPTE?  Or maybe even use kvm_tdp_map_page()?  Though it might be more work to
> massage kvm_tdp_map_page() into a usable form.

My intuition is that going through the full page fault flow would be
more expensive than just stepping down+up in 99.99999% of cases, and
would require more code churn. So it doesn't seem like a win?

>
> By virtue of there being a present non-leaf SPTE, KVM knows the guest accessed
> the region at some point.  And now that MTRR virtualization is gone, the only time
> KVM zaps _only_ leafs is for mmu_notifiers and the APIC access page, i.e. the
> odds of stepping down NOT finding a present SPTE somewhere in the region is very
> small.  Lastly, kvm_mmu_max_mapping_level() has verified there is a valid mapping
> in the primary MMU, else the max level would be PG_LEVEL_4K.  So the odds of
> getting a false positive and effectively pre-faulting memory the guest isn't using
> are quite small.




