Re: [PATCH 1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs"

Sean Christopherson <seanjc@xxxxxxxxxx> · Thu, 22 Aug 2024 08:10:05 -0700

On Fri, Aug 16, 2024, David Matlack wrote:
> On Fri, Aug 16, 2024 at 3:47 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > The proposed approach will also ignore nx_huge_page_disallowed, and just always
> > create a huge NX page.  On the plus side, "free" NX hugepage recovery!  The
> > downside is that it means KVM is pretty much guaranteed to force the guest to
> > re-fault all of its code pages, and zap a non-trivial number of huge pages that
> > were just created.  IIRC, we deliberately did that for the zapping case, e.g. to
> > use the opportunity to recover NX huge pages, but zap+create+zap+create is a bit
> > different than zap+create (if the guest is still using the region for code).
> 
> I'm ok with skipping nx_huge_page_disallowed pages during disable-dirty-log.
> 
> But why is recovering in-place is worse/different than zapping? They
> both incur the same TLB flushes. And recovering might even result in
> less vCPU faults, since vCPU faults use tdp_mmu_split_huge_page() to
> install a fully populated lower level page table (vs faulting on a
> zapped mapping will just install a 4K SPTE).

Doh, never mind, I was thinking zapping collapsible SPTEs zapped leafs to induce
faults, but it does the opposite and zaps at the level KVM thinks can be huge.

> > So rather than looking for a present leaf SPTE, what about "stopping" as soon as
> > KVM find a SP that can be replaced with a huge SPTE, pre-checking
> > nx_huge_page_disallowed, and invoking kvm_mmu_do_page_fault() to install a new
> > SPTE?  Or maybe even use kvm_tdp_map_page()?  Though it might be more work to
> > massage kvm_tdp_map_page() into a usable form.
> 
> My intuition is that going through the full page fault flow would be
> more expensive than just stepping down+up in 99.99999% of cases.

Hmm, yeah, doing the full fault flow isn't the right place to hook in.  Ugh, right,
and it's the whole problem with not having a vCPU for tdp_mmu_map_handle_target_level().
But that's solvable as it's really just is_rsvd_spte(), which I would be a-ok
skipping.  Ah, no, host_writable is also problematic.  Blech.

That's solvable too, e.g. host_pfn_mapping_level() could get the protection, but
that would require checking for an ongoing mmu_notifier invalidation.  So again,
probably not worth it.  Double blech.

> And will require more code churn.

I'm not terribly concerned with code churn.  I care much more about long-term
maintenance, and especially about having multiple ways of doing the same thing
(installing a shadow-present leaf SPTE).  But I agree that trying to remedy that
last point (similar flows) is probably a fool's errand in this case, as creating
a new SPTE from scratch really is different than up-leveling an existing SPTE.

I still have concerns about the step-up code, but I'll respond to those in the
context of the patch I think is problematic.