Re: [PATCH 19/28] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery

Ben Gardon <bgardon@xxxxxxxxxx> · Tue, 23 Nov 2021 11:35:28 -0800

On Mon, Nov 22, 2021 at 5:16 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Nov 22, 2021, Ben Gardon wrote:
> > On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > @@ -755,6 +759,26 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> > >         return false;
> > >  }
> > >
> > > +bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > > +{
> > > +       u64 old_spte;
> > > +
> > > +       rcu_read_lock();
> > > +
> > > +       old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
> > > +       if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
> > > +               rcu_read_unlock();
> > > +               return false;
> > > +       }
> > > +
> > > +       __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> > > +                          sp->gfn, sp->role.level + 1, true, true);
> > > +
> > > +       rcu_read_unlock();
> > > +
> > > +       return true;
> > > +}
> > > +
> >
> > Ooooh this makes me really nervous. There are a lot of gotchas to
> > modifying SPTEs in a new context without traversing the paging
> > structure like this. For example, we could modify an SPTE under an
> > invalidated root here. I don't think that would be a problem since
> > we're just clearing it, but it makes the code more fragile.
>
> Heh, it better not be a problem, because there are plently of flows in the TDP MMU
> that can modify SPTEs under an invalidated root, e.g. fast_page_fault(),
> tdp_mmu_zap_leafs(), kvm_age_gfn(), kvm_test_age_gfn(), etc...  And before the
> patch that introduced is_page_fault_stale(), kvm_tdp_mmu_map() was even installing
> SPTEs into an invalid root!  Anything that takes a reference to a root and yields
> (or never takes mmu_lock) can potentially modify a SPTE under an invalid root.

That's true, I don't think there's really a problem with this commit,
just a different way of dealing with the PTs.

>
> Checking the paging structures for this flow wouldn't change anything.  Invalidating
> a root doesn't immediately zap SPTEs, it just marks the root invalid.  The other
> subtle gotcha is that kvm_reload_remote_mmus() doesn't actually gaurantee all vCPUs
> will have dropped the invalid root or performed a TLB flush when mmu_lock is dropped,
> those guarantees are only with respect to re-entering the guest!
>
> All of the above is no small part of why I don't want to walk the page tables:
> it's completely misleading as walking the page tables doesn't actually provide any
> protection, it's holding RCU that guarantees KVM doesn't write memory it doesn't own.

That's a great point. I was thinking about the RCU protection being
sort of passed down through the RCU dereferences from the root of the
paging structure to whatever SPTE we modify, but since we protect the
SPs with RCU too, dereferencing from them is just as good, I suppose.

>
> > Another approach to this would be to do in-place promotion / in-place
> > splitting once the patch sets David and I just sent out are merged.  That
> > would avoid causing extra page faults here to bring in the page after this
> > zap, but it probably wouldn't be safe if we did it under an invalidated root.
>
> I agree that in-place promotion would be better, but if we do that, I think a logical
> intermediate step would be to stop zapping unrelated roots and entries.  If there's
> a bug that is exposed/introduced by not zapping other stuff, I would much rather it
> show up when KVM stops zapping other stuff, not when KVM stops zapping other stuff
> _and_ promotes in place.  Ditto for if in-place promotion introduces a bug.

That makes sense. I think this is a good first step.

>
> > I'd rather avoid this extra complexity and just tolerate the worse
> > performance on the iTLB multi hit mitigation at this point since new
> > CPUs seem to be moving past that vulnerability.
>
> IMO, it reduces complexity, especially when looking at the series as a whole, which
> I fully realize you haven't yet done :-)  Setting aside the complexities of each
> chunk of code, what I find complex with the current TDP MMU zapping code is that
> there are no precise rules for what needs to be done in each situation.  I'm not
> criticizing how we got to this point, I absolutely think that hitting everything
> with a big hammer to get the initial version stable was the right thing to do.
>
> But a side effect of the big hammer approach is that it makes reasoning about things
> more difficult, e.g. "when is it safe to modify a SPTE versus when is it safe to insert
> a SPTE into the paging structures?" or "what needs to be zapped when the mmu_notifier
> unmaps a range?".
>
> And I also really want to avoid another snafu like the memslots with passthrough
> GPUs bug, where using a big hammer (zap all) when a smaller hammer (zap SPTEs for
> the memslot) _should_ work allows bugs to creep in unnoticed because they're hidden
> by overzealous zapping.
>
> > If you think this is worth the complexity, it'd be nice to do a little
> > benchmarking to make sure it's giving us a substantial improvement.
>
> Performance really isn't a motivating factor.  Per the changelog, the motivation
> is mostly to allow later patches to simplify zap_gfn_range() by having it zap only
> leaf SPTEs, and now that I've typed it up, also all of the above :-)

That makes sense, and addresses all my concerns. Carry on.
Thanks for writing all that up.