Re: [PATCH 7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Aug 22, 2024, David Matlack wrote:
> On Thu, Aug 22, 2024 at 9:50 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Mon, Aug 05, 2024, David Matlack wrote:
> > > Recheck the iter.old_spte still points to a page table when recovering
> > > huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> > > re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> > > another CPU in between stepping down and back up.
> > >
> > > This could avoids a useless cmpxchg (and possibly a remote TLB flush) if
> > > another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> > > recovery worker, or vCPUs taking faults on the huge page region).
> > >
> > > This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> > > thus can see a different value, which is not immediately obvious when
> > > reading the code.
> > >
> > > Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> > > ---
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 07d5363c9db7..bdc7fd476721 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
> > >               while (max_mapping_level > iter.level)
> > >                       tdp_iter_step_up(&iter);
> > >
> > > +             /*
> > > +              * Re-check that iter.old_spte still points to a page table.
> > > +              * Since mmu_lock is held for read and tdp_iter_step_up()
> > > +              * re-reads iter.sptep, it's possible the SPTE was zapped or
> > > +              * recovered by another CPU in between stepping down and
> > > +              * stepping back up.
> > > +              */
> > > +             if (!is_shadow_present_pte(iter.old_spte) ||
> > > +                 is_last_spte(iter.old_spte, iter.level))
> > > +                     continue;
> >
> > This is the part of the step-up logic that I do not like.  Even this check doesn't
> > guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
> > used to reach the leaf SPTE.  E.g. in an absurdly theoretical situation, the SPTE
> > could be zapped and then re-set with another non-leaf SPTE.  Which is fine, but
> > only because of several very subtle mechanisms.
> 
> I'm not sure why that matters. The only thing that matters is that the
> GFN->PFN and permissions cannot change, and that is guaranteed by
> holding mmu_lock for read.

Because it introduces possible edge cases, that while benign, require the reader
to think about and reason through.  E.g. if _this_ task is trying to replace a
4KiB page with a 1GiB page, what happens if some other task replaces the parent
2MiB page?  It's not immediately obvious that looping on tdp_iter_step_up() will
happily blaze past a !PRESENT SPTE, which might not even be the actual SPTE in
the tree at this point.

> At the end of the day, we never actually care about the value of the
> SPTE we are replacing. We only care that it's a non-leaf SPTE.
>
> > kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
> > in the huge range, so it's safe to propagate any and all WRITABLE bits.  This
> > requires knowing/remembering that KVM disallows huge pages when a gfn is write-
> > tracked, and relies on that never changing (which is a fairly safe bet, but the
> > behavior isn't fully set in stone).
> > not set.
> >
> > And the presence of a shadow-present leaf SPTE ensures there are no in-flight
> > mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
> > return until all relevant leaf SPTEs have been zapped.
> 
> As you point out in the next paragraph there could be an inflight invalidate
> that yielded, no?

Yeah, ignore this, I forgot to amend it after remembering that invalidation can
yield.





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux