On Thu, Aug 22, 2024, David Matlack wrote:
> On Thu, Aug 22, 2024 at 9:50 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Mon, Aug 05, 2024, David Matlack wrote:
> > > Recheck that iter.old_spte still points to a page table when recovering
> > > huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> > > re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> > > another CPU in between stepping down and back up.
> > >
> > > This could avoid a useless cmpxchg (and possibly a remote TLB flush) if
> > > another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> > > recovery worker, or vCPUs taking faults on the huge page region).
> > >
> > > This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> > > thus can see a different value, which is not immediately obvious when
> > > reading the code.
> > >
> > > Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> > > ---
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 07d5363c9db7..bdc7fd476721 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
> > >  		while (max_mapping_level > iter.level)
> > >  			tdp_iter_step_up(&iter);
> > >
> > > +		/*
> > > +		 * Re-check that iter.old_spte still points to a page table.
> > > +		 * Since mmu_lock is held for read and tdp_iter_step_up()
> > > +		 * re-reads iter.sptep, it's possible the SPTE was zapped or
> > > +		 * recovered by another CPU in between stepping down and
> > > +		 * stepping back up.
> > > +		 */
> > > +		if (!is_shadow_present_pte(iter.old_spte) ||
> > > +		    is_last_spte(iter.old_spte, iter.level))
> > > +			continue;
> >
> > This is the part of the step-up logic that I do not like.  Even this check
> > doesn't guarantee that the SPTE that is being replaced is the same non-leaf
> > SPTE that was used to reach the leaf SPTE.  E.g. in an absurdly theoretical
> > situation, the SPTE could be zapped and then re-set with another non-leaf
> > SPTE.  Which is fine, but only because of several very subtle mechanisms.
>
> I'm not sure why that matters. The only thing that matters is that the
> GFN->PFN and permissions cannot change, and that is guaranteed by
> holding mmu_lock for read.

Because it introduces possible edge cases that, while benign, require the
reader to think about and reason through.  E.g. if _this_ task is trying to
replace a 4KiB page with a 1GiB page, what happens if some other task
replaces the parent 2MiB page?  It's not immediately obvious that looping on
tdp_iter_step_up() will happily blaze past a !PRESENT SPTE, which might not
even be the actual SPTE in the tree at this point.

> At the end of the day, we never actually care about the value of the
> SPTE we are replacing. We only care that it's a non-leaf SPTE.
>
> > kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns
> > anywhere in the huge range, so it's safe to propagate any and all WRITABLE
> > bits.  This requires knowing/remembering that KVM disallows huge pages
> > when a gfn is write-tracked, and relies on that never changing (which is a
> > fairly safe bet, but the behavior isn't fully set in stone).
>
> not set.
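To illustrate the invariant being leaned on there, here is a stand-alone toy
model (plain userspace C, not the actual mmu.c code; the real logic lives
roughly in update_gfn_disallow_lpage_count() and kvm_mmu_max_mapping_level(),
and all the slot/gfn math is stripped out): write-tracking a gfn bumps a
disallow count for every huge-page size containing it, and the
max-mapping-level walk steps down past any level with a non-zero count.

  /*
   * Toy model, NOT KVM code: one region, one disallow count per
   * huge-page level.
   */
  #include <stdio.h>

  enum { PG_LEVEL_4K = 1, PG_LEVEL_2M, PG_LEVEL_1G };

  static int disallow_lpage[PG_LEVEL_1G + 1];

  static void write_track_add_gfn(void)
  {
          /* Tracking a gfn disallows every huge page that contains it. */
          for (int level = PG_LEVEL_2M; level <= PG_LEVEL_1G; level++)
                  disallow_lpage[level]++;
  }

  static int max_mapping_level(int max_level)
  {
          /* Step down until a level with no disallowed huge pages. */
          for ( ; max_level > PG_LEVEL_4K; max_level--)
                  if (!disallow_lpage[max_level])
                          break;
          return max_level;
  }

  int main(void)
  {
          printf("before write-tracking: level %d\n",
                 max_mapping_level(PG_LEVEL_1G));
          write_track_add_gfn();
          printf("after write-tracking:  level %d\n",
                 max_mapping_level(PG_LEVEL_1G));
          return 0;
  }

So as long as recovery only steps up to a level blessed by
kvm_mmu_max_mapping_level(), it can't cross a write-tracked gfn; the catch,
as noted above, is that this relies on write-tracking always disallowing
huge pages.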
> > And the presence of a shadow-present leaf SPTE ensures there are no
> > in-flight mmu_notifier invalidations, as
> > kvm_mmu_notifier_invalidate_range_start() won't return until all relevant
> > leaf SPTEs have been zapped.
>
> As you point out in the next paragraph, there could be an in-flight
> invalidate that yielded, no?

Yeah, ignore this, I forgot to amend it after remembering that invalidation
can yield.
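To make the step-up subtlety concrete, here is a second stand-alone toy model
(plain userspace C, not the KVM code; the SPTE is reduced to a present bit
and a leaf bit, and the racing CPU's stores are simulated inline): the
iterator re-reads the SPTE on the way back up, so the re-check from the patch
is what stops the walker from cmpxchg'ing against something that is no longer
a page table.

  /*
   * Toy model, NOT KVM code: the "SPTE changed between stepping down
   * and stepping back up" race.  A racing CPU zaps the non-leaf SPTE
   * and installs a huge (leaf) SPTE; the re-read on the way up must
   * be re-checked before attempting the cmpxchg.
   */
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define SPTE_PRESENT (1ull << 0)
  #define SPTE_LEAF    (1ull << 1)

  static _Atomic uint64_t parent_spte = SPTE_PRESENT; /* non-leaf */

  static bool is_shadow_present_pte(uint64_t spte) { return spte & SPTE_PRESENT; }
  static bool is_last_spte(uint64_t spte)          { return spte & SPTE_LEAF; }

  int main(void)
  {
          /* This CPU saw a non-leaf SPTE on the way down... */

          /* ...then another CPU zaps it and recovers the huge page. */
          atomic_store(&parent_spte, 0);
          atomic_store(&parent_spte, SPTE_PRESENT | SPTE_LEAF);

          /* Stepping back up re-reads the SPTE, a la tdp_iter_step_up(). */
          uint64_t old_spte = atomic_load(&parent_spte);

          /* The re-check from the patch: bail instead of a useless cmpxchg. */
          if (!is_shadow_present_pte(old_spte) || is_last_spte(old_spte)) {
                  puts("SPTE is no longer a page table, skip the cmpxchg");
                  return 0;
          }

          puts("still a page table, try to install the huge SPTE");
          return 0;
  }

In the real code the two stores happen on another CPU, which is why
iter.old_spte can legitimately differ from what was seen on the way down even
though mmu_lock is held (for read) the whole time.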