On Thu, Aug 22, 2024 at 9:50 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Aug 05, 2024, David Matlack wrote:
> > Recheck that iter.old_spte still points to a page table when recovering
> > huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> > re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> > another CPU in between stepping down and back up.
> >
> > This could avoid a useless cmpxchg (and possibly a remote TLB flush) if
> > another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> > recovery worker, or vCPUs taking faults on the huge page region).
> >
> > This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> > thus can see a different value, which is not immediately obvious when
> > reading the code.
> >
> > Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 07d5363c9db7..bdc7fd476721 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
> >  		while (max_mapping_level > iter.level)
> >  			tdp_iter_step_up(&iter);
> >
> > +		/*
> > +		 * Re-check that iter.old_spte still points to a page table.
> > +		 * Since mmu_lock is held for read and tdp_iter_step_up()
> > +		 * re-reads iter.sptep, it's possible the SPTE was zapped or
> > +		 * recovered by another CPU in between stepping down and
> > +		 * stepping back up.
> > +		 */
> > +		if (!is_shadow_present_pte(iter.old_spte) ||
> > +		    is_last_spte(iter.old_spte, iter.level))
> > +			continue;
>
> This is the part of the step-up logic that I do not like. Even this check doesn't
> guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
> used to reach the leaf SPTE. E.g. in an absurdly theoretical situation, the SPTE
> could be zapped and then re-set with another non-leaf SPTE. Which is fine, but
> only because of several very subtle mechanisms.

I'm not sure why that matters. The only thing that matters is that the
GFN->PFN mapping and permissions cannot change, and that is guaranteed
by holding mmu_lock for read.

At the end of the day, we never actually care about the value of the
SPTE we are replacing. We only care that it's a non-leaf SPTE.

>
> kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
> in the huge range, so it's safe to propagate any and all WRITABLE bits. This
> requires knowing/remembering that KVM disallows huge pages when a gfn is write-
> tracked, and relies on that never changing (which is a fairly safe bet, but the
> behavior isn't fully set in stone).
>
not set.
>
> And the presence of a shadow-present leaf SPTE ensures there are no in-flight
> mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
> return until all relevant leaf SPTEs have been zapped.

As you point out in the next paragraph, there could be an in-flight
invalidate that yielded, no?

>
> And even more subtly, recover_huge_pages_range() can install a huge SPTE while
> tdp_mmu_zap_leafs() is running, e.g. if tdp_mmu_zap_leafs() is processing 4KiB
> SPTEs because the greater 2MiB page is being unmapped by the primary MMU, and
> tdp_mmu_zap_leafs() yields. That's again safe only because upon regaining
> control, tdp_mmu_zap_leafs() will restart at the root and thus observe and zap
> the new huge SPTE.
I agree it's subtle, but only in the sense that the TDP MMU is subtle.
Restarting at the root after dropping mmu_lock is a fundamental concept
in the TDP MMU.

>
> So while I'm pretty sure this logic is functionally ok, its safety is extremely
> dependent on a number of behaviors in KVM.
>
> That said, I can't tell which option I dislike less. E.g. we could do something
> like this, where kvm_mmu_name_tbd() grabs the pfn+writable information from the
> primary MMU's PTE/PMD/PUD. Ideally, KVM would use GUP, but then KVM wouldn't
> be able to create huge SPTEs for non-GUP-able memory, e.g. PFNMAP'd memory.
>
> I don't love this either, primarily because not using GUP makes this yet another
> custom flow

Yeah. I don't like having the huge page recovery path need its own
special way to construct SPTEs from scratch. E.g. I could see this
approach becoming a problem if KVM gains support for R/W/X GFN
attributes.

> i.e. isn't any less tricky than reusing a child SPTE. It does have
> the advantage of not having to find a shadow-present child though, i.e. is likely
> the most performant option. I agree that that likely doesn't matter in practice,
> especially since the raw latency of disabling dirty logging isn't terribly
> important.
>
> 	rcu_read_lock();
>
> 	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
> retry:
> 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> 			continue;
>
> 		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
> 		    !is_shadow_present_pte(iter.old_spte))
> 			continue;
>
> 		/*
> 		 * TODO: this should skip to the end of the parent, because if
> 		 * the first SPTE can't be made huger, then no SPTEs at this
> 		 * level can be huger.
> 		 */
> 		if (is_last_spte(iter.old_spte, iter.level))
> 			continue;
>
> 		/*
> 		 * If iter.gfn resides outside of the slot, i.e. the page for
> 		 * the current level overlaps but is not contained by the slot,
> 		 * then the SPTE can't be made huge. More importantly, trying
> 		 * to query that info from slot->arch.lpage_info will cause an
> 		 * out-of-bounds access.
> 		 */
> 		if (iter.gfn < start || iter.gfn >= end)
> 			continue;
>
> 		if (kvm_mmu_name_tbd(kvm, slot, iter.gfn, &pfn, &writable,
> 				     &max_mapping_level))
> 			continue;
>
> 		/*
> 		 * If the range is being invalidated, do not install a SPTE as
> 		 * KVM may have already zapped this specific gfn, e.g. if KVM's
> 		 * unmapping has run to completion, but the primary MMU hasn't
> 		 * zapped its PTEs. There is no need to check for *past*
> 		 * invalidations, because all information is gathered while
> 		 * holding mmu_lock, i.e. it can't have become stale due to an
> 		 * entire mmu_notifier invalidation sequence completing.
> 		 */
> 		if (mmu_invalidate_retry_gfn(kvm, kvm->mmu_invalidate_seq, iter.gfn))
> 			continue;
>
> 		/*
> 		 * KVM disallows huge pages for write-protected gfns, it should
> 		 * be impossible for make_spte() to encounter such a gfn since
> 		 * write-protecting a gfn requires holding mmu_lock for write.
> 		 */
> 		flush |= __tdp_mmu_map_gfn(...);
> 		WARN_ON_ONCE(r == RET_PF_EMULATE);
> 	}
>
> 	rcu_read_unlock();
>
> Assuming you don't like the above idea (I'm not sure I do), what if instead of
> doing the step-up, KVM starts a second iteration using the shadow page it wants
> to replace as the root of the walk?
>
> This has the same subtle dependencies on kvm_mmu_max_mapping_level() and the
> ordering with respect to an mmu_notifier invalidation, but it at least avoids
> having to reason about the correctness of re-reading a SPTE and modifying the
> iteration level within the body of an iteration loop.
>
> It should also yield smaller diffs overall, e.g. no revert, no separate commit
> to recheck the SPTE, etc. And I believe it's more performant than the step-up
> approach when there are SPTEs that _can't_ be huge, as KVM won't traverse down
> into the leafs in that case.

This approach looks good to me. I'll try it out and see if I run into
any issues.

>
> An alternative to the tdp_mmu_iter_needs_resched() logic would be to pass in
> &flush, but I think that ends up being more confusing and harder to follow.

Yeah, I think that would be more complicated. If we drop mmu_lock then
we need to re-check kvm_mmu_max_mapping_level() and restart at the
root.

>
> static int tdp_mmu_make_huge_spte(struct kvm *kvm, struct tdp_iter *parent,
> 				  u64 *huge_spte)
> {
> 	struct kvm_mmu_page *root = sptep_to_sp(parent->sptep);
> 	gfn_t start = parent->gfn;
> 	gfn_t end = start + ???; /* use parent's level */
> 	struct tdp_iter iter;
>
> 	tdp_root_for_each_leaf_pte(iter, root, start, end) {
> 		/*
> 		 * Use the parent iterator when checking for forward progress,
> 		 * so that KVM doesn't get stuck due to always yielding while
> 		 * walking child SPTEs.
> 		 */
> 		if (tdp_mmu_iter_needs_resched(kvm, parent))
> 			return -EAGAIN;
>
> 		*huge_spte = make_huge_spte(kvm, iter.old_spte);
> 		return 0;
> 	}
>
> 	return -ENOENT;
> }
>
> static void recover_huge_pages_range(struct kvm *kvm,
> 				     struct kvm_mmu_page *root,
> 				     const struct kvm_memory_slot *slot)
> {
> 	gfn_t start = slot->base_gfn;
> 	gfn_t end = start + slot->npages;
> 	struct tdp_iter iter;
> 	int max_mapping_level;
> 	bool flush = false;
> 	u64 huge_spte;
> 	int r;
>
> 	if (WARN_ON_ONCE(kvm_slot_dirty_track_enabled(slot)))
> 		return;
>
> 	rcu_read_lock();
>
> 	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
> restart:
> 		if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
> 			flush = false;
> 			continue;
> 		}
>
> 		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
> 		    !is_shadow_present_pte(iter.old_spte))
> 			continue;
>
> 		/*
> 		 * If iter.gfn resides outside of the slot, i.e. the page for
> 		 * the current level overlaps but is not contained by the slot,
> 		 * then the SPTE can't be made huge. More importantly, trying
> 		 * to query that info from slot->arch.lpage_info will cause an
> 		 * out-of-bounds access.
> 		 */
> 		if (iter.gfn < start || iter.gfn >= end)
> 			continue;
>
> 		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
> 		if (max_mapping_level < iter.level)
> 			continue;
>
> 		r = tdp_mmu_make_huge_spte(kvm, &iter, &huge_spte);
> 		if (r == -EAGAIN)
> 			goto restart;
> 		else if (r)
> 			continue;
>
> 		/*
> 		 * If setting the SPTE fails, e.g. because it has been modified
> 		 * by a different task, iteration will naturally continue with
> 		 * the next SPTE. Don't bother retrying this SPTE, races are
> 		 * uncommon and odds are good the SPTE
> 		 */
> 		if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
> 			flush = true;
> 	}
>
> 	if (flush)
> 		kvm_flush_remote_tlbs_memslot(kvm, slot);
>
> 	rcu_read_unlock();
> }
>
> static inline bool tdp_mmu_iter_needs_resched(struct kvm *kvm,
> 					      struct tdp_iter *iter)
> {
> 	/* Ensure forward progress has been made before yielding. */
> 	return iter->next_last_level_gfn != iter->yielded_gfn &&
> 	       (need_resched() || rwlock_needbreak(&kvm->mmu_lock));
> }
>
> /*
>  * Yield if the MMU lock is contended or this thread needs to return control
>  * to the scheduler.
>  *
>  * If this function should yield and flush is set, it will perform a remote
>  * TLB flush before yielding.
>  *
>  * If this function yields, iter->yielded is set and the caller must skip to
>  * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
>  * over the paging structures to allow the iterator to continue its traversal
>  * from the paging structure root.
>  *
>  * Returns true if this function yielded.
>  */
> static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
> 							   struct tdp_iter *iter,
> 							   bool flush, bool shared)
> {
> 	WARN_ON_ONCE(iter->yielded);
>
> 	if (!tdp_mmu_iter_needs_resched(kvm, iter))
> 		return false;
>
> 	if (flush)
> 		kvm_flush_remote_tlbs(kvm);
>
> 	rcu_read_unlock();
>
> 	if (shared)
> 		cond_resched_rwlock_read(&kvm->mmu_lock);
> 	else
> 		cond_resched_rwlock_write(&kvm->mmu_lock);
>
> 	rcu_read_lock();
>
> 	WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);
>
> 	iter->yielded = true;
> 	return true;
> }
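
For my own notes, here is roughly how I'm reading the sketch and how I'd
expect to fill in the details when I try it out. This is just a sketch and
subject to change once I actually write and test the patch: I'm assuming
the walk should start from the child page table that the to-be-replaced
SPTE points to (i.e. spte_to_child_sp(parent->old_spte) rather than
sptep_to_sp(parent->sptep)), that the range covers
KVM_PAGES_PER_HPAGE(parent->level) pages, that make_huge_spte() will also
want the target level, and that tdp_mmu_iter_needs_resched() is your
helper from above.

static int tdp_mmu_make_huge_spte(struct kvm *kvm, struct tdp_iter *parent,
				  u64 *huge_spte)
{
	/* Walk the page table that the to-be-replaced non-leaf SPTE points to. */
	struct kvm_mmu_page *root = spte_to_child_sp(parent->old_spte);
	gfn_t start = parent->gfn;
	gfn_t end = start + KVM_PAGES_PER_HPAGE(parent->level);
	struct tdp_iter iter;

	tdp_root_for_each_leaf_pte(iter, root, start, end) {
		/*
		 * Use the parent iterator when checking for forward progress,
		 * so that KVM doesn't get stuck always yielding while walking
		 * child SPTEs.
		 */
		if (tdp_mmu_iter_needs_resched(kvm, parent))
			return -EAGAIN;

		/*
		 * Any shadow-present leaf SPTE will do, since mmu_lock is held
		 * for read and kvm_mmu_max_mapping_level() already vetted the
		 * entire huge range.
		 */
		*huge_spte = make_huge_spte(kvm, iter.old_spte, parent->level);
		return 0;
	}

	return -ENOENT;
}

Happy to adjust if I'm misreading the intent of the sketch.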