On Tue, Nov 30, 2021 at 12:48:24AM +0000, David Matlack wrote:
> On Wed, Nov 24, 2021 at 09:44:21PM +0000, Mingwei Zhang wrote:
> > TDP MMU SPTE zapping process currently uses two levels of iterations. The
> > first level iteration happens at the for loop within the zap_gfn_range()
> > with the purpose of calibrating the accurate range for zapping. The second
> > level itreration start at tdp_mmu_set_spte{,_atomic}() that tears down the
> 
>                  iteration
> 
> > whole paging structures (leaf and non-leaf SPTEs) within the range. The
> > former iteration is yield safe, while the second one is not.
> 
> I know what you mean but I'd suggest being more specific than "yield
> safe". For example:
> 
>   Unlike the outer loop, the recursive zapping done under
>   tdp_mmu_set_spte{,_atomic} does not yield. Since zapping is done with
>   a pre-order traversal, zapping sufficiently large ranges can lead to
>   RCU stall warnings.
> 
> I'd also clarify here that the TDP MMU iterator uses a pre-order
> traversal, which causes KVM to end up doing the maximum amount of
> zapping under tdp_mmu_set_spte{,_atomic} and not the outer for loop.

(Ah sorry for the redundant suggestion. I wrote this paragraph and then
reworded my suggested text above to mention the pre-order traversal.)
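
To make the yielding point concrete: the outer walk looks roughly like the
sketch below. This is a simplified, untested illustration rather than the
exact upstream code, although the helper names (tdp_root_for_each_pte,
tdp_mmu_iter_cond_resched, tdp_mmu_set_spte) are the existing ones.
tdp_mmu_iter_cond_resched() can only run between iterator steps, so once a
high-level non-leaf SPTE is handed to tdp_mmu_set_spte(), the entire
subtree beneath it is torn down with no opportunity to reschedule.

        tdp_root_for_each_pte(iter, root, start, end) {
                /* The only point where the walk may yield. */
                if (can_yield &&
                    tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
                        flush = false;
                        continue;
                }

                if (!is_shadow_present_pte(iter.old_spte))
                        continue;

                /*
                 * Zapping a non-leaf SPTE here recursively tears down every
                 * page table and mapping below it, without yielding.
                 */
                tdp_mmu_set_spte(kvm, &iter, 0);
                flush = true;
        }
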
> 
> >
> > In many cases, zapping SPTE process could be optimized since the non-leaf
> > SPTEs could most likely be retained for the next allocation. On the other
> > hand, for large scale SPTE zapping scenarios, we may end up zapping too
> > many SPTEs and use excessive CPU time that causes the RCU stall warning.
> >
> > The following selftest reproduces the warning:
> >
> > (env: kvm.tdp_mmu=Y)
> > ./dirty_log_perf_test -v 64 -b 8G
> >
> > Optimize the zapping process by skipping all SPTEs above a certain level in
> > the first iteration. This allows us to control the granularity of the
> > actual zapping and invoke tdp_mmu_iter_cond_resched() on time. In addition,
> > we would retain some of the non-leaf SPTEs to accelerate next allocation.
> >
> > For the selection of the `certain level`, we choose the PG_LEVEL_1G because
> > it is currently the largest page size supported and it naturally fits the
> > scenario of splitting large pages.
> >
> > For `zap_all` case (usually) at VM teardown time, we use a two-phase
> > mechanism: the 1st phase zaps all SPTEs at PG_LEVEL_1G level and 2nd phase
> > zaps everything else. This is achieved by the helper function
> > __zap_gfn_range().
> >
> > Cc: Sean Christopherson <seanjc@xxxxxxxxxx>
> > Cc: Ben Gardon <bgardon@xxxxxxxxxx>
> > Cc: David Matlack <dmatlack@xxxxxxxxxx>
> >
> > Signed-off-by: Mingwei Zhang <mizhang@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 57 ++++++++++++++++++++++++++------------
> >  1 file changed, 40 insertions(+), 17 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 89d16bb104de..3fadc51c004a 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -697,24 +697,16 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> >   * account for the possibility that other threads are modifying the paging
> >   * structures concurrently. If shared is false, this thread should hold the
> >   * MMU lock in write mode.
> > + *
> > + * If zap_all is true, eliminate all the paging structures that contains the
> > + * SPTEs.
> >   */
> > -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > -                          gfn_t start, gfn_t end, bool can_yield, bool flush,
> > -                          bool shared)
> > +static bool __zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > +                            gfn_t start, gfn_t end, bool can_yield, bool flush,
> > +                            bool shared, bool zap_all)
> >  {
> > -        gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> > -        bool zap_all = (start == 0 && end >= max_gfn_host);
> >          struct tdp_iter iter;
> >
> > -        /*
> > -         * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> > -         * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> > -         * and so KVM will never install a SPTE for such addresses.
> > -         */
> > -        end = min(end, max_gfn_host);
> > -
> > -        kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> > -
> >          rcu_read_lock();
> >
> >          tdp_root_for_each_pte(iter, root, start, end) {
> > @@ -725,17 +717,24 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >                          continue;
> >                  }
> >
> > -                if (!is_shadow_present_pte(iter.old_spte))
> > +                /*
> > +                 * In zap_all case, ignore the checking of present since we have
> > +                 * to zap everything.
> > +                 */
> > +                if (!zap_all && !is_shadow_present_pte(iter.old_spte))
> >                          continue;
> 
> I don't believe there's any reason to attempt to zap a non-present spte,
> even in the zap_all case. In any case, this change deserves its own
> patch and a commit message that describes why the old logic is incorrect
> and how this fixes it.
> 
> >
> >                  /*
> >                   * If this is a non-last-level SPTE that covers a larger range
> >                   * than should be zapped, continue, and zap the mappings at a
> > -                 * lower level, except when zapping all SPTEs.
> > +                 * lower level. Actual zapping started at proper granularity
> > +                 * that is not so large as to cause a soft lockup when handling
> > +                 * the changed pte (which does not yield).
> >                   */
> >                  if (!zap_all &&
> >                      (iter.gfn < start ||
> > -                     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> > +                     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
> > +                     iter.level > PG_LEVEL_1G) &&
> >                      !is_last_spte(iter.old_spte, iter.level))
> >                          continue;
> 
> This if statement is getting a bit long. I'd suggest breaking out the
> level check and also using KVM_MAX_HUGEPAGE_LEVEL.
> 
> e.g.
> 
>         /*
>          * If not doing zap_all, only zap up to the huge page level to
>          * avoid doing too much work in the recursive tdp_mmu_set_spte*
>          * call below, since it does not yield.
>          *
>          * This will potentially leave behind some childless page tables
>          * but that's ok because ...
>          */
>         if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
>                 continue;
> 
> And on that note, what is the reasoning for why it's ok to leave behind
> childless page tables? I assume it's because most of the time we'll use
> that page table again in the future, and at worst we leave the page
> table allocated until the VM is cleaned up?
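
In case it helps, with the level check broken out like that, the existing
range check can go back to exactly its pre-patch form, which also keeps the
diff smaller. Untested sketch of how I picture the two checks sitting
together in the loop body:

        /*
         * If not doing zap_all, only zap up to the huge page level to
         * avoid doing too much work in the recursive tdp_mmu_set_spte*
         * call below, since it does not yield.
         */
        if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
                continue;

        /*
         * If this is a non-last-level SPTE that covers a larger range
         * than should be zapped, continue, and zap the mappings at a
         * lower level.
         */
        if (!zap_all &&
            (iter.gfn < start ||
             iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
            !is_last_spte(iter.old_spte, iter.level))
                continue;
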
> 
> >
> > @@ -756,6 +755,30 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >          return flush;
> >  }
> >
> > +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > +                          gfn_t start, gfn_t end, bool can_yield, bool flush,
> > +                          bool shared)
> > +{
> > +        gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> > +        bool zap_all = (start == 0 && end >= max_gfn_host);
> > +
> > +        /*
> > +         * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> > +         * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> > +         * and so KVM will never install a SPTE for such addresses.
> > +         */
> > +        end = min(end, max_gfn_host);
> > +
> > +        kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> > +
> > +        flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush, shared,
> > +                                false);
> > +        if (zap_all)
> > +                flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
> > +                                        shared, true);
> > +        return flush;
> > +}
> > +
> >  /*
> >   * Tears down the mappings for the range of gfns, [start, end), and frees the
> >   * non-root pages mapping GFNs strictly within that range. Returns true if
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >
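
One last note on the zap_all path, mostly to check my understanding of the
two-phase scheme. Annotating the new call sequence with my reading of the
intent (not something I've verified):

        /*
         * Phase 1 (zap_all == false): zap everything at or below the 1G
         * level, yielding via tdp_mmu_iter_cond_resched() between SPTEs,
         * so no single non-yielding teardown covers more than 1GiB.
         */
        flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
                                shared, false);

        /*
         * Phase 2 (zap_all == true): zap whatever is left. Only the
         * upper-level page tables emptied by phase 1 remain, so each
         * individual non-yielding teardown is now cheap.
         */
        if (zap_all)
                flush = __zap_gfn_range(kvm, root, start, end, can_yield,
                                        flush, shared, true);

Is that an accurate summary? If so, it would be worth spelling it out in
the comment above __zap_gfn_range().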