On Mon, Mar 22, 2021, Ben Gardon wrote:
> On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > @@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
> >  				      lpage_disallowed_link);
> >  		WARN_ON_ONCE(!sp->lpage_disallowed);
> >  		if (is_tdp_mmu_page(sp)) {
> > -			kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn,
> > -				sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level));
> > +			gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
> > +			flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end,
> > +							  flush || !list_empty(&invalid_list));
> >  		} else {
> >  			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> >  			WARN_ON_ONCE(sp->lpage_disallowed);
> >  		}
> >
> >  		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> > -			kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > +			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
>
> This pattern of waiting until a yield is needed or lock contention is
> detected has always been a little suspect to me because
> kvm_mmu_commit_zap_page does work proportional to the work done before
> the yield was needed. That seems like more work than we should like to
> be doing at that point.
>
> The yield in kvm_tdp_mmu_zap_gfn_range makes that phenomenon even
> worse. Because we can satisfy the need to yield without clearing out
> the invalid list, we can potentially queue many more pages which will
> then all need to have their zaps committed at once. This is an
> admittedly contrived case which could only be hit in a high load
> nested scenario.
>
> It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from
> yielding. Since we should only need to zap one SPTE, the yield should
> not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure
> that only one SPTE is zapped we would have to specify the root though.
> Otherwise we could end up zapping all the entries for the same GFN
> range under an unrelated root.

Hmm, I originally did exactly that, but changed my mind because this zaps far
more than 1 SPTE.  This is zapping a SP that could be huge, but is not, which
means it's guaranteed to have a non-zero number of child SPTEs.  The worst case
scenario is that the SP is a PUD (potential 1GB page) and the leaves are 4k
SPTEs, i.e. up to 512 * 512 = 262,144 leaf SPTEs to zap.

But, I didn't consider the interplay between invalid_list and the TDP MMU
yielding.  Hrm.
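
To make that interplay concrete, below is a condensed sketch of the recovery
loop as it looks with this patch applied.  It's hand-reduced from the quoted
diff, so treat it as pseudocode: declarations, the SRCU lock, and the
to_zap accounting are omitted.

	write_lock(&kvm->mmu_lock);

	for ( ; to_zap; --to_zap) {
		sp = list_first_entry(&kvm->arch.lpage_disallowed_mmu_pages,
				      struct kvm_mmu_page,
				      lpage_disallowed_link);

		if (is_tdp_mmu_page(sp)) {
			/*
			 * May yield internally, i.e. the loop can keep going
			 * without ever draining invalid_list.
			 */
			gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
			flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end,
						flush || !list_empty(&invalid_list));
		} else {
			/* Only queues the zap; the commit is deferred. */
			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
		}

		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
			/*
			 * Pays for everything queued since the last break,
			 * i.e. cost is proportional to how long the loop ran
			 * without hitting this path.
			 */
			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
			flush = false;
			cond_resched_rwlock_write(&kvm->mmu_lock);
		}
	}
	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);

	write_unlock(&kvm->mmu_lock);

The problem case is the first branch: the yield inside
kvm_tdp_mmu_zap_gfn_range() can satisfy need_resched() without the loop ever
reaching the outer kvm_mmu_remote_flush_or_zap(), so invalid_list keeps
growing until the outer check actually fires.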