On Mon, Apr 15, 2024 at 1:00 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Apr 15, 2024, David Matlack wrote:
> > If this is all true, then a better / more targeted fix for this issue would be
> > to drop mmu_lock in the TDP MMU eager page splitting path. For example, we
> > could limit the "allocate under lock" behavior to only when the read-lock is
> > held, e.g.
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 7dfdc49a6ade..ea34f8232d97 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> >          * If this allocation fails we drop the lock and retry with reclaim
> >          * allowed.
> >          */
> > -       sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> > -       if (sp)
> > -               return sp;
> > +       if (shared) {
> > +               sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> > +               if (sp)
> > +                       return sp;
> > +       }
> >
> >         rcu_read_unlock();
> >
> > I checked the KVM/arm64 eager page splitting code, and it drops the mmu_lock to
> > allocate page tables. So I suspect no fix is needed there and this is, in fact,
> > purely an x86-specific issue.
>
> Hmm, it'd be nice if we didn't have to rely on random arch flows to coincidentally
> do the optimal thing for eager page splitting. Not sure how best to document the
> "best known practice" though.

We discussed upstreaming Google's mmu_lock timing stats yesterday. Once that's
in place, KVM could WARN_ON_ONCE() if mmu_lock is held for an excessive amount
of time to help flag these kinds of issues. That might trigger false positives
though, as holding mmu_lock for a long time is no big deal if there's no
contention.
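Very roughly, I'm imagining something like the sketch below (made-up names,
not even compile-tested): stash a timestamp in struct kvm when mmu_lock is
acquired and WARN on release if it was held too long. This ignores that a
single timestamp doesn't really work for concurrent readers, the real version
would presumably hang off the timing stats instead of open-coding ktime, and
the 100ms threshold is completely arbitrary.

/* Hypothetical helpers; 'mmu_lock_acquired' would be a new ktime_t in struct kvm. */
static inline void kvm_mmu_lock_note_acquired(struct kvm *kvm)
{
        kvm->mmu_lock_acquired = ktime_get();
}

static inline void kvm_mmu_lock_note_released(struct kvm *kvm)
{
        /* Flag flows that held mmu_lock for a suspiciously long time. */
        WARN_ON_ONCE(ktime_ms_delta(ktime_get(), kvm->mmu_lock_acquired) > 100);
}

These would get called from KVM's mmu_lock lock/unlock paths (write and read).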
> As for the TDP MMU code, unless the cost of dropping and reacquiring mmu_lock for
> read is measurable, I would prefer to unconditionally drop mmu_lock, and delete
> the GFP_NOWAIT allocation. There can be lock contention when mmu_lock is held
> for read, it's just less common.

SGTM. I'll do some testing. Unfortunately, the original MySQL workload that led
to this patch has bitrotted, so I'm having some trouble reproducing the original
results and confirming the TDP MMU fix. Sigh.

> On a related topic, I think we should take a hard look at the rwlock_needbreak()
> usage in tdp_mmu_iter_cond_resched(). Because dropping when allocating is really
> just speculatively dropping mmu_lock because it _might_ be contended, but doing
> so at a batch size that provides a good balance between doing enough work under
> mmu_lock and providing low latency for vCPUs. I.e. in theory, we should be able
> to handle this fully in tdp_mmu_iter_cond_resched(), but that code is nowhere
> near smart enough and it's currently only for preemptible kernels (or at least,
> it's supposed to be only for preemptible kernels).

Dropping the lock when allocating is also about being able to drop GFP_NOWAIT,
i.e. to allow direct reclaim and other blocking operations. That's valuable for
"cry-for-help" type migrations where the host is under intense memory pressure.
I'd rather do the reclaim on the eager page splitting thread than on a vCPU.

But I agree that the rwlock_needbreak() checks are pretty much untested and
likely far from optimal.

> Simply yielding on contention is not at all optimal, as evidenced by the whole
> dynamic preemption debacle[1][2]. The immediate issue was "fixed" by having vCPUs
> avoid taking mmu_lock, but KVM really shouldn't get into a situation where KVM is
> pathologically dropping mmu_lock to the point where a non-vCPU action grinds to
> a halt.
>
> The contention logic fails to take into account many things:
>
>  (1) Is the other task higher priority?
>
>  (2) Is the other task a vCPU, or something else?
>
>  (3) Will yielding actually allow the other task to make forward progress?
>
>  (4) What is the cost of dropping mmu_lock, e.g. is a remote TLB flush needed?
>
>  (5) What is the duration this task is expected to hold mmu_lock?
>
>  (6) Has this task made enough progress for yielding to be a decent choice?
>
> and probably many more than that.
>
> As we discussed off-list, I think there are two viable approaches:
>
>  (a) We drop the preemption dependency from tdp_mmu_iter_cond_resched()'s lock
>      contention logic, and improve the logic (especially the forward progress
>      guarantees) so that tdp_mmu_iter_cond_resched() provides solid performance
>      in all cases.

The only way I can think of to universally measure forward progress would be by
wall time. Again, that becomes more feasible with the mmu_lock timing stats. But
we'll have to hand-pick some thresholds, and that feels wrong...

>  (b) We completely remove the rwlock_needbreak() checks from
>      tdp_mmu_iter_cond_resched(), and instead rely on unconditionally dropping
>      mmu_lock in flows where doing so provides the best overall balance, e.g. as
>      in the eager page split case.
>
> I don't have a strong preference between (a) and (b), though I think I'd lean
> towards (b), because it's simpler. My guess is that we can achieve similar
> performance results with both. E.g. odds are decent that the "best" batch size
> (see #6) is large enough that the cost of dropping and reacquiring mmu_lock is
> in the noise when it's not contended.
>
> The main argument I see for (b) is that it's simpler, as only code that actually
> has a justified need to drop mmu_lock does so. The advantage I see with (a) is
> that it would provide structure and documentation for choosing when to drop
> mmu_lock (or not).

I need to think it through more, but I'm leaning toward (b) and using the
mmu_lock stats to flag flows that potentially hold the lock too long. With (b)
we can make each flow incrementally better and don't have to pick any magic
numbers.

> E.g. dropping in the eager page split path makes sense because KVM does so at a
> large batch size, odds are good that the contending task is a vCPU, there's no
> TLB flush required, the total hold time of mmu_lock is high, and we know that
> dropping mmu_lock will allow vCPUs to make forward progress. (a) would do a much
> better job of capturing all that in code, albeit with quite a bit more complexity.
>
> Regardless of which option we choose, I think we should drop the preemptible kernel
> dependency from the lock contention logic in tdp_mmu_iter_cond_resched(), i.e.
> manually check if mmu_lock is contended instead of bouncing through rwlock_needbreak().

+1
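And just to make sure we're talking about the same thing, by "manually check if
mmu_lock is contended" I'm assuming you mean essentially the strawman below
(illustrative only, quoting the current condition from memory, so the context
may not be exact):

-       if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
+       if (need_resched() || rwlock_is_contended(&kvm->mmu_lock)) {

in tdp_mmu_iter_cond_resched(). IIUC, rwlock_is_contended() just looks at the
lock itself, whereas rwlock_needbreak() is also gated on the kernel's
preemption model, which is why the contention check is supposed to be a no-op
on non-preemptible kernels.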
> The current approach essentially means that there's zero testing of the performance
> of the yield-on-contention logic. E.g. the complaints about the TDP MMU yielding
> too aggressively only popped up when commit c597bfddc9e9 ("sched: Provide Kconfig
> support for default dynamic preempt mode") unintentionally enabled
> rwlock_needbreak() by default.
>
> That's definitely easier said than done though, as I suspect that if we switched
> to rwlock_is_contended(), i.e. dropped the preemptible requirement, without also
> enhancing tdp_mmu_iter_cond_resched() to make it smarter as above, we'd see a lot
> of performance regressions.
>
> [1] https://lore.kernel.org/all/20240312193911.1796717-3-seanjc@xxxxxxxxxx
> [2] https://lore.kernel.org/all/20240222012640.2820927-1-seanjc@xxxxxxxxxx