On Mon, Apr 15, 2024, David Matlack wrote:
> * I suspect the original issue my patch is trying to fix is actually specific
>   to the way the TDP MMU does eager page splitting and a more targeted fix is
>   warranted.
>
> ---
>
> To evaluate my patch I tested on x86 with different mmu_lock configurations
> to simulate other architectures.
>
> Config 1: tdp_mmu=Y fast_page_fault_read_lock=N eager_page_split=Y
> Config 2: tdp_mmu=Y fast_page_fault_read_lock=Y eager_page_split=Y
> Config 3: tdp_mmu=N fast_page_fault_read_lock=N eager_page_split=N
>
> Note: "fast_page_fault_read_lock" is a non-upstream parameter I added to
> add a read_lock/unlock() in fast_page_fault().
>
> Config 1 is vanilla KVM/x86. Config 2 emulates KVM/arm64. Config 3 emulates
> LoongArch if LoongArch added support for lockless write-protection fault
> handling.
>
> The test I ran was a Live Migration of a 16VCPU 64GB VM running an aggressive
> write-heavy workload. To compare runs I evaluated 3 metrics:
>
>  * Duration of pre-copy.
>  * Amount of dirty memory going into post-copy.
>  * Total CPU usage of CLEAR_DIRTY_LOG.
>
> The following table shows how each metric changed after adding my patch to drop
> mmu_lock during CLEAR_DIRTY_LOG.
>
>          | Precopy Duration | Post-Copy Dirty | CLEAR_DIRTY_LOG CPU
> ---------|------------------|-----------------|---------------------
> Config 1 | -1%              | -1%             | +6%
> Config 2 | -1%              | +1%             | +123%
> Config 3 | +32%             | +158%           | +5200%
>
> Config 2 and 3 both show regressions, with Config 3 severely regressed in all 3
> dimensions.

...

> If this is all true, then a better / more targeted fix for this issue would be
> to drop mmu_lock in the TDP MMU eager page splitting path. For example, we
> could limit the "allocate under lock" behavior to only when the read-lock is
> held, e.g.
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7dfdc49a6ade..ea34f8232d97 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>          * If this allocation fails we drop the lock and retry with reclaim
>          * allowed.
>          */
> -       sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> -       if (sp)
> -               return sp;
> +       if (shared) {
> +               sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> +               if (sp)
> +                       return sp;
> +       }
>
>         rcu_read_unlock();
>
> I checked the KVM/arm64 eager page splitting code, and it drops the mmu_lock to
> allocate page tables. So I suspect no fix is needed there and this is, in fact,
> purely an x86-specific issue.

Hmm, it'd be nice if we didn't have to rely on random arch flows to
coincidentally do the optimal thing for eager page splitting. Not sure how
best to document the "best known practice" though.

As for the TDP MMU code, unless the cost of dropping and reacquiring mmu_lock
for read is measurable, I would prefer to unconditionally drop mmu_lock and
delete the GFP_NOWAIT allocation. There can be lock contention when mmu_lock
is held for read, it's just less common.
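E.g. something along the lines of the below (completely untested sketch; the
function signature and the exact lock juggling around the reclaimable
allocation are paraphrased from the quoted hunk, so treat the details as
illustrative only):

static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
						       bool shared, int nid)
{
	struct kvm_mmu_page *sp;

	/*
	 * Unconditionally drop mmu_lock (and the RCU read lock) and do a
	 * fully reclaimable allocation, instead of first attempting a
	 * GFP_NOWAIT allocation while holding the lock.  The caller must
	 * treat this as having yielded, i.e. restart its iteration.
	 */
	rcu_read_unlock();

	if (shared)
		read_unlock(&kvm->mmu_lock);
	else
		write_unlock(&kvm->mmu_lock);

	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT, nid);

	if (shared)
		read_lock(&kvm->mmu_lock);
	else
		write_lock(&kvm->mmu_lock);

	rcu_read_lock();

	return sp;
}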
On a related topic, I think we should take a hard look at the
rwlock_needbreak() usage in tdp_mmu_iter_cond_resched(). Dropping mmu_lock
when allocating is really just speculatively dropping it because it _might_
be contended, but doing so at a batch size that provides a good balance
between doing enough work under mmu_lock and providing low latency for
vCPUs. I.e. in theory, we should be able to handle this fully in
tdp_mmu_iter_cond_resched(), but that code is nowhere near smart enough, and
it's currently active only for preemptible kernels (or at least, it's
supposed to be only for preemptible kernels).

Simply yielding on contention is not at all optimal, as evidenced by the
whole dynamic preemption debacle[1][2]. The immediate issue was "fixed" by
having vCPUs avoid taking mmu_lock, but KVM really shouldn't get into a
situation where KVM is pathologically dropping mmu_lock to the point where a
non-vCPU action grinds to a halt.

The contention logic fails to take into account many things:

 (1) Is the other task higher priority?
 (2) Is the other task a vCPU, or something else?
 (3) Will yielding actually allow the other task to make forward progress?
 (4) What is the cost of dropping mmu_lock, e.g. is a remote TLB flush needed?
 (5) How long is this task expected to hold mmu_lock?
 (6) Has this task made enough progress for yielding to be a decent choice?

and probably many more than that.

As we discussed off-list, I think there are two viable approaches:

 (a) We drop the preemption dependency from tdp_mmu_iter_cond_resched()'s
     lock contention logic, and improve the logic (especially the forward
     progress guarantees) so that tdp_mmu_iter_cond_resched() provides solid
     performance in all cases.

 (b) We completely remove the rwlock_needbreak() checks from
     tdp_mmu_iter_cond_resched(), and instead rely on unconditionally
     dropping mmu_lock in flows where doing so provides the best overall
     balance, e.g. as in the eager page split case.

I don't have a strong preference between (a) and (b), though I think I'd
lean towards (b). My guess is that we can achieve similar performance
results with both, e.g. odds are decent that the "best" batch size (see (6))
is large enough that the cost of dropping and reacquiring mmu_lock is in the
noise when it's not contended.

The main argument I see for (b) is that it's simpler: only code that
actually has a justified need to drop mmu_lock does so. The advantage I see
with (a) is that it would provide structure and documentation for choosing
when (or when not) to drop mmu_lock. E.g. dropping in the eager page split
path makes sense because KVM does so at a large batch size, odds are good
that the contending task is a vCPU, no TLB flush is required, the total
mmu_lock hold time is high, and we know that dropping mmu_lock will allow
vCPUs to make forward progress. (a) would do a much better job of capturing
all of that in code, albeit with quite a bit more complexity.

Regardless of which option we choose, I think we should drop the preemptible
kernel dependency from the lock contention logic in
tdp_mmu_iter_cond_resched(), i.e. manually check if mmu_lock is contended
instead of bouncing through rwlock_needbreak(). The current approach
essentially means that there's zero testing of the performance of the
yield-on-contention logic. E.g. the complaints about the TDP MMU yielding
too aggressively only popped up when commit c597bfddc9e9 ("sched: Provide
Kconfig support for default dynamic preempt mode") unintentionally enabled
rwlock_needbreak() by default.

That's definitely easier said than done though, as I suspect that if we
switched to rwlock_is_contended(), i.e. dropped the preemptible requirement,
without also enhancing tdp_mmu_iter_cond_resched() to make it smarter as
above, we'd see a lot of performance regressions.
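For illustration, the "manual" check itself is a one-liner; the body of
tdp_mmu_iter_cond_resched() below is paraphrased from memory and is not a
tested patch, and as noted above the forward progress heuristics would need
to get a lot smarter before actually making this switch:

static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
							   struct tdp_iter *iter,
							   bool flush, bool shared)
{
	/* Ensure forward progress has been made before yielding. */
	if (iter->next_last_level_gfn == iter->yielded_gfn)
		return false;

	/*
	 * Check contention directly instead of bouncing through
	 * rwlock_needbreak(), which only reports contention on preemptible
	 * kernels (modulo the dynamic preemption wrinkle above).
	 */
	if (need_resched() || rwlock_is_contended(&kvm->mmu_lock)) {
		if (flush)
			kvm_flush_remote_tlbs(kvm);

		rcu_read_unlock();

		/*
		 * Explicitly drop and reacquire mmu_lock so that the
		 * contending task can get in, regardless of the kernel's
		 * preemption model.
		 */
		if (shared) {
			read_unlock(&kvm->mmu_lock);
			cond_resched();
			read_lock(&kvm->mmu_lock);
		} else {
			write_unlock(&kvm->mmu_lock);
			cond_resched();
			write_lock(&kvm->mmu_lock);
		}

		rcu_read_lock();

		iter->yielded = true;
	}

	return iter->yielded;
}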
[1] https://lore.kernel.org/all/20240312193911.1796717-3-seanjc@xxxxxxxxxx
[2] https://lore.kernel.org/all/20240222012640.2820927-1-seanjc@xxxxxxxxxx