On 2024-04-12 09:14 AM, David Matlack wrote:
> On Thu, Apr 4, 2024 at 11:17 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Thu, Apr 04, 2024, David Matlack wrote:
> > > > I don't love the idea of adding more arch specific MMU behavior (going the wrong
> > > > direction), but it doesn't seem like an unreasonable approach in this case.
> > >
> > > I wonder if this is being overly cautious.
> >
> > Probably. "Lazy" is another word for it ;-)
> >
> > > I would expect only more benefit on architectures that more aggressively take
> > > the mmu_lock on vCPU threads during faults. The more lock acquisition on vCPU
> > > threads, the more this patch will help reduce vCPU starvation during
> > > CLEAR_DIRTY_LOG.
> > >
> > > Hm, perhaps testing with ept=N (which will use the write-lock for even
> > > dirty logging faults) would be a way to increase confidence in the
> > > effect on other architectures?
> >
> > Turning off the TDP MMU would be more representative, just manually disable the
> > fast-path, e.g.
>
> Good idea. I'm actually throwing in some writable module parameters
> too to make it easy to toggle between configurations.
>
> I'll report back when I have some data.

tl;dr

 * My patch likely _will_ regress migration performance on other
   architectures. Thank you Bibo and Sean for keeping me honest here.

 * I suspect the original issue my patch is trying to fix is actually
   specific to the way the TDP MMU does eager page splitting, and a
   more targeted fix is warranted.

---

To evaluate my patch I tested on x86 with different mmu_lock
configurations to simulate other architectures:

  Config 1: tdp_mmu=Y fast_page_fault_read_lock=N eager_page_split=Y
  Config 2: tdp_mmu=Y fast_page_fault_read_lock=Y eager_page_split=Y
  Config 3: tdp_mmu=N fast_page_fault_read_lock=N eager_page_split=N

Note: "fast_page_fault_read_lock" is a non-upstream parameter I added
that takes a read_lock/unlock() of mmu_lock in fast_page_fault().
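Roughly, the hack amounts to something like this (a sketch of the idea
only, not the exact patch; __fast_page_fault() here is just a stand-in
for the existing lockless body of fast_page_fault()):

static bool fast_page_fault_read_lock;
module_param(fast_page_fault_read_lock, bool, 0644);

static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
        int ret;

        /*
         * Testing hack: force the normally lockless fast path to
         * contend on mmu_lock, approximating architectures that take
         * the lock when handling write-protection faults.
         */
        if (fast_page_fault_read_lock)
                read_lock(&vcpu->kvm->mmu_lock);

        ret = __fast_page_fault(vcpu, fault);

        if (fast_page_fault_read_lock)
                read_unlock(&vcpu->kvm->mmu_lock);

        return ret;
}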
Config 1 is vanilla KVM/x86. Config 2 emulates KVM/arm64. Config 3
emulates LoongArch if LoongArch added support for lockless
write-protection fault handling.

The test I ran was a Live Migration of a 16 vCPU, 64GB VM running an
aggressive write-heavy workload. To compare runs I evaluated 3 metrics:

 * Duration of pre-copy.
 * Amount of dirty memory going into post-copy.
 * Total CPU usage of CLEAR_DIRTY_LOG.

The following table shows how each metric changed after adding my patch
to drop mmu_lock during CLEAR_DIRTY_LOG:

          | Precopy Duration | Post-Copy Dirty | CLEAR_DIRTY_LOG CPU
 ---------|------------------|-----------------|---------------------
 Config 1 | -1%              | -1%             | +6%
 Config 2 | -1%              | +1%             | +123%
 Config 3 | +32%             | +158%           | +5200%

Configs 2 and 3 both show regressions, with Config 3 severely regressed
in all 3 dimensions.

Given these regressions, I started rethinking the original issue this
patch is trying to fix.

The dips in guest performance during CLEAR_DIRTY_LOG occurred during
the first pre-copy pass but not during subsequent passes. One thing
unique to the first pass is eager page splitting. Ah ha, a theory!

First, the TDP MMU allocates shadow pages while holding mmu_lock during
eager page splitting, and only drops the lock if need_resched=True or a
GFP_NOWAIT allocation fails. If neither occurs, CLEAR_DIRTY_LOG could
potentially hold mmu_lock in write-mode for a long time.

Second, the host platform where we saw the dips has nx_huge_pages=Y. I
suspect the long CLEAR_DIRTY_LOG calls are blocking vCPUs taking
steady-state faults for NX Huge Pages, causing the dips in performance.

This theory also explains why we (Google) haven't seen similar drops in
guest performance when using !manual-protect, as in that configuration
the TDP MMU does eager page splitting under the read-lock instead of
the write-lock.

If this is all true, then a better / more targeted fix for this issue
would be to drop mmu_lock in the TDP MMU eager page splitting path. For
example, we could limit the "allocate under lock" behavior to only when
the read-lock is held, e.g.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7dfdc49a6ade..ea34f8232d97 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	 * If this allocation fails we drop the lock and retry with reclaim
 	 * allowed.
 	 */
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
-	if (sp)
-		return sp;
+	if (shared) {
+		sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
+		if (sp)
+			return sp;
+	}
 
 	rcu_read_unlock();

I checked the KVM/arm64 eager page splitting code, and it drops
mmu_lock to allocate page tables. So I suspect no fix is needed there
and this is, in fact, purely an x86-specific issue.