On Thu, Dec 2, 2021 at 10:43 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Dec 02, 2021, David Matlack wrote:
> > On Wed, Dec 1, 2021 at 3:37 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > It's not just the extra objects, it's the overall complexity that bothers me.
> > > Complexity isn't really the correct word, it's more that as written, the logic
> > > is spread over several files and is disingenuous from the perspective that the
> > > split_cache is in kvm->arch, which implies persistence, but the caches are
> > > completely torn down after every memslot split.
> > >
> > > I suspect part of the problem is that the code is trying to plan for a future
> > > where nested MMUs also support splitting large pages. Usually I'm all for that
> > > sort of thing, but in this case it creates a lot of APIs that should not exist,
> > > either because the function is not needed at all, or because it's a helper buried
> > > in tdp_mmu.c. E.g. assert_split_caches_invariants() is overkill.
> > >
> > > That's solvable by refactoring and shuffling code, but using kvm_mmu_memory_cache
> > > still feels wrong. The caches don't fully solve the might_sleep() problem since
> > > the loop still has to drop mmu_lock purely because it needs to allocate memory,
> >
> > I thought dropping the lock to allocate memory was a good thing. It
> > reduces the length of time we hold the RCU read lock and mmu_lock in
> > read mode. Plus it avoids the retry-with-reclaim and lets us reuse the
> > existing sp allocation code.
>
> It's not a simple reuse though, e.g. it needs new logic to detect when the caches
> are empty, requires a variant of tdp_mmu_iter_cond_resched(), needs its own instance
> of caches and thus initialization/destruction of the caches, etc...
>
> > Eager page splitting itself does not need to be that performant since
> > it's not on the critical path of vCPU execution. But holding the MMU
> > lock can negatively affect vCPU performance.
> >
> > But your preference is to allocate without dropping the lock when possible. Why?
>
> Because they're two different things. Lock contention is already handled by
> tdp_mmu_iter_cond_resched(). If mmu_lock is not contended, holding it for a long
> duration is a complete non-issue.

So I think you are positing that disabling reclaim will make the
allocations fast enough that the time between
tdp_mmu_iter_cond_resched checks will be acceptable. Is there really
no risk of long tail latency in kmem_cache_alloc() or
__get_free_page()? Even if it's rare, they will be common at scale.

This is why I'm being so hesitant, and prefer to avoid the problem
entirely by doing all allocations outside the lock. But I'm honestly
more than happy to be convinced otherwise and go with your approach.

>
> Dropping mmu_lock means restarting the walk at the root because a different task
> may have zapped/changed upper level entries. If every allocation is dropping
> mmu_lock, that adds up to a lot of extra memory accesses, especially when using
> 5-level paging.
>
> Batching allocations via mmu_caches mostly works around that problem, but IMO
> it's more complex overall than the retry-on-failure approach because it bleeds
> core details into several locations, e.g. the split logic needs to know intimate
> details of kvm_mmu_memory_cache, and we end up with two (or one complex) versions
> of tdp_mmu_iter_cond_resched().
>
> In general, I also dislike relying on magic numbers (the capacity of the cache)
> for performance. At best, we have to justify the magic number, now and in the
> future. At worst, someone will have a use case that doesn't play nice with KVM's
> chosen magic number and then we have to do more tuning, e.g. see the PTE prefetch
> stuff where the magic number of '8' (well, 7) ran out of gas for modern usage.
> I don't actually think tuning will be problematic for this case, but I'd rather
> avoid the discussion entirely if possible.
>
> I'm not completely opposed to using kvm_mmu_memory_cache to batch allocations,
> but I think we should do so if and only if batching has measurably better
> performance for things we care about. E.g. if eager splitting takes n% longer
> under heavy memory pressure, but vCPUs aren't impacted, do we care?
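
To make the retry-on-failure idea above concrete, here is a rough user-space
sketch, not KVM code. Everything in it is an assumption for illustration:
try_alloc_nowait(), alloc_may_sleep(), split_range() and struct toy_lock are
hypothetical names, the "lock" is just a flag, and allocation failures are
simulated by failing every fail_every-th attempt. It only shows the control
flow: allocate without sleeping while the lock is held, and drop the lock
(and pay for a walk restart) only when that attempt fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for mmu_lock: tracks state only, provides no real locking. */
struct toy_lock {
    bool held;
};

struct split_stats {
    unsigned long fast_allocs;   /* allocations satisfied under the lock */
    unsigned long slow_allocs;   /* allocations that needed the lock dropped */
    unsigned long walk_restarts; /* walks restarted from the root */
};

/*
 * Hypothetical non-sleeping allocator. Fails on every fail_every-th call to
 * simulate memory pressure; fail_every == 0 means no simulated failures.
 */
static void *try_alloc_nowait(unsigned long *calls, unsigned long fail_every)
{
    (*calls)++;
    if (fail_every && (*calls % fail_every) == 0)
        return NULL;
    return malloc(64);
}

/* Hypothetical sleeping allocator: only legal with the lock dropped. */
static void *alloc_may_sleep(const struct toy_lock *lock)
{
    assert(!lock->held);
    return malloc(64);
}

/* "Split" nr_huge entries, each needing one new page-table page. */
static struct split_stats split_range(unsigned long nr_huge,
                                      unsigned long fail_every)
{
    struct toy_lock lock = { .held = false };
    struct split_stats stats = { 0, 0, 0 };
    unsigned long calls = 0;

    lock.held = true;
    for (unsigned long i = 0; i < nr_huge; i++) {
        /* Fast path: try to allocate without giving up the lock. */
        void *page = try_alloc_nowait(&calls, fail_every);

        if (page) {
            stats.fast_allocs++;
        } else {
            /* Slow path: drop the lock, allocate, retake, restart walk. */
            lock.held = false;
            page = alloc_may_sleep(&lock);
            lock.held = true;
            stats.slow_allocs++;
            stats.walk_restarts++;
        }
        assert(page != NULL);
        free(page); /* a real split would install the page instead */
    }
    lock.held = false;
    return stats;
}
```

With no simulated failures, split_range() never leaves the fast path and never
restarts the walk. With fail_every set to 10, one allocation in ten takes the
slow path, so the restart cost scales with how often the non-sleeping attempt
fails rather than with the number of pages split. The sketch says nothing
about tail latency of the non-sleeping allocation itself, which is the open
question in this thread.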