On Fri, Jun 24, 2022 at 11:27:34PM +0000, Sean Christopherson wrote:
> Dynamically size struct pte_list_desc's array of sptes based on whether
> or not KVM is using TDP.  Commit dc1cff969101 ("KVM: X86: MMU: Tune
> PTE_LIST_EXT to be bigger") bumped the number of entries in order to
> improve performance when using shadow paging, but its analysis that the
> larger size would not affect TDP was wrong.  Consuming pte_list_desc
> objects for nested TDP is indeed rare, but _allocating_ objects is not,
> as KVM allocates 40 objects for each per-vCPU cache.  Reducing the size
> from 128 bytes to 32 bytes reduces that per-vCPU cost from 5120 bytes to
> 1280, and also provides similar savings when eager page splitting for
> nested MMUs kicks in.
>
> The per-vCPU overhead could be further reduced by using a custom, smaller
> capacity for the per-vCPU caches, but that's more of an "and" than an
> "or" change, e.g. it wouldn't help the eager page split use case.
>
> Set the list size to the bare minimum without completely defeating the
> purpose of an array (and because pte_list_add() assumes the array is at
> least two entries deep).  A larger size, e.g. 4, would reduce the number
> of "allocations", but those "allocations" only become allocations in
> truth if a single vCPU depletes its cache to where a topup is needed,
> i.e. if a single vCPU "allocates" 30+ lists.  Conversely, those 2 extra
> entries consume 16 bytes * 40 * nr_vcpus in the caches the instant nested
> TDP is used.
>
> In the unlikely event that performance of aliased gfns for nested TDP
> really is (or becomes) a priority for oddball workloads, KVM could add a
> knob to let the admin tune the array size for their environment.
>
> Note, KVM also unnecessarily tops up the per-vCPU caches even when not
> using rmaps; this can also be addressed separately.

Is the only way pte_list_desc can get used when tdp=1 the case where the
hypervisor tries to map the same host page with different GPAs?  And we
don't really have a real use case for that, or.. do we?

Sorry to start by asking questions; it's just that if we know
pte_list_desc is probably not going to be used, could we simply skip the
cache layer as a whole?  IOW, rather than making the "array size of pte
list desc" dynamic, we make the whole "pte list desc cache layer" dynamic.
Is it possible?
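To make the idea concrete, below is a rough and completely untested sketch
of what I have in mind.  The topup counts are copied from the existing
mmu_topup_memory_caches(), and I'm assuming kvm_memslots_have_rmaps() is
the right gate here, i.e. that it stays false for pure TDP until a nested
shadow root gets allocated:

static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
	int r;

	/*
	 * pte_list_descs back the rmaps and the parent-PTE lists, both of
	 * which are consumed only by shadow MMU pages, so skip the topup
	 * entirely when the VM isn't using rmaps at all.
	 */
	if (kvm_memslots_have_rmaps(vcpu->kvm)) {
		/* 1 rmap, 1 parent PTE per level, plus the prefetched rmaps. */
		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
					       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
		if (r)
			return r;
	}

	/* ... the remaining topups stay as they are today ... */
}

I might be overlooking a user of pte_list_desc that can run without rmaps
enabled (the parent-PTE lists come to mind), but IIUC those also exist
only for shadow MMU pages, so the same gate should cover them.

-- 
Peter Xu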