This series is the result of digging into why deleting a memslot, which on x86 forces all vCPUs to reload a new MMU root, causes noticeably more jitter in vCPUs and other tasks when running with the TDP MMU than with the Shadow MMU (with TDP enabled).

Patch 1 addresses the most obvious issue by simply zapping at a finer granularity so that if a different task, e.g. a vCPU, wants to run on the pCPU doing the zapping, it doesn't have to wait for KVM to zap an entire 1GiB region, which can take hundreds of microseconds (or more). The shadow MMU checks for need_resched() (and mmu_lock contention, see below) every 10 zaps, which is why the shadow MMU doesn't induce the same level of jitter.

On preemptible kernels, zapping at 4KiB granularity will also cause the zapping task to yield mmu_lock much more aggressively if a writer comes along. That _sounds_ like a good thing, and most of the time it is, but sometimes bouncing mmu_lock can be a big net negative:

  https://lore.kernel.org/all/20240110012045.505046-1-seanjc@xxxxxxxxxx

While trying to figure out whether frequently yielding mmu_lock would be a negative or a positive, I ran into extremely high latencies for loading TDP MMU roots on VMs with large-ish numbers of vCPUs, e.g. a vCPU could end up taking more than a second to load a root.

Long story short, the issue is that the TDP MMU acquires mmu_lock for write when unloading roots, and again when loading a "new" root (in quotes because most vCPUs end up loading an existing root). With a decent number of vCPUs, that results in a _lot_ of mmu_lock contention, as every vCPU will take and release mmu_lock for write to unload its roots, and then again to load a new root. Due to rwlock's fairness (waiting writers block new readers), the contention can result in rather nasty worst case scenarios.

Patches 6-8 fix the issue by taking mmu_lock for read. The free path is very straightforward and doesn't require any new protection (IIRC, the only reason we didn't pursue this when reworking TDP MMU zapping back at the end of 2021 was because we had bigger issues to solve). Allocating a new root with mmu_lock held for read is a little harder, but still fairly easy: KVM only needs to ensure that it doesn't create duplicate roots, because everything that needs mmu_lock to ensure ordering must take mmu_lock for write, i.e. is still mutually exclusive with new roots coming along.

Patches 2-5 are small cleanups to avoid doing work for invalid roots, e.g. when zapping SPTEs purely to affect guest behavior, there's no need to zap invalid roots because they are unreachable from the guest.

All told, this significantly reduces mmu_lock contention when doing a fast zap, i.e. when deleting memslots, and takes the worst case latency for a vCPU to load a new root from >3ms to <100us for large-ish VMs (100+ vCPUs). For small and medium sized VMs (<24 vCPUs), the vast majority of loads take less than 1us, with the worst case being <10us, versus >200us without this series.

Note, I did all of the latency testing before the holidays, and then managed to lose almost all of my notes, which is why I don't have more precise data on the exact setups and latency bins.

/facepalm
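For the curious, the gist of the "allocate a root with mmu_lock held for read" approach (patches 6-8) looks roughly like the below. This is a hand-wavy sketch, not the actual diff: tdp_mmu_find_valid_root(), tdp_mmu_alloc_root(), and tdp_mmu_free_root() are illustrative stand-ins for the real helpers, and refcounting, invalid-root handling, and role/as_id details are elided.

	/* Rough sketch only; helper names below are illustrative, not the real KVM APIs. */
	static struct kvm_mmu_page *tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu)
	{
		struct kvm *kvm = vcpu->kvm;
		struct kvm_mmu_page *root, *existing;

		/*
		 * Look for a usable root with mmu_lock held only for read.  Most
		 * vCPUs find a root that was already created by another vCPU and
		 * never reach the allocation path.
		 */
		read_lock(&kvm->mmu_lock);

		root = tdp_mmu_find_valid_root(vcpu);	/* grabs a reference */
		if (root)
			goto out_unlock;

		/*
		 * No usable root, allocate a new one.  Recheck under the spinlock
		 * that protects the list of roots so that two vCPUs racing to
		 * create the "same" root don't install duplicates.  Everything
		 * else that needs to order against new roots still takes mmu_lock
		 * for write, and so is mutually exclusive with this path even
		 * though mmu_lock is held only for read.
		 */
		root = tdp_mmu_alloc_root(vcpu);

		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
		existing = tdp_mmu_find_valid_root(vcpu);
		if (existing) {
			/* Lost the race; drop the freshly allocated root. */
			tdp_mmu_free_root(kvm, root);
			root = existing;
		} else {
			list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
		}
		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

	out_unlock:
		read_unlock(&kvm->mmu_lock);
		return root;
	}

The key point the sketch is meant to show is that the only thing that needs to be serialized against concurrent root creation is root creation itself, which is cheap to do under a dedicated spinlock and doesn't require holding mmu_lock for write.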
Sean Christopherson (8):
  KVM: x86/mmu: Zap invalidated TDP MMU roots at 4KiB granularity
  KVM: x86/mmu: Don't do TLB flush when zapping SPTEs in invalid roots
  KVM: x86/mmu: Allow passing '-1' for "all" as_id for TDP MMU iterators
  KVM: x86/mmu: Skip invalid roots when zapping leaf SPTEs for GFN range
  KVM: x86/mmu: Skip invalid TDP MMU roots when write-protecting SPTEs
  KVM: x86/mmu: Check for usable TDP MMU root while holding mmu_lock for read
  KVM: x86/mmu: Alloc TDP MMU roots while holding mmu_lock for read
  KVM: x86/mmu: Free TDP MMU roots while holding mmu_lock for read

 arch/x86/kvm/mmu/mmu.c     |  33 +++++++---
 arch/x86/kvm/mmu/tdp_mmu.c | 124 ++++++++++++++++++++++++++-----------
 arch/x86/kvm/mmu/tdp_mmu.h |   2 +-
 3 files changed, 111 insertions(+), 48 deletions(-)


base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
-- 
2.43.0.275.g3460e3d667-goog