On Tue, Feb 1, 2022 at 10:03 AM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 1/20/22 00:07, David Matlack wrote:
> > This series implements Eager Page Splitting for the TDP MMU.
> >
> > "Eager Page Splitting" is an optimization that has been in use in Google
> > Cloud since 2016 to reduce the performance impact of live migration on
> > customer workloads. It was originally designed and implemented by Peter
> > Feiner <pfeiner@xxxxxxxxxx>.
> >
> > For background and performance motivation for this feature, please
> > see "RFC: KVM: x86/mmu: Eager Page Splitting" [1].
> >
> > Implementation
> > ==============
> >
> > This series implements support for splitting all huge pages mapped by
> > the TDP MMU. Pages mapped by the shadow MMU are not split, although I
> > plan to add that support in a future patchset.
> >
> > Eager page splitting is triggered in two ways:
> >
> > - KVM_SET_USER_MEMORY_REGION ioctl: If this ioctl is invoked to enable
> >   dirty logging on a memslot and KVM_DIRTY_LOG_INITIALLY_SET is not
> >   enabled, KVM will attempt to split all huge pages in the memslot down
> >   to the 4K level.
> >
> > - KVM_CLEAR_DIRTY_LOG ioctl: If this ioctl is invoked and
> >   KVM_DIRTY_LOG_INITIALLY_SET is enabled, KVM will attempt to split all
> >   huge pages cleared by the ioctl down to the 4K level before attempting
> >   to write-protect them.
> >
> > Eager page splitting is enabled by default in both paths but can be
> > disabled via module param eager_page_split=N.
> >
> > Splitting for pages mapped by the TDP MMU is done under the MMU lock in
> > read mode. The lock is dropped and the thread rescheduled if contention
> > or need_resched() is detected.
> >
> > To allocate memory for the lower level page tables, we attempt to
> > allocate without dropping the MMU lock using GFP_NOWAIT to avoid doing
> > direct reclaim or invoking filesystem callbacks. If that fails we drop
> > the lock and perform a normal GFP_KERNEL allocation.
> >
> > Performance
> > ===========
> >
> > Eager page splitting moves the cost of splitting huge pages off of the
> > vCPU thread and onto the thread invoking one of the aforementioned
> > ioctls. This is useful because:
> >
> > - Splitting on the vCPU thread interrupts vCPU execution and is
> >   disruptive to customers, whereas splitting on VM ioctl threads can
> >   run in parallel with vCPU execution.
> >
> > - Splitting on the VM ioctl thread is more efficient because it does
> >   not require performing VM-exit handling and page table walks for every
> >   4K page.
> >
> > To measure the performance impact of Eager Page Splitting I ran
> > dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB
> > HugeTLB memory.
> >
> > When KVM_DIRTY_LOG_INITIALLY_SET is set, we can see that the first
> > KVM_CLEAR_DIRTY_LOG iteration gets longer because KVM is splitting
> > huge pages. But the time it takes for vCPUs to dirty their memory
> > is significantly shorter since they do not have to take write-
> > protection faults.
> >
> >            | Iteration 1 clear dirty log time | Iteration 2 dirty memory time
> > ---------- | -------------------------------- | -----------------------------
> > Before     | 0.049572219s                     | 2.751442902s
> > After      | 1.667811687s                     | 0.127016504s
> >
> > Eager page splitting does make subsequent KVM_CLEAR_DIRTY_LOG ioctls
> > about 4% slower since it always walks the page tables looking for pages
> > to split. This can be avoided but will require extra memory and/or code
> > complexity to track when splitting can be skipped.
> >
> >            | Iteration 3 clear dirty log time
> > ---------- | --------------------------------
> > Before     | 1.374501209s
> > After      | 1.422478617s
> >
> > When not using KVM_DIRTY_LOG_INITIALLY_SET, KVM performs splitting on
> > the entire memslot during the KVM_SET_USER_MEMORY_REGION ioctl that
> > enables dirty logging. We can see that as an increase in the time it
> > takes to enable dirty logging. This allows vCPUs to avoid taking
> > write-protection faults, which we again see in the dirty memory time.
> >
> >            | Enabling dirty logging time      | Iteration 1 dirty memory time
> > ---------- | -------------------------------- | -----------------------------
> > Before     | 0.001683739s                     | 2.943733325s
> > After      | 1.546904175s                     | 0.145979748s
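
For anyone following along who wants to see what the two trigger paths
above look like from the userspace side, here is a rough sketch. It is
illustrative only, not code from the series or the selftests: error
handling, vCPU/memory setup, and the KVM_ENABLE_CAP call that turns on
KVM_DIRTY_LOG_INITIALLY_SET are omitted or reduced to comments, and the
helper names are made up.

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Sketch of the two ioctls that can trigger eager page splitting.
 * vm_fd is an already-created KVM VM fd whose guest memory is backed
 * by huge pages.
 */
static void enable_dirty_logging(int vm_fd, __u32 slot, __u64 gpa,
                                 __u64 size, void *hva)
{
        struct kvm_userspace_memory_region region = {
                .slot = slot,
                .flags = KVM_MEM_LOG_DIRTY_PAGES,
                .guest_phys_addr = gpa,
                .memory_size = size,
                .userspace_addr = (unsigned long)hva,
        };

        /*
         * Path 1: without KVM_DIRTY_LOG_INITIALLY_SET, KVM eagerly
         * splits every huge page in the memslot while handling this
         * ioctl.
         */
        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

static void clear_dirty_log(int vm_fd, __u32 slot, __u64 first_page,
                            __u32 num_pages, void *dirty_bitmap)
{
        struct kvm_clear_dirty_log clear = {
                .slot = slot,
                .num_pages = num_pages,
                .first_page = first_page,
                .dirty_bitmap = dirty_bitmap,
        };

        /*
         * Path 2: with KVM_DIRTY_LOG_INITIALLY_SET enabled (via
         * KVM_ENABLE_CAP on KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2), huge
         * pages in the cleared range are split here, just before KVM
         * write-protects them.
         */
        ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}

dirty_log_perf_test drives these same two ioctls, which is where the
numbers above come from.
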
> >
> > Testing
> > =======
> >
> > - Ran all kvm-unit-tests and KVM selftests on debug and non-debug kernels.
> >
> > - Ran dirty_log_perf_test with different backing sources (anonymous,
> >   anonymous_thp, anonymous_hugetlb_2mb, anonymous_hugetlb_1gb) with and
> >   without Eager Page Splitting enabled.
> >
> > - Added a tracepoint locally to time the GFP_NOWAIT allocations. Across
> >   40 runs of dirty_log_perf_test using 1GiB HugeTLB with 96 vCPUs there
> >   were only 4 allocations that took longer than 20 microseconds and the
> >   longest was 60 microseconds. None of the GFP_NOWAIT allocations
> >   failed.
> >
> > - I have not been able to trigger a GFP_NOWAIT allocation failure (to
> >   exercise the fallback path). However, I did manually modify the code
> >   to force every allocation to fall back by removing the GFP_NOWAIT
> >   allocation altogether to make sure the logic works correctly.
> >
> > - Live migrated a 32 vCPU 32 GiB Linux VM running a workload that
> >   aggressively writes to guest memory with Eager Page Splitting enabled.
> >   Observed pages being split via tracepoint and the pages_{4k,2m,1g}
> >   stats.
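
As an aside, the GFP_NOWAIT-then-GFP_KERNEL fallback exercised above
boils down to the pattern below. This is a hand-written illustration,
not the code from tdp_mmu.c: the function names are invented, and the
real implementation additionally has to drop RCU and restart its
iterator after the lock is dropped.

/*
 * Illustrative sketch of the page table allocation strategy for eager
 * page splitting. Function names are made up; see tdp_mmu.c in the
 * series for the real thing.
 */
static struct kvm_mmu_page *alloc_sp_for_split(gfp_t gfp)
{
        struct kvm_mmu_page *sp;

        gfp |= __GFP_ZERO | __GFP_ACCOUNT;

        sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
        if (!sp)
                return NULL;

        sp->spt = (void *)__get_free_page(gfp);
        if (!sp->spt) {
                kmem_cache_free(mmu_page_header_cache, sp);
                return NULL;
        }

        return sp;
}

static struct kvm_mmu_page *try_alloc_sp_for_split(struct kvm *kvm)
{
        struct kvm_mmu_page *sp;

        lockdep_assert_held_read(&kvm->mmu_lock);

        /*
         * First try GFP_NOWAIT: no direct reclaim and no filesystem
         * callbacks, so the allocation is safe while holding mmu_lock
         * for read.
         */
        sp = alloc_sp_for_split(GFP_NOWAIT);
        if (sp)
                return sp;

        /*
         * Fall back to a normal sleeping allocation, which requires
         * dropping mmu_lock (the caller must then revalidate or
         * restart its walk).
         */
        read_unlock(&kvm->mmu_lock);
        sp = alloc_sp_for_split(GFP_KERNEL);
        read_lock(&kvm->mmu_lock);

        return sp;
}
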
>
> Queued, thanks!

Thanks Paolo! I should have the shadow MMU implementation out for review
shortly.

>
> Paolo
>
> > Version Log
> > ===========
> >
> > v2:
> >
> > [Overall Changes]
> > - Additional testing by live migrating a Linux VM (see above).
> > - Add Sean's, Ben's, and Peter's Reviewed-by tags.
> > - Use () when referring to functions in commit messages and comments [Sean]
> > - Add TDP MMU to shortlog where applicable [Sean]
> > - Fix grammatical errors in commit messages [Sean]
> > - Break 80+ char function declarations across multiple lines [Sean]
> >
> > [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
> > - Remove useless empty line [Peter]
> > - Tighten up the wording in comments [Sean]
> > - Move result of rcu_dereference() to a local variable to cut down line lengths [Sean]
> >
> > [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
> > - Add prep patch to return 0/-EBUSY instead of bool [Sean]
> > - Add prep patch to rename {un,}link_page to {un,}link_sp [Sean]
> > - Fold tdp_mmu_link_page() into tdp_mmu_install_sp_atomic() [Sean]
> >
> > [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
> > - Make inline [Sean]
> > - Eliminate WARN_ON_ONCE [Sean]
> > - Eliminate unnecessary local variable new_spte [Sean]
> >
> > [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
> > - Eliminate unnecessary local variable root_pt [Sean]
> >
> > [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent
> > - Eliminate redundant role overrides [Sean]
> >
> > [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization
> > - Rename alloc_tdp_mmu_page*() functions [Sean]
> >
> > [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
> > - Drop access from make_huge_page_split_spte() [Sean]
> > - Drop is_mmio_spte() check from make_huge_page_split_spte() [Sean]
> > - Change WARN_ON to WARN_ON_ONCE in make_huge_page_split_spte() [Sean]
> > - Improve comment for making 4K SPTEs executable [Sean]
> > - Rename mark_spte_executable() to make_spte_executable() [Sean]
> > - Put return type on same line as tdp_mmu_split_huge_page_atomic() [Sean]
> > - Drop child_spte local variable in tdp_mmu_split_huge_page_atomic() [Sean]
> > - Make alloc_tdp_mmu_page_for_split() play nice with
> >   commit 3a0f64de479c ("KVM: x86/mmu: Don't advance iterator after restart due to yielding") [Sean]
> > - Free unused sp after dropping RCU [Sean]
> > - Rename module param to something shorter [Sean]
> > - Document module param somewhere [Sean]
> > - Fix rcu_read_unlock in tdp_mmu_split_huge_pages_root() [me]
> > - Document TLB flush behavior [Peter]
> >
> > [PATCH v1 10/13] KVM: Push MMU locking down into kvm_arch_mmu_enable_log_dirty_pt_masked
> > - Drop [Peter]
> >
> > [PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG
> > - Hold the lock in write-mode when splitting [Peter]
> > - Document TLB flush behavior [Peter]
> >
> > [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages
> > - Record if split succeeded or failed [Sean]
> >
> > v1: https://lore.kernel.org/kvm/20211213225918.672507-1-dmatlack@xxxxxxxxxx/
> >
> > [Overall Changes]
> > - Use "huge page" instead of "large page" [Sean Christopherson]
> >
> > [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
> > - Add Ben's Reviewed-by.
> > - Add Peter's Reviewed-by.
> >
> > [RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
> > - Add comment when updating old_spte [Ben Gardon]
> > - Follow kernel style of else case in zap_gfn_range [Ben Gardon]
> > - Don't delete old_spte update after zapping in kvm_tdp_mmu_map [me]
> >
> > [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table
> > - Add blurb to commit message describing intentional drop of tracepoint [Ben Gardon]
> > - Consolidate "u64 spte = make_nonleaf_spte(...);" onto one line [Sean Christopherson]
> > - Do not free the sp if set fails [Sean Christopherson]
> >
> > [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct
> > - Drop to adopt Sean's proposed allocation scheme.
> >
> > [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
> > - No changes.
> >
> > [RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu
> > - Drop to adopt Sean's proposed allocation scheme.
> >
> > [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes
> > - Drop this commit and the helper function [Sean Christopherson]
> >
> > [RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
> > - Add Ben's Reviewed-by.
> >
> > [RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched
> > - Drop to adopt Sean's proposed allocation scheme.
> >
> > [RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
> > - Add Ben's Reviewed-by.
> >
> > [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
> > - Add a module parameter to control Eager Page Splitting [Peter Xu]
> > - Change level to large_spte_level [Ben Gardon]
> > - Get rid of BUG_ONs [Ben Gardon]
> > - Change += to |= and add a comment [Ben Gardon]
> > - Do not flush TLBs when dropping the MMU lock. [Sean Christopherson]
> > - Allocate memory directly from the kernel instead of using mmu_caches [Sean Christopherson]
> >
> > [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
> > - Fix deadlock by refactoring MMU locking and dropping write lock before splitting. [kernel test robot]
> > - Did not follow Sean's suggestion of skipping write-protection if splitting
> >   succeeds as it would require extra complexity since we aren't splitting
> >   pages in the shadow MMU yet.
> >
> > [RFC PATCH 14/15] KVM: x86/mmu: Add tracepoint for splitting large pages
> > - No changes.
> >
> > [RFC PATCH 15/15] KVM: x86/mmu: Update page stats when splitting large pages
> > - Squash into patch that first introduces page splitting.
> >
> > Note: I opted not to change TDP MMU functions to return int instead of
> > bool per Sean's suggestion. I agree this change should be done but can
> > be left to a separate series.
> >
> > RFC: https://lore.kernel.org/kvm/20211119235759.1304274-1-dmatlack@xxxxxxxxxx/
> >
> > [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@xxxxxxxxxxxxxx/#t
> >
> > David Matlack (18):
> >   KVM: x86/mmu: Rename rmap_write_protect() to
> >     kvm_vcpu_write_protect_gfn()
> >   KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect()
> >   KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
> >   KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return
> >     0/-EBUSY
> >   KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages
> >   KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to
> >     handle_removed_pt()
> >   KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU
> >     page table
> >   KVM: x86/mmu: Remove unnecessary warnings from
> >     restore_acc_track_spte()
> >   KVM: x86/mmu: Drop new_spte local variable from
> >     restore_acc_track_spte()
> >   KVM: x86/mmu: Move restore_acc_track_spte() to spte.h
> >   KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root
> >   KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages
> >   KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent
> >   KVM: x86/mmu: Separate TDP MMU shadow page allocation and
> >     initialization
> >   KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty
> >     logging is enabled
> >   KVM: x86/mmu: Split huge pages mapped by the TDP MMU during
> >     KVM_CLEAR_DIRTY_LOG
> >   KVM: x86/mmu: Add tracepoint for splitting huge pages
> >   KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and
> >     INITIALLY_SET
> >
> >  .../admin-guide/kernel-parameters.txt      |  26 ++
> >  arch/x86/include/asm/kvm_host.h            |   7 +
> >  arch/x86/kvm/mmu/mmu.c                     |  79 ++--
> >  arch/x86/kvm/mmu/mmutrace.h                |  23 +
> >  arch/x86/kvm/mmu/spte.c                    |  59 +++
> >  arch/x86/kvm/mmu/spte.h                    |  16 +
> >  arch/x86/kvm/mmu/tdp_iter.c                |   8 +-
> >  arch/x86/kvm/mmu/tdp_iter.h                |  10 +-
> >  arch/x86/kvm/mmu/tdp_mmu.c                 | 419 +++++++++++++-----
> >  arch/x86/kvm/mmu/tdp_mmu.h                 |   5 +
> >  arch/x86/kvm/x86.c                         |   6 +
> >  arch/x86/kvm/x86.h                         |   2 +
> >  .../selftests/kvm/dirty_log_perf_test.c    |  13 +-
> >  13 files changed, 520 insertions(+), 153 deletions(-)
> >
> >
> > base-commit: edb9e50dbe18394d0fc9d0494f5b6046fc912d33
>