Re: [PATCH v2 00/18] KVM: x86/mmu: Eager Page Splitting for the TDP MMU

David Matlack <dmatlack@xxxxxxxxxx> · Tue, 1 Feb 2022 10:24:02 -0800

On Tue, Feb 1, 2022 at 10:03 AM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 1/20/22 00:07, David Matlack wrote:
> > This series implements Eager Page Splitting for the TDP MMU.
> >
> > "Eager Page Splitting" is an optimization that has been in use in Google
> > Cloud since 2016 to reduce the performance impact of live migration on
> > customer workloads. It was originally designed and implemented by Peter
> > Feiner <pfeiner@xxxxxxxxxx>.
> >
> > For background and performance motivation for this feature, please
> > see "RFC: KVM: x86/mmu: Eager Page Splitting" [1].
> >
> > Implementation
> > ==============
> >
> > This series implements support for splitting all huge pages mapped by
> > the TDP MMU. Pages mapped by the shadow MMU are not split, although I
> > plan to add the support in a future patchset.
> >
> > Eager page splitting is triggered in two ways:
> >
> > - KVM_SET_USER_MEMORY_REGION ioctl: If this ioctl is invoked to enable
> >    dirty logging on a memslot and KVM_DIRTY_LOG_INITIALLY_SET is not
> >    enabled, KVM will attempt to split all huge pages in the memslot down
> >    to the 4K level.
> >
> > - KVM_CLEAR_DIRTY_LOG ioctl: If this ioctl is invoked and
> >    KVM_DIRTY_LOG_INITIALLY_SET is enabled, KVM will attempt to split all
> >    huge pages cleared by the ioctl down to the 4K level before attempting
> >    to write-protect them.
> >
> > Eager page splitting is enabled by default in both paths but can be
> > disabled via module param eager_page_split=N.
> >
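
The two triggers and the module parameter described above can be summarized as a small decision function. This is only a sketch of the rules as stated in the cover letter, not code from the series: the enum names and eager_split_scope() are hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names: they model the rules in the cover letter only. */
enum split_trigger {
	TRIGGER_ENABLE_DIRTY_LOGGING,	/* KVM_SET_USER_MEMORY_REGION */
	TRIGGER_CLEAR_DIRTY_LOG,	/* KVM_CLEAR_DIRTY_LOG */
};

enum split_scope {
	SPLIT_NONE,			/* no eager splitting on this path */
	SPLIT_WHOLE_MEMSLOT,		/* every huge page in the memslot */
	SPLIT_CLEARED_RANGE,		/* only huge pages being cleared */
};

/*
 * initially_set:    whether KVM_DIRTY_LOG_INITIALLY_SET is enabled.
 * eager_page_split: value of the eager_page_split module parameter.
 */
static enum split_scope eager_split_scope(enum split_trigger trigger,
					  bool initially_set,
					  bool eager_page_split)
{
	if (!eager_page_split)
		return SPLIT_NONE;

	switch (trigger) {
	case TRIGGER_ENABLE_DIRTY_LOGGING:
		/* With INITIALLY_SET, splitting is deferred to the clear path. */
		return initially_set ? SPLIT_NONE : SPLIT_WHOLE_MEMSLOT;
	case TRIGGER_CLEAR_DIRTY_LOG:
		return initially_set ? SPLIT_CLEARED_RANGE : SPLIT_NONE;
	}

	assert(!"unknown trigger");
	return SPLIT_NONE;
}
```

Under this model, exactly one of the two ioctl paths splits for a given INITIALLY_SET setting, and eager_page_split=N turns both off.
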
> > Splitting for pages mapped by the TDP MMU is done under the MMU lock in
> > read mode. The lock is dropped and the thread rescheduled if contention
> > or need_resched() is detected.
> >
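
The yielding behavior described above can be pictured with a small user-space model. Everything here is an assumption made for illustration: struct split_walk, split_all() and the should_yield callback are hypothetical, and the real lock and reschedule calls are represented only by a comment. The point of the sketch is that the walk retries the same entry after yielding instead of skipping it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical walk state, standing in for a page table iterator. */
struct split_walk {
	size_t next;	/* index of the next huge page to consider */
	size_t yields;	/* number of times the lock was dropped */
};

/* Models "contention or need_resched() is detected". */
typedef bool (*should_yield_fn)(size_t index);

/* Example policy used for illustration: yield at every fourth entry. */
static bool yield_every_fourth(size_t index)
{
	return index % 4 == 0;
}

/* Returns the number of huge pages "split" by the walk. */
static size_t split_all(struct split_walk *walk, size_t nr_huge_pages,
			should_yield_fn should_yield)
{
	size_t split = 0;
	bool just_yielded = false;

	assert(walk && should_yield);

	while (walk->next < nr_huge_pages) {
		if (!just_yielded && should_yield(walk->next)) {
			/* Here: drop the lock, reschedule, retake the lock. */
			walk->yields++;
			just_yielded = true;
			continue;	/* retry the same entry, do not advance */
		}
		just_yielded = false;
		split++;
		walk->next++;
	}
	return split;
}
```

With yield_every_fourth() and eight entries, the model yields before entries 0 and 4 and still visits all eight.
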
> > To allocate memory for the lower level page tables, we attempt to
> > allocate without dropping the MMU lock using GFP_NOWAIT to avoid doing
> > direct reclaim or invoking filesystem callbacks. If that fails we drop
> > the lock and perform a normal GFP_KERNEL allocation.
> >
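
The allocation strategy above follows a common "try without sleeping, then fall back" pattern. Below is a user-space sketch of that pattern under stated assumptions: struct fake_mmu_lock, the alloc_fn callbacks and alloc_pt_for_split() are hypothetical stand-ins, with calloc() playing the role of the blocking GFP_KERNEL allocation. It is not the code from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for the MMU lock; only tracks whether it is held. */
struct fake_mmu_lock {
	bool held;
};

/* An allocator callback; a "nowait" one may return NULL instead of sleeping. */
typedef void *(*alloc_fn)(size_t size);

/* Illustration helper: a non-blocking attempt that always fails. */
static void *nowait_always_fails(size_t size)
{
	(void)size;
	return NULL;
}

/* Illustration helper: the blocking fallback, modeled with calloc(). */
static void *blocking_alloc(size_t size)
{
	return calloc(1, size);
}

/*
 * Try the non-blocking allocation with the lock held. Only if that fails,
 * drop the lock, allocate with the blocking allocator, and retake the lock.
 * *dropped_lock tells the caller whether the lock was ever released.
 */
static void *alloc_pt_for_split(struct fake_mmu_lock *lock, size_t size,
				alloc_fn try_nowait, alloc_fn alloc_blocking,
				bool *dropped_lock)
{
	void *pt;

	assert(lock->held);
	*dropped_lock = false;

	pt = try_nowait(size);
	if (pt)
		return pt;

	lock->held = false;		/* drop the lock before sleeping */
	*dropped_lock = true;
	pt = alloc_blocking(size);
	lock->held = true;		/* retake it before returning */

	return pt;
}
```

When *dropped_lock is set, a caller would typically need to revalidate whatever it was looking at before the lock was released, since the page tables may have changed in the meantime.
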
> > Performance
> > ===========
> >
> > Eager page splitting moves the cost of splitting huge pages off of the
> > vCPU thread and onto the thread invoking one of the aforementioned
> > ioctls. This is useful because:
> >
> >   - Splitting on the vCPU thread interrupts vCPU execution and is
> >     disruptive to customers whereas splitting on VM ioctl threads can
> >     run in parallel with vCPU execution.
> >
> >   - Splitting on the VM ioctl thread is more efficient because it does
> >     not require performing VM-exit handling and page table walks for every
> >     4K page.
> >
> > To measure the performance impact of Eager Page Splitting, I ran
> > dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB
> > HugeTLB memory.
> >
> > When KVM_DIRTY_LOG_INITIALLY_SET is set, we can see that the first
> > KVM_CLEAR_DIRTY_LOG iteration gets longer because KVM is splitting
> > huge pages. But the time it takes for vCPUs to dirty their memory
> > is significantly shorter since they do not have to take write-
> > protection faults.
> >
> >             | Iteration 1 clear dirty log time | Iteration 2 dirty memory time
> > ---------- | -------------------------------- | -----------------------------
> > Before     | 0.049572219s                     | 2.751442902s
> > After      | 1.667811687s                     | 0.127016504s
> >
> > Eager page splitting does make subsequent KVM_CLEAR_DIRTY_LOG ioctls
> > about 4% slower since it always walks the page tables looking for pages
> > to split.  This can be avoided but will require extra memory and/or code
> > complexity to track when splitting can be skipped.
> >
> >             | Iteration 3 clear dirty log time
> > ---------- | --------------------------------
> > Before     | 1.374501209s
> > After      | 1.422478617s
> >
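
The figures in the tables above can be cross-checked with two small helpers. These are illustrative only: percent_change() and speedup() are hypothetical names, and the inputs are the times quoted in the tables.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent. */
static double percent_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* How many times faster 'after' is compared to 'before'. */
static double speedup(double before, double after)
{
	assert(after > 0.0);
	return before / after;
}
```

For the Iteration 3 clear dirty log times, percent_change(1.374501209, 1.422478617) comes out at roughly 3.5%, in line with the "about 4%" figure. For the Iteration 2 dirty memory times, speedup(2.751442902, 0.127016504) is a little under 22x.
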
> > When not using KVM_DIRTY_LOG_INITIALLY_SET, KVM performs splitting on
> > the entire memslot during the KVM_SET_USER_MEMORY_REGION ioctl that
> > enables dirty logging. We can see that as an increase in the time it
> > takes to enable dirty logging. This allows vCPUs to avoid taking
> > write-protection faults which we again see in the dirty memory time.
> >
> >             | Enabling dirty logging time      | Iteration 1 dirty memory time
> > ---------- | -------------------------------- | -----------------------------
> > Before     | 0.001683739s                     | 2.943733325s
> > After      | 1.546904175s                     | 0.145979748s
> >
> > Testing
> > =======
> >
> > - Ran all kvm-unit-tests and KVM selftests on debug and non-debug kernels.
> >
> > - Ran dirty_log_perf_test with different backing sources (anonymous,
> >    anonymous_thp, anonymous_hugetlb_2mb, anonymous_hugetlb_1gb) with and
> >    without Eager Page Splitting enabled.
> >
> > - Added a tracepoint locally to time the GFP_NOWAIT allocations. Across
> >    40 runs of dirty_log_perf_test using 1GiB HugeTLB with 96 vCPUs there
> >    were only 4 allocations that took longer than 20 microseconds and the
> >    longest was 60 microseconds. None of the GFP_NOWAIT allocations
> >    failed.
> >
> > - I have not been able to trigger a GFP_NOWAIT allocation failure (to
> >    exercise the fallback path). However, I did manually modify the code
> >    to force every allocation to fall back by removing the GFP_NOWAIT
> >    allocation altogether to make sure the logic works correctly.
> >
> > - Live migrated a 32 vCPU 32 GiB Linux VM running a workload that
> >    aggressively writes to guest memory with Eager Page Splitting enabled.
> >    Observed pages being split via tracepoint and the pages_{4k,2m,1g}
> >    stats.
>
> Queued, thanks!

Thanks Paolo!

I should have the shadow MMU implementation out for review shortly.

>
> Paolo
>
> > Version Log
> > ===========
> >
> > v2:
> >
> > [Overall Changes]
> >   - Additional testing by live migrating a Linux VM (see above).
> >   - Add Sean's, Ben's, and Peter's Reviewed-by tags.
> >   - Use () when referring to functions in commit message and comments [Sean]
> >   - Add TDP MMU to shortlog where applicable [Sean]
> >   - Fix grammatical errors in commit messages [Sean]
> >   - Break 80+ char function declarations across multiple lines [Sean]
> >
> > [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
> >   - Remove useless empty line [Peter]
> >   - Tighten up the wording in comments [Sean]
> >   - Move result of rcu_dereference() to a local variable to cut down line lengths [Sean]
> >
> > [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
> >   - Add prep patch to return 0/-EBUSY instead of bool [Sean]
> >   - Add prep patch to rename {un,}link_page to {un,}link_sp [Sean]
> >   - Fold tdp_mmu_link_page() into tdp_mmu_install_sp_atomic() [Sean]
> >
> > [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
> >   - Make inline [Sean]
> >   - Eliminate WARN_ON_ONCE [Sean]
> >   - Eliminate unnecessary local variable new_spte [Sean]
> >
> > [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
> >   - Eliminate unnecessary local variable root_pt [Sean]
> >
> > [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent
> >   - Eliminate redundant role overrides [Sean]
> >
> > [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization
> >   - Rename alloc_tdp_mmu_page*() functions [Sean]
> >
> > [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
> >   - Drop access from make_huge_page_split_spte() [Sean]
> >   - Drop is_mmio_spte() check from make_huge_page_split_spte() [Sean]
> >   - Change WARN_ON to WARN_ON_ONCE in make_huge_page_split_spte() [Sean]
> >   - Improve comment for making 4K SPTEs executable [Sean]
> >   - Rename mark_spte_executable() to make_spte_executable() [Sean]
> >   - Put return type on same line as tdp_mmu_split_huge_page_atomic() [Sean]
> >   - Drop child_spte local variable in tdp_mmu_split_huge_page_atomic() [Sean]
> >   - Make alloc_tdp_mmu_page_for_split() play nice with
> >     commit 3a0f64de479c ("KVM: x86/mmu: Don't advance iterator after restart due to yielding") [Sean]
> >   - Free unused sp after dropping RCU [Sean]
> >   - Rename module param to something shorter [Sean]
> >   - Document module param somewhere [Sean]
> >   - Fix rcu_read_unlock in tdp_mmu_split_huge_pages_root() [me]
> >   - Document TLB flush behavior [Peter]
> >
> > [PATCH v1 10/13] KVM: Push MMU locking down into kvm_arch_mmu_enable_log_dirty_pt_masked
> >   - Drop [Peter]
> >
> > [PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG
> >   - Hold the lock in write-mode when splitting [Peter]
> >   - Document TLB flush behavior [Peter]
> >
> > [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages
> >   - Record if split succeeded or failed [Sean]
> >
> > v1: https://lore.kernel.org/kvm/20211213225918.672507-1-dmatlack@xxxxxxxxxx/
> >
> > [Overall Changes]
> >   - Use "huge page" instead of "large page" [Sean Christopherson]
> >
> > [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
> >   - Add Ben's Reviewed-by.
> >   - Add Peter's Reviewed-by.
> >
> > [RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
> >   - Add comment when updating old_spte [Ben Gardon]
> >   - Follow kernel style of else case in zap_gfn_range [Ben Gardon]
> >   - Don't delete old_spte update after zapping in kvm_tdp_mmu_map [me]
> >
> > [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table
> >   - Add blurb to commit message describing intentional drop of tracepoint [Ben Gardon]
> >   - Consolidate "u64 spte = make_nonleaf_spte(...);" onto one line [Sean Christopherson]
> >   - Do not free the sp if set fails  [Sean Christopherson]
> >
> > [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct
> >   - Drop to adopt Sean's proposed allocation scheme.
> >
> > [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
> >   - No changes.
> >
> > [RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu
> >   - Drop to adopt Sean's proposed allocation scheme.
> >
> > [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes
> >   - Drop this commit and the helper function [Sean Christopherson]
> >
> > [RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
> >   - Add Ben's Reviewed-by.
> >
> > [RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched
> >   - Drop to adopt Sean's proposed allocation scheme.
> >
> > [RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
> >   - Add Ben's Reviewed-by.
> >
> > [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
> >   - Add a module parameter to control Eager Page Splitting [Peter Xu]
> >   - Change level to large_spte_level [Ben Gardon]
> >   - Get rid of BUG_ONs [Ben Gardon]
> >   - Change += to |= and add a comment [Ben Gardon]
> >   - Do not flush TLBs when dropping the MMU lock. [Sean Christopherson]
> >   - Allocate memory directly from the kernel instead of using mmu_caches [Sean Christopherson]
> >
> > [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
> >   - Fix deadlock by refactoring MMU locking and dropping write lock before splitting. [kernel test robot]
> >   - Did not follow Sean's suggestion of skipping write-protection if splitting
> >     succeeds as it would require extra complexity since we aren't splitting
> >     pages in the shadow MMU yet.
> >
> > [RFC PATCH 14/15] KVM: x86/mmu: Add tracepoint for splitting large pages
> >   - No changes.
> >
> > [RFC PATCH 15/15] KVM: x86/mmu: Update page stats when splitting large pages
> >   - Squash into patch that first introduces page splitting.
> >
> > Note: I opted not to change TDP MMU functions to return int instead of
> > bool per Sean's suggestion. I agree this change should be done but can
> > be left to a separate series.
> >
> > RFC: https://lore.kernel.org/kvm/20211119235759.1304274-1-dmatlack@xxxxxxxxxx/
> >
> > [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@xxxxxxxxxxxxxx/#t
> >
> > David Matlack (18):
> >    KVM: x86/mmu: Rename rmap_write_protect() to
> >      kvm_vcpu_write_protect_gfn()
> >    KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect()
> >    KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
> >    KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return
> >      0/-EBUSY
> >    KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages
> >    KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to
> >      handle_removed_pt()
> >    KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU
> >      page table
> >    KVM: x86/mmu: Remove unnecessary warnings from
> >      restore_acc_track_spte()
> >    KVM: x86/mmu: Drop new_spte local variable from
> >      restore_acc_track_spte()
> >    KVM: x86/mmu: Move restore_acc_track_spte() to spte.h
> >    KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root
> >    KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages
> >    KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent
> >    KVM: x86/mmu: Separate TDP MMU shadow page allocation and
> >      initialization
> >    KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty
> >      logging is enabled
> >    KVM: x86/mmu: Split huge pages mapped by the TDP MMU during
> >      KVM_CLEAR_DIRTY_LOG
> >    KVM: x86/mmu: Add tracepoint for splitting huge pages
> >    KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and
> >      INITIALLY_SET
> >
> >   .../admin-guide/kernel-parameters.txt         |  26 ++
> >   arch/x86/include/asm/kvm_host.h               |   7 +
> >   arch/x86/kvm/mmu/mmu.c                        |  79 ++--
> >   arch/x86/kvm/mmu/mmutrace.h                   |  23 +
> >   arch/x86/kvm/mmu/spte.c                       |  59 +++
> >   arch/x86/kvm/mmu/spte.h                       |  16 +
> >   arch/x86/kvm/mmu/tdp_iter.c                   |   8 +-
> >   arch/x86/kvm/mmu/tdp_iter.h                   |  10 +-
> >   arch/x86/kvm/mmu/tdp_mmu.c                    | 419 +++++++++++++-----
> >   arch/x86/kvm/mmu/tdp_mmu.h                    |   5 +
> >   arch/x86/kvm/x86.c                            |   6 +
> >   arch/x86/kvm/x86.h                            |   2 +
> >   .../selftests/kvm/dirty_log_perf_test.c       |  13 +-
> >   13 files changed, 520 insertions(+), 153 deletions(-)
> >
> >
> > base-commit: edb9e50dbe18394d0fc9d0494f5b6046fc912d33
>