Re: [PATCH 00/22] Introduce the TDP MMU

On 25/09/20 23:22, Ben Gardon wrote:
> Over the years, the needs for KVM's x86 MMU have grown from running small
> guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
> we previously depended on shadow paging to run all guests, we now have
> two-dimensional paging (TDP). This patch set introduces a new
> implementation of much of the KVM MMU, optimized for running guests with
> TDP. We have re-implemented many of the MMU functions to take advantage of
> the relative simplicity of TDP and eliminate the need for an rmap.
> Building on this simplified implementation, a future patch set will change
> the synchronization model for this "TDP MMU" to enable more parallelism
> than the monolithic MMU lock. A TDP MMU is currently in use at Google
> and has given us the performance necessary to live migrate our 416 vCPU,
> 12TiB m2-ultramem-416 VMs.
> 
> This work was motivated by the need to handle page faults in parallel for
> very large VMs. When VMs have hundreds of vCPUs and terabytes of memory,
> KVM's MMU lock suffers extreme contention, resulting in soft-lockups and
> long latency on guest page faults. This contention is easily seen when
> running the KVM selftest demand_paging_test with a couple hundred vCPUs.
> Over a 1-second profile of the demand_paging_test, with 416 vCPUs and 4G
> per vCPU, 98% of the time was spent waiting for the MMU lock. At Google,
> the TDP MMU reduced the test duration by 89% and the execution was
> dominated by get_user_pages and the userfaultfd ioctl instead of the
> MMU lock.
> 
> This series is the first of two. In this series we add a basic
> implementation of the TDP MMU. In the next series we will improve the
> performance of the TDP MMU and allow it to execute MMU operations
> in parallel.
> 
> The overall purpose of the KVM MMU is to program paging structures
> (CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
> addresses (HPA), and to provide utilities for other KVM features, for
> example dirty logging. The definition of the L1 guest physical address
> (GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to HVA,
> and the kernel MM/x86 host page tables map HVA -> HPA. Without TDP, the
> MMU must program the x86 page tables to encode the full translation of
> guest virtual addresses (GVA) to HPA. This requires "shadowing" the
> guest's page tables to create a composite x86 paging structure. This
> solution is complicated, requires separate paging structures for each
> guest CR3, and requires emulating guest page table changes. The TDP case
> is much simpler. In this case, KVM lets the guest control CR3 and programs
> the EPT/NPT paging structures with the GPA -> HPA mapping. The guest has
> no way to change this mapping and only one version of the paging structure
> is needed per L1 paging mode. In this case the paging mode is some
> combination of the number of levels in the paging structure, the address
> space (normal execution or system management mode, on x86), and other
> attributes. Most VMs only ever use 1 paging mode and so only ever need one
> TDP structure.
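
For readers following along, the two-part GPA -> HPA definition quoted
above can be sketched in a few lines of C. This is only a toy model under
stated assumptions: every toy_* name, the flat lookup tables and the 4 KiB
page size are made up for illustration and are not KVM's data structures
or API. It shows the composition only: memslots give GPA -> HVA, the host
page tables give HVA -> HPA, and the result is what the EPT/NPT structures
would have to encode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12            /* assumption: 4 KiB pages only */
#define TOY_INVALID    UINT64_MAX    /* "no translation" marker */

/* Hypothetical stand-in for a memslot: a GPA range backed by host memory. */
struct toy_memslot {
	uint64_t base_gfn;        /* first guest frame number in the slot */
	uint64_t npages;          /* number of pages in the slot */
	uint64_t userspace_addr;  /* HVA that backs base_gfn */
};

/* Hypothetical stand-in for the host page tables: HVA page -> HPA page. */
struct toy_host_pte {
	uint64_t hva_page;
	uint64_t hpa_page;
};

/* Part 1: the memslots map GPA -> HVA. */
static uint64_t toy_gpa_to_hva(const struct toy_memslot *slots, size_t n,
			       uint64_t gpa)
{
	uint64_t gfn = gpa >> TOY_PAGE_SHIFT;

	assert(slots != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (gfn >= slots[i].base_gfn &&
		    gfn - slots[i].base_gfn < slots[i].npages)
			return slots[i].userspace_addr +
			       (gpa - (slots[i].base_gfn << TOY_PAGE_SHIFT));
	}
	return TOY_INVALID;
}

/* Part 2: the host page tables map HVA -> HPA. */
static uint64_t toy_hva_to_hpa(const struct toy_host_pte *ptes, size_t n,
			       uint64_t hva)
{
	uint64_t page = hva >> TOY_PAGE_SHIFT;
	uint64_t offset = hva & ((1ULL << TOY_PAGE_SHIFT) - 1);

	for (size_t i = 0; i < n; i++) {
		if (ptes[i].hva_page == page)
			return (ptes[i].hpa_page << TOY_PAGE_SHIFT) | offset;
	}
	return TOY_INVALID;
}

/* The composed GPA -> HPA mapping, i.e. what a TDP entry would encode. */
static uint64_t toy_gpa_to_hpa(const struct toy_memslot *slots, size_t nslots,
			       const struct toy_host_pte *ptes, size_t nptes,
			       uint64_t gpa)
{
	uint64_t hva = toy_gpa_to_hva(slots, nslots, gpa);

	if (hva == TOY_INVALID)
		return TOY_INVALID;
	return toy_hva_to_hpa(ptes, nptes, hva);
}
```

Note that nothing in the sketch depends on guest page tables or on CR3,
which is the simplification the TDP case relies on.
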
> 
> This series implements a "TDP MMU" through alternative implementations of
> MMU functions for running L1 guests with TDP. The TDP MMU falls back to
> the existing shadow paging implementation when TDP is not available, and
> interoperates with the existing shadow paging implementation for nesting.
> The use of the TDP MMU can be controlled by a module parameter which is
> snapshotted on VM creation and follows the life of the VM. This snapshot
> is used in many functions to decide whether or not to use TDP MMU handlers
> for a given operation.
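
A minimal sketch of the snapshot behaviour described above, assuming
nothing about the actual patches: toy_tdp_mmu_param, struct toy_vm and the
helper names are hypothetical and exist only in this example. The point is
that the parameter is read once when the VM is created, so changing it
later affects only VMs created afterwards, and that the handlers consult
the per-VM snapshot rather than the parameter itself.

```c
#include <stdbool.h>

/* Hypothetical module parameter; it may change while VMs are running. */
static bool toy_tdp_mmu_param = true;

/* Hypothetical per-VM state holding the snapshot. */
struct toy_vm {
	bool tdp_mmu_enabled;  /* taken once, at VM creation */
};

enum toy_handler { TOY_SHADOW_MMU, TOY_TDP_MMU };

/* Snapshot the parameter; without TDP hardware, fall back to shadow paging. */
static void toy_vm_init(struct toy_vm *vm, bool tdp_available)
{
	vm->tdp_mmu_enabled = tdp_available && toy_tdp_mmu_param;
}

/* MMU operations look at the snapshot, never at the live parameter. */
static enum toy_handler toy_pick_handler(const struct toy_vm *vm)
{
	return vm->tdp_mmu_enabled ? TOY_TDP_MMU : TOY_SHADOW_MMU;
}
```

With this shape, a VM created while the parameter is set keeps using the
TDP MMU handlers even if the parameter is cleared afterwards.
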
> 
> This series can also be viewed in Gerrit here:
> https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> (Thanks to Dmitry Vyukov <dvyukov@xxxxxxxxxx> for setting up the
> Gerrit instance)
> 
> Ben Gardon (22):
>   kvm: mmu: Separate making SPTEs from set_spte
>   kvm: mmu: Introduce tdp_iter
>   kvm: mmu: Init / Uninit the TDP MMU
>   kvm: mmu: Allocate and free TDP MMU roots
>   kvm: mmu: Add functions to handle changed TDP SPTEs
>   kvm: mmu: Make address space ID a property of memslots
>   kvm: mmu: Support zapping SPTEs in the TDP MMU
>   kvm: mmu: Separate making non-leaf sptes from link_shadow_page
>   kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
>   kvm: mmu: Add TDP MMU PF handler
>   kvm: mmu: Factor out allocating a new tdp_mmu_page
>   kvm: mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
>   kvm: mmu: Support invalidate range MMU notifier for TDP MMU
>   kvm: mmu: Add access tracking for tdp_mmu
>   kvm: mmu: Support changed pte notifier in tdp MMU
>   kvm: mmu: Add dirty logging handler for changed sptes
>   kvm: mmu: Support dirty logging for the TDP MMU
>   kvm: mmu: Support disabling dirty logging for the tdp MMU
>   kvm: mmu: Support write protection for nesting in tdp MMU
>   kvm: mmu: NX largepage recovery for TDP MMU
>   kvm: mmu: Support MMIO in the TDP MMU
>   kvm: mmu: Don't clear write flooding count for direct roots
> 
>  arch/x86/include/asm/kvm_host.h |   17 +
>  arch/x86/kvm/Makefile           |    3 +-
>  arch/x86/kvm/mmu/mmu.c          |  437 ++++++----
>  arch/x86/kvm/mmu/mmu_internal.h |   98 +++
>  arch/x86/kvm/mmu/paging_tmpl.h  |    3 +-
>  arch/x86/kvm/mmu/tdp_iter.c     |  198 +++++
>  arch/x86/kvm/mmu/tdp_iter.h     |   55 ++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 1315 +++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h      |   52 ++
>  include/linux/kvm_host.h        |    2 +
>  virt/kvm/kvm_main.c             |    7 +-
>  11 files changed, 2022 insertions(+), 165 deletions(-)
>  create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
>  create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
>  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
>  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h
> 

OK, I haven't finished reading the code, but I already have an idea of
what it's like.  I really think we should fast-track this as the basis
for more 5.11 work.  I'll finish reviewing it and, if you don't mind, I
might make some of the changes myself so that I have a chance to play
with the code and get accustomed to it; speak up if you disagree with
them though!  Another thing I'd like to add is a few tracepoints.

Paolo
