On Thu, Mar 7, 2024 at 11:41 AM David Matlack <dmatlack@xxxxxxxxxx> wrote:
>
> Process SPTEs zapped under the read-lock after the TLB flush and
> replacement of REMOVED_SPTE with 0. This minimizes the contention on the
> child SPTEs (if zapping an SPTE that points to a page table) and
> minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE.
>
> In VMs with a large number (400+) of vCPUs, it can take KVM multiple
> seconds to process a 1GiB region mapped with 4KiB entries, e.g. when
> disabling dirty logging in a VM backed by 1GiB HugeTLB. During those
> seconds, if a vCPU accesses the 1GiB region being zapped it will be
> stalled until KVM finishes processing the SPTE and replaces the
> REMOVED_SPTE with 0.
>
> Re-ordering the processing does speed up the atomic zaps somewhat, but
> the main benefit is avoiding blocking vCPU threads.
>
> Before:
>
>  $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
>  ...
>  Disabling dirty logging time: 509.765146313s
>
>  $ ./funclatency -m tdp_mmu_zap_spte_atomic
>
>       msec            : count    distribution
>          0 -> 1       : 0        |                                        |
>          2 -> 3       : 0        |                                        |
>          4 -> 7       : 0        |                                        |
>          8 -> 15      : 0        |                                        |
>         16 -> 31      : 0        |                                        |
>         32 -> 63      : 0        |                                        |
>         64 -> 127     : 0        |                                        |
>        128 -> 255     : 8        |**                                      |
>        256 -> 511     : 68       |******************                      |
>        512 -> 1023    : 129      |**********************************      |
>       1024 -> 2047    : 151      |****************************************|
>       2048 -> 4095    : 60       |***************                         |
>
> After:
>
>  $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
>  ...
>  Disabling dirty logging time: 336.516838548s
>
>  $ ./funclatency -m tdp_mmu_zap_spte_atomic
>
>       msec            : count    distribution
>          0 -> 1       : 0        |                                        |
>          2 -> 3       : 0        |                                        |
>          4 -> 7       : 0        |                                        |
>          8 -> 15      : 0        |                                        |
>         16 -> 31      : 0        |                                        |
>         32 -> 63      : 0        |                                        |
>         64 -> 127     : 0        |                                        |
>        128 -> 255     : 12       |**                                      |
>        256 -> 511     : 166      |****************************************|
>        512 -> 1023    : 101      |************************                |
>       1024 -> 2047    : 137      |*********************************       |

Nice! The whole 2048 -> 4095 bucket is gone.
>
> KVM's processing of collapsible SPTEs is still extremely slow and can be
> improved. For example, a significant amount of time is spent calling
> kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, which is
> redundant when processing SPTEs that all map the same folio.
>
> Cc: Vipin Sharma <vipinsh@xxxxxxxxxx>
> Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 81 ++++++++++++++++++++++++++------------
>  1 file changed, 55 insertions(+), 26 deletions(-)
>

Reviewed-by: Vipin Sharma <vipinsh@xxxxxxxxxx>