On Thu, Mar 7, 2024 at 11:41 AM David Matlack <dmatlack@xxxxxxxxxx> wrote:
>
> Process SPTEs zapped under the read-lock after the TLB flush and
> replacement of REMOVED_SPTE with 0. This minimizes the contention on the
> child SPTEs (if zapping an SPTE that points to a page table) and
> minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE.
>
> In VMs with a large number (400+) of vCPUs, it can take KVM multiple
> seconds to process a 1GiB region mapped with 4KiB entries, e.g. when
> disabling dirty logging in a VM backed by 1GiB HugeTLB. During those
> seconds, if a vCPU accesses the 1GiB region being zapped it will be
> stalled until KVM finishes processing the SPTE and replaces the
> REMOVED_SPTE with 0.
>
> Re-ordering the processing does speed up the atomic zaps somewhat, but
> the main benefit is avoiding blocking vCPU threads.
>
> Before:
>
>  $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
>  ...
>  Disabling dirty logging time: 509.765146313s
>
>  $ ./funclatency -m tdp_mmu_zap_spte_atomic
>
>       msec            : count    distribution
>          0 -> 1       : 0        |                                        |
>          2 -> 3       : 0        |                                        |
>          4 -> 7       : 0        |                                        |
>          8 -> 15      : 0        |                                        |
>         16 -> 31      : 0        |                                        |
>         32 -> 63      : 0        |                                        |
>         64 -> 127     : 0        |                                        |
>        128 -> 255     : 8        |**                                      |
>        256 -> 511     : 68       |******************                      |
>        512 -> 1023    : 129      |**********************************      |
>       1024 -> 2047    : 151      |****************************************|
>       2048 -> 4095    : 60       |***************                         |
>
> After:
>
>  $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e
>  ...
>  Disabling dirty logging time: 336.516838548s
>
>  $ ./funclatency -m tdp_mmu_zap_spte_atomic
>
>       msec            : count    distribution
>          0 -> 1       : 0        |                                        |
>          2 -> 3       : 0        |                                        |
>          4 -> 7       : 0        |                                        |
>          8 -> 15      : 0        |                                        |
>         16 -> 31      : 0        |                                        |
>         32 -> 63      : 0        |                                        |
>         64 -> 127     : 0        |                                        |
>        128 -> 255     : 12       |**                                      |
>        256 -> 511     : 166      |****************************************|
>        512 -> 1023    : 101      |************************                |
>       1024 -> 2047    : 137      |*********************************       |

Nice! The whole 2048 -> 4095 bucket is gone.
>
> KVM's processing of collapsible SPTEs is still extremely slow and can be
> improved. For example, a significant amount of time is spent calling
> kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, which is
> redundant when processing SPTEs that all map the same folio.
>
> Cc: Vipin Sharma <vipinsh@xxxxxxxxxx>
> Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 81 ++++++++++++++++++++++++++------------
>  1 file changed, 55 insertions(+), 26 deletions(-)
>

Reviewed-by: Vipin Sharma <vipinsh@xxxxxxxxxx>