Hi Jing,

On Tue, Jan 18, 2022 at 1:57 AM Jing Zhang <jingzhangos@xxxxxxxxxx> wrote:
>
> This series reduces the performance degradation of the guest workload
> during dirty logging on ARM64. A fast path is added to handle permission
> relaxation during dirty logging. The MMU lock is replaced with an rwlock,
> so that all permission relaxations on leaf PTEs can be performed under
> the read lock. This greatly reduces MMU lock contention during dirty
> logging. With this solution, the source guest workload performance
> degradation is reduced by more than 60%.
>
> Problem:
> * A Google internal live migration test shows that the source guest
>   workload performance has >99% degradation for about 105 seconds, >50%
>   degradation for about 112 seconds, and >10% degradation for about 112
>   seconds on ARM64. In other words, the guest workload degradation is
>   above 99% for most of the migration, which clearly needs improvement
>   compared to the results on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB, PageSize: 4K
> * VM spec: #vCPU: 48, #Mem/vCPU: 4GB, PageSize: 4K, 2M hugepage backed
>
> Analysis:
> * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test
>   to get the number of contentions on the MMU lock and the "dirty memory
>   time" for various VM specs. The "dirty memory time" is the time vCPU
>   threads spend in KVM after a fault; a higher "dirty memory time" means
>   greater degradation of the guest workload.
>   '-m 2' specifies the mode "PA-bits:48, VA-bits:48, 4K pages".
>   Using the test command
>   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
>   the results are:
>   +-------+------------------------+-----------------------+
>   | #vCPU | dirty memory time (ms) | number of contentions |
>   +-------+------------------------+-----------------------+
>   |     1 |                    926 |                     0 |
>   +-------+------------------------+-----------------------+
>   |     2 |                   1189 |               4732558 |
>   +-------+------------------------+-----------------------+
>   |     4 |                   2503 |              11527185 |
>   +-------+------------------------+-----------------------+
>   |     8 |                   5069 |              24881677 |
>   +-------+------------------------+-----------------------+
>   |    16 |                  10340 |              50347956 |
>   +-------+------------------------+-----------------------+
>   |    32 |                  20351 |             100605720 |
>   +-------+------------------------+-----------------------+
>   |    64 |                  40994 |             201442478 |
>   +-------+------------------------+-----------------------+
> * From the test results above, the "dirty memory time" and the number of
>   MMU lock contentions scale with the number of vCPUs. That means all
>   the dirty memory operations from all vCPU threads have been serialized
>   by the MMU lock. Further analysis also shows that permission relaxation
>   during dirty logging is where the vCPU threads get serialized.

I am curious about any changes to performance for this case (the base
case) with the changes in patch 3.

Thanks,
/fuad

>
> Solution:
> * On ARM64 there is no mechanism like PML (Page Modification Logging),
>   and a dirty-bit solution for dirty logging is much more complicated
>   than the write-protection solution. The straightforward way to reduce
>   the guest performance degradation is therefore to increase concurrency
>   on the permission fault path during dirty logging.
> * In this series, only leaf PTE permission relaxation for dirty logging
>   is performed under the read lock; everything else still takes the
>   write lock.
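
As an aside for others following the thread, my reading of the patch 2
fast path boils down to roughly the sketch below. This is not the
literal diff: error handling and the rest of user_mem_abort() are
elided, and logging_perm_fault is simply the flag name mentioned in the
v2 changelog.

	/*
	 * A write permission fault on an existing leaf mapping while
	 * dirty logging is active only needs to relax permissions, so
	 * it can run under the read side of the rwlock.
	 */
	bool logging_perm_fault = logging_active && write_fault &&
				  fault_status == FSC_PERM;

	if (logging_perm_fault)
		read_lock(&kvm->mmu_lock);
	else
		write_lock(&kvm->mmu_lock);

	if (logging_perm_fault) {
		/*
		 * The mapping already exists and is merely
		 * write-protected: set the write permission back
		 * without breaking the mapping, so many vCPUs can
		 * fault in parallel.
		 */
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa,
						     KVM_PGTABLE_PROT_W);
	} else {
		/* All other aborts still serialize on the write lock. */
		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
					     __pfn_to_phys(pfn), prot,
					     memcache);
	}

	if (logging_perm_fault)
		read_unlock(&kvm->mmu_lock);
	else
		write_unlock(&kvm->mmu_lock);

Since relaxing permissions only flips bits on an already-valid leaf PTE
and never changes the table structure, the read lock appears sufficient
there, which is where the contention win comes from.
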
> Below are the results based on the fast path solution:
>   +-------+------------------------+
>   | #vCPU | dirty memory time (ms) |
>   +-------+------------------------+
>   |     1 |                    965 |
>   +-------+------------------------+
>   |     2 |                   1006 |
>   +-------+------------------------+
>   |     4 |                   1128 |
>   +-------+------------------------+
>   |     8 |                   2005 |
>   +-------+------------------------+
>   |    16 |                   3903 |
>   +-------+------------------------+
>   |    32 |                   7595 |
>   +-------+------------------------+
>   |    64 |                  15783 |
>   +-------+------------------------+
>
> * Further analysis shows another bottleneck, caused by the setup of the
>   test code itself. The 3rd commit fixes that by setting up the vgic in
>   the test code. With the test code fixed, the results below show a
>   further improvement:
>   +-------+------------------------+
>   | #vCPU | dirty memory time (ms) |
>   +-------+------------------------+
>   |     1 |                    803 |
>   +-------+------------------------+
>   |     2 |                    843 |
>   +-------+------------------------+
>   |     4 |                    942 |
>   +-------+------------------------+
>   |     8 |                   1458 |
>   +-------+------------------------+
>   |    16 |                   2853 |
>   +-------+------------------------+
>   |    32 |                   5886 |
>   +-------+------------------------+
>   |    64 |                  12190 |
>   +-------+------------------------+
>   The "dirty memory time" is reduced by more than 60% as the number of
>   vCPUs grows.
> * Based on this solution, the Google internal live migration test also
>   shows more than a 60% improvement: >99% degradation for 30s, >50% for
>   58s, and >10% for 76s.
>
> ---
>
> * v1 -> v2
>   - Renamed the flag from use_mmu_readlock to logging_perm_fault.
>   - Removed the unnecessary fault_granule check for taking the read lock.
> * RFC -> v1
>   - Rebased onto kvm/queue, commit fea31d169094
>     (KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event)
>   - Moved the fast path into user_mem_abort, as suggested by Marc.
>   - Addressed other comments from Marc.
>
> [v1] https://lore.kernel.org/all/20220113221829.2785604-1-jingzhangos@xxxxxxxxxx
> [RFC] https://lore.kernel.org/all/20220110210441.2074798-1-jingzhangos@xxxxxxxxxx
>
> ---
>
> Jing Zhang (3):
>   KVM: arm64: Use read/write spin lock for MMU protection
>   KVM: arm64: Add fast path to handle permission relaxation during dirty
>     logging
>   KVM: selftests: Add vgic initialization for dirty log perf test for
>     ARM
>
>  arch/arm64/include/asm/kvm_host.h       |  2 +
>  arch/arm64/kvm/mmu.c                    | 49 ++++++++++++-------
>  .../selftests/kvm/dirty_log_perf_test.c | 10 ++++
>  3 files changed, 43 insertions(+), 18 deletions(-)
>
> base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> --
> 2.34.1.703.g22d0c6ccf7-goog
>
> _______________________________________________
> kvmarm mailing list
> kvmarm@xxxxxxxxxxxxxxxxxxxxx
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
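
One more note, for anyone reproducing the selftest numbers: as I read
patch 3, the test code fix amounts to something like the sketch below
in dirty_log_perf_test.c. The GICD/GICR base GPAs and the nr_irqs
value (64) are assumptions on my part; vgic_v3_setup() is the existing
helper from tools/testing/selftests/kvm/lib/aarch64/vgic.c.

#ifdef __aarch64__
#include "aarch64/vgic.h"

/* Assumed guest physical addresses for the GICv3 frames. */
#define GICD_BASE_GPA	0x8000000ULL
#define GICR_BASE_GPA	0x80A0000ULL
#endif

	...

#ifdef __aarch64__
	/*
	 * Create an in-kernel GICv3 for the test VM before running the
	 * vCPUs, so the guest does not fall back to the userspace
	 * irqchip path while dirty memory time is being measured.
	 */
	vgic_v3_setup(vm, nr_vcpus, 64, GICD_BASE_GPA, GICR_BASE_GPA);
#endif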