Hi Fuad,

On Tue, Jan 25, 2022 at 5:22 AM Fuad Tabba <tabba@xxxxxxxxxx> wrote:
>
> Hi Jing,
>
> On Tue, Jan 18, 2022 at 1:57 AM Jing Zhang <jingzhangos@xxxxxxxxxx> wrote:
> >
> > This patch series reduces the performance degradation of guest workloads
> > during dirty logging on ARM64. A fast path is added to handle permission
> > relaxation during dirty logging. The MMU lock is replaced with an rwlock,
> > by which all permission relaxations on leaf PTEs can be performed under
> > the read lock. This greatly reduces the MMU lock contention during dirty
> > logging. With this solution, the source guest workload performance
> > degradation can be reduced by more than 60%.
> >
> > Problem:
> > * A Google internal live migration test shows that the source guest
> > workload performance has >99% degradation for about 105 seconds, >50%
> > degradation for about 112 seconds, and >10% degradation for about 112
> > seconds on ARM64. This shows that most of the time the guest workload
> > degradation is above 99%, which obviously needs some improvement compared
> > to the test result on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> > * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB, PageSize: 4K
> > * VM spec: #vCPU: 48, #Mem/vCPU: 4GB, PageSize: 4K, 2M hugepage backed
> >
> > Analysis:
> > * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test
> > to get the number of contentions of the MMU lock and the "dirty memory
> > time" for various VM specs. The "dirty memory time" is the time vCPU
> > threads spend in KVM after a fault. A higher "dirty memory time" means
> > higher degradation of the guest workload.
> > '-m 2' specifies the mode "PA-bits:48, VA-bits:48, 4K pages".
> > By using the test command
> > ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
> > Below are the results:
> > +-------+------------------------+-----------------------+
> > | #vCPU | dirty memory time (ms) | number of contentions |
> > +-------+------------------------+-----------------------+
> > |     1 |                    926 |                     0 |
> > +-------+------------------------+-----------------------+
> > |     2 |                   1189 |               4732558 |
> > +-------+------------------------+-----------------------+
> > |     4 |                   2503 |              11527185 |
> > +-------+------------------------+-----------------------+
> > |     8 |                   5069 |              24881677 |
> > +-------+------------------------+-----------------------+
> > |    16 |                  10340 |              50347956 |
> > +-------+------------------------+-----------------------+
> > |    32 |                  20351 |             100605720 |
> > +-------+------------------------+-----------------------+
> > |    64 |                  40994 |             201442478 |
> > +-------+------------------------+-----------------------+
> > * From the test results above, the "dirty memory time" and the number of
> > MMU lock contentions scale with the number of vCPUs. That means all the
> > dirty memory operations from all vCPU threads have been serialized by
> > the MMU lock. Further analysis also shows that the permission relaxation
> > during dirty logging is where vCPU threads get serialized.
>
> I am curious about any changes to performance for this case (the base
> case) with the changes in patch 3.
>
> Thanks,
> /fuad
>
> >
> > Solution:
> > * On ARM64, there is no mechanism such as PML (Page Modification Logging),
> > and the dirty-bit solution for dirty logging is much more complicated
> > compared to the write-protection solution. The straightforward way to
> > reduce the guest performance degradation is to enhance the concurrency
> > for the permission fault path during dirty logging.
> > * In this patch, we only put the leaf PTE permission relaxation for
> > dirty logging under the read lock; all other operations still go under
> > the write lock.
> > Below are the results based on the fast path solution:
> > +-------+------------------------+
> > | #vCPU | dirty memory time (ms) |
> > +-------+------------------------+
> > |     1 |                    965 |
> > +-------+------------------------+
> > |     2 |                   1006 |
> > +-------+------------------------+
> > |     4 |                   1128 |
> > +-------+------------------------+
> > |     8 |                   2005 |
> > +-------+------------------------+
> > |    16 |                   3903 |
> > +-------+------------------------+
> > |    32 |                   7595 |
> > +-------+------------------------+
> > |    64 |                  15783 |
> > +-------+------------------------+
> > * Further analysis shows that there is another bottleneck caused by the
> > setup of the test code itself. The 3rd commit is meant to fix that by
> > setting up the vgic in the test code. With the test code fix, below are
> > the results, which show a bigger improvement.
> > +-------+------------------------+
> > | #vCPU | dirty memory time (ms) |
> > +-------+------------------------+
> > |     1 |                    803 |
> > +-------+------------------------+
> > |     2 |                    843 |
> > +-------+------------------------+
> > |     4 |                    942 |
> > +-------+------------------------+
> > |     8 |                   1458 |
> > +-------+------------------------+
> > |    16 |                   2853 |
> > +-------+------------------------+
> > |    32 |                   5886 |
> > +-------+------------------------+
> > |    64 |                  12190 |
> > +-------+------------------------+
> > All "dirty memory time" values have been reduced by more than 60% as the
> > number of vCPUs grows.
> > * Based on this solution, the results from the Google internal live
> > migration test also show more than 60% improvement, with >99% for 30s,
> > >50% for 58s and >10% for 76s.
> >
> > ---
> >
> > * v1 -> v2
> >   - Renamed flag name from use_mmu_readlock to logging_perm_fault.
> >   - Removed unnecessary check for fault_granule to use readlock.
> > * RFC -> v1
> >   - Rebase to kvm/queue, commit fea31d169094
> >     (KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event)
> >   - Moved the fast path in user_mem_abort, as suggested by Marc.
> >   - Addressed other comments from Marc.
> >
> > [v1] https://lore.kernel.org/all/20220113221829.2785604-1-jingzhangos@xxxxxxxxxx
> > [RFC] https://lore.kernel.org/all/20220110210441.2074798-1-jingzhangos@xxxxxxxxxx
> >
> > ---
> >
> > Jing Zhang (3):
> >   KVM: arm64: Use read/write spin lock for MMU protection
> >   KVM: arm64: Add fast path to handle permission relaxation during dirty
> >     logging
> >   KVM: selftests: Add vgic initialization for dirty log perf test for
> >     ARM
> >
> >  arch/arm64/include/asm/kvm_host.h        |  2 +
> >  arch/arm64/kvm/mmu.c                     | 49 ++++++++++++-------
> >  .../selftests/kvm/dirty_log_perf_test.c  | 10 ++++
> >  3 files changed, 43 insertions(+), 18 deletions(-)
> >
> >
> > base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> > --
> > 2.34.1.703.g22d0c6ccf7-goog
> >
> > _______________________________________________
> > kvmarm mailing list
> > kvmarm@xxxxxxxxxxxxxxxxxxxxx
> > https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

Thanks for all your reviews and testing.

Jing
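
P.S. For anyone skimming the thread, the idea behind patch 2 is roughly the
sketch below. It is only an illustration of the approach, not the actual
diff: the function name, the logging_perm_fault flag and the surrounding
control flow are simplified placeholders, while read_lock()/write_lock(),
kvm_pgtable_stage2_relax_perms() and KVM_PGTABLE_PROT_W are the existing
kernel interfaces the series builds on.

#include <linux/kvm_host.h>
#include <asm/kvm_pgtable.h>

/*
 * Sketch: a permission fault taken while dirty logging is enabled only
 * needs to relax the permissions of an existing leaf PTE, so it can run
 * concurrently under the read side of mmu_lock (converted to an rwlock
 * by patch 1).  Everything that may change the page table structure
 * keeps taking the write lock.
 */
static int stage2_relax_perms_sketch(struct kvm *kvm, struct kvm_pgtable *pgt,
				     phys_addr_t fault_ipa,
				     bool logging_perm_fault)
{
	int ret;

	if (logging_perm_fault) {
		/* Fast path: vCPUs only contend on the read lock here. */
		read_lock(&kvm->mmu_lock);
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa,
						     KVM_PGTABLE_PROT_W);
		read_unlock(&kvm->mmu_lock);
		return ret;
	}

	/* Slow path: map/unmap and block splitting still serialize here. */
	write_lock(&kvm->mmu_lock);
	/* ... the existing user_mem_abort() mapping logic runs here ... */
	write_unlock(&kvm->mmu_lock);
	return 0;
}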
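
The selftest change in patch 3 is similarly small: on arm64 the test VM is
given a vgic-v3 before the vCPUs start dirtying memory, along the lines of
the sketch below. The helper name vgic_v3_setup() comes from the aarch64
selftest library, but the argument list, the GPA constants and the wrapper
shown here are illustrative assumptions rather than the exact patch.

#ifdef __aarch64__
#include "kvm_util.h"
#include "aarch64/vgic.h"

/* Hypothetical distributor/redistributor base GPAs for the test VM. */
#define TEST_GICD_BASE_GPA	0x08000000ULL
#define TEST_GICR_BASE_GPA	0x080a0000ULL

/*
 * Sketch: create a GICv3 for the dirty_log_perf_test VM so that the
 * test-setup bottleneck mentioned in the cover letter no longer skews
 * the "dirty memory time" measurements.
 */
static void guest_vgic_setup_sketch(struct kvm_vm *vm, int nr_vcpus)
{
	/* 64 interrupts is an arbitrary choice for this sketch. */
	vgic_v3_setup(vm, nr_vcpus, 64, TEST_GICD_BASE_GPA, TEST_GICR_BASE_GPA);
}
#endif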