Re: [RFC PATCH 0/3] ARM64: Guest performance improvement during dirty logging

On Wed, Jan 12, 2022 at 07:50:48PM -0800, Jing Zhang wrote:
> On Wed, Jan 12, 2022 at 6:50 PM Ricardo Koller <ricarkol@xxxxxxxxxx> wrote:
> >
> > Hi Jing,
> >
> > On Mon, Jan 10, 2022 at 09:04:38PM +0000, Jing Zhang wrote:
> > > This series reduces the performance degradation of guest workloads during
> > > dirty logging on ARM64. A fast path is added to handle permission relaxation
> > > during dirty logging, and the MMU lock is converted to a rwlock so that all
> > > permission relaxations on leaf PTEs can be performed under the read lock.
> > > This greatly reduces MMU lock contention during dirty logging. With this
> > > solution, the source guest workload performance degradation is reduced by
> > > more than 60%.
> > >
> > > Problem:
> > >   * A Google internal live migration test shows that on ARM64 the source
> > >   guest workload performance has >99% degradation for about 105 seconds,
> > >   >50% degradation for about 112 seconds, and >10% degradation for about
> > >   112 seconds. In other words, for most of that window the guest workload
> > >   degradation is above 99%, which clearly needs improvement compared to the
> > >   result on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> > >   * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> > >   * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
> > >
> > > Analysis:
> > >   * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test
> > >     to get the number of MMU lock contentions and the "dirty memory time"
> > >     for various VM specs, using the test command
> > >     ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
> > >     Below are the results:
> > >     +-------+------------------------+-----------------------+
> > >     | #vCPU | dirty memory time (ms) | number of contentions |
> > >     +-------+------------------------+-----------------------+
> > >     | 1     | 926                    | 0                     |
> > >     +-------+------------------------+-----------------------+
> > >     | 2     | 1189                   | 4732558               |
> > >     +-------+------------------------+-----------------------+
> > >     | 4     | 2503                   | 11527185              |
> > >     +-------+------------------------+-----------------------+
> > >     | 8     | 5069                   | 24881677              |
> > >     +-------+------------------------+-----------------------+
> > >     | 16    | 10340                  | 50347956              |
> > >     +-------+------------------------+-----------------------+
> > >     | 32    | 20351                  | 100605720             |
> > >     +-------+------------------------+-----------------------+
> > >     | 64    | 40994                  | 201442478             |
> > >     +-------+------------------------+-----------------------+
> > >
> > >   * From the test results above, both the "dirty memory time" and the number
> > >     of MMU lock contentions scale with the number of vCPUs. That means the
> > >     dirty memory operations from all vCPU threads are serialized by the
> > >     MMU lock. Further analysis also shows that the permission relaxation
> > >     during dirty logging is where the vCPU threads get serialized.
> > >
> > > Solution:
> > >   * On ARM64, there is no mechanism like PML (Page Modification Logging), and
> > >     a dirty-bit solution for dirty logging would be much more complicated than
> > >     the write-protection solution. The straightforward way to reduce the guest
> > >     performance degradation is to increase the concurrency of the permission
> > >     fault path during dirty logging.
> > >   * In this series, only the leaf PTE permission relaxation for dirty logging
> > >     is done under the read lock; everything else still takes the write lock
> > >     (see the sketch after the results table below).
> > >     Below are the results with this solution:
> > >     +-------+------------------------+
> > >     | #vCPU | dirty memory time (ms) |
> > >     +-------+------------------------+
> > >     | 1     | 803                    |
> > >     +-------+------------------------+
> > >     | 2     | 843                    |
> > >     +-------+------------------------+
> > >     | 4     | 942                    |
> > >     +-------+------------------------+
> > >     | 8     | 1458                   |
> > >     +-------+------------------------+
> > >     | 16    | 2853                   |
> > >     +-------+------------------------+
> > >     | 32    | 5886                   |
> > >     +-------+------------------------+
> > >     | 64    | 12190                  |
> > >     +-------+------------------------+
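> > >
> > >     To make the locking split concrete, here is a rough sketch of the fast
> > >     path (not the actual patches: it assumes the context of
> > >     arch/arm64/kvm/mmu.c after mmu_lock has been converted to an rwlock_t,
> > >     and the helper name stage2_relax_perms_fast() is made up for
> > >     illustration; only the read/write split is the point):
> > >
> > >     /*
> > >      * Sketch: a stage-2 permission fault taken while dirty logging is
> > >      * active only needs to relax the permissions of an existing leaf PTE,
> > >      * so it can run under the read lock and vCPUs faulting on different
> > >      * pages no longer serialize on mmu_lock.
> > >      */
> > >     static int stage2_relax_perms_fast(struct kvm_vcpu *vcpu,
> > >                                        phys_addr_t fault_ipa,
> > >                                        enum kvm_pgtable_prot prot)
> > >     {
> > >             struct kvm *kvm = vcpu->kvm;
> > >             struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > >             int ret;
> > >
> > >             read_lock(&kvm->mmu_lock);
> > >             ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
> > >             read_unlock(&kvm->mmu_lock);
> > >
> > >             return ret;
> > >     }
> > >
> > >     Everything that changes the page table structure (mapping new pages,
> > >     collapsing or splitting blocks, unmapping) still runs under
> > >     write_lock(&kvm->mmu_lock); the read lock only covers in-place updates
> > >     of the permission bits of existing leaf entries.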
> >
> > Just curious, do you know why the time still (roughly) doubles with the
> > number of vCPUs? Maybe you performed another experiment or have some
> > guess(es).
> Yes, it is from the serialization caused by the TLB flush issued whenever
> a permission is relaxed. I tried a test with the TLB flushes removed (of
> course they shouldn't be removed), and the time stays close to a constant
> no matter the number of vCPUs.
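>
> For reference, the flush I mean is the one at the end of the permission
> relaxation path. Roughly (paraphrased from my reading of
> kvm_pgtable_stage2_relax_perms() in arch/arm64/kvm/hyp/pgtable.c; the
> comments are my interpretation and the exact code may differ):
>
>         /* Update the permission bits of the existing leaf PTE... */
>         ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level);
>         if (!ret)
>                 /*
>                  * ...then invalidate the stale TLB entry for that IPA.
>                  * The broadcast TLBI, and the barrier that waits for it
>                  * to complete on all CPUs, is what still serializes the
>                  * vCPU threads even though they now hold mmu_lock for read.
>                  */
>                 kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level);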

Got it, thanks for the info.

Ricardo

> >
> > Thanks,
> > Ricardo
> >
> > >     All of the "dirty memory time" results are reduced by more than 60% as
> > >     the number of vCPUs grows.
> > >
> > > ---
> > >
> > > Jing Zhang (3):
> > >   KVM: arm64: Use read/write spin lock for MMU protection
> > >   KVM: arm64: Add fast path to handle permission relaxation during dirty
> > >     logging
> > >   KVM: selftests: Add vgic initialization for dirty log perf test for
> > >     ARM
> > >
> > >  arch/arm64/include/asm/kvm_host.h             |  2 +
> > >  arch/arm64/kvm/mmu.c                          | 86 +++++++++++++++----
> > >  .../selftests/kvm/dirty_log_perf_test.c       | 10 +++
> > >  3 files changed, 80 insertions(+), 18 deletions(-)
> > >
> > >
> > > base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> > > --
> > > 2.34.1.575.g55b058a8bb-goog
> > >
> > > _______________________________________________
> > > kvmarm mailing list
> > > kvmarm@xxxxxxxxxxxxxxxxxxxxx
> > > https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
> Thanks,
> Jing


