On 03/21/2018 05:06 PM, Peter Xu wrote:
On Tue, Mar 13, 2018 at 03:57:33PM +0800, guangrong.xiao@xxxxxxxxx wrote:
From: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxx>
Current code uses compress2()/uncompress() to compress/decompress
memory. These two functions manage memory allocation and release
internally, which causes huge amounts of memory to be allocated and
freed very frequently.
Worse, frequently returning memory to the kernel flushes TLBs
and triggers invalidation callbacks on the mmu-notifier, which
interacts with the KVM MMU and dramatically reduces the performance
of the VM.
So, we maintain the memory ourselves and reuse it for each
compression and decompression.
Signed-off-by: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxx>
---
migration/qemu-file.c | 34 ++++++++++--
migration/qemu-file.h | 6 ++-
migration/ram.c | 142 +++++++++++++++++++++++++++++++++++++-------------
3 files changed, 140 insertions(+), 42 deletions(-)
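For illustration, here is a minimal sketch of the same reuse idea on the
decompression side, assuming a z_stream that was set up once elsewhere;
the helper name is hypothetical and not taken from the patch:

#include <stddef.h>
#include <stdint.h>
#include <zlib.h>

/* Hypothetical helper: decompress one buffer by reusing an already
 * initialized z_stream instead of uncompress(), which allocates and
 * frees the inflate state on every call. */
static int decompress_with_reused_stream(z_stream *stream, uint8_t *dest,
                                         size_t dest_len,
                                         const uint8_t *source,
                                         size_t source_len)
{
    /* inflateReset() rewinds the stream but keeps its internal buffers,
     * so no memory is returned to the kernel. */
    if (inflateReset(stream) != Z_OK) {
        return -1;
    }

    stream->avail_in = source_len;
    stream->next_in = (uint8_t *)source;
    stream->avail_out = dest_len;
    stream->next_out = dest;

    /* The whole compressed page is expected to be consumed in one call. */
    if (inflate(stream, Z_FINISH) != Z_STREAM_END) {
        return -1;
    }

    return stream->total_out;
}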
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 2ab2bf362d..1ff33a1ffb 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -658,6 +658,30 @@ uint64_t qemu_get_be64(QEMUFile *f)
return v;
}
+/* return the size after compression, or negative value on error */
+static int qemu_compress_data(z_stream *stream, uint8_t *dest, size_t dest_len,
+ const uint8_t *source, size_t source_len)
+{
+ int err;
+
+ err = deflateReset(stream);
I'm not familiar with zlib, but I saw this in the manual:
https://www.zlib.net/manual.html
This function is equivalent to deflateEnd followed by deflateInit,
but does not free and reallocate the internal compression state. The
stream will leave the compression level and any other attributes that
may have been set unchanged.
I thought it was deflateInit() that is slow? Can we avoid the reset as
long as we make sure to deflateInit() before doing anything else?

deflateEnd() is worse, as it frees memory back to the kernel, which
triggers TLB flushes and the mmu-notifier.
Actually, deflateReset() is cheap... :)
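To make this concrete, here is a minimal sketch of how such a
deflateReset()-based helper could continue, under the assumption that the
stream was initialized once beforehand (hypothetical name, not the actual
patch body):

#include <stddef.h>
#include <stdint.h>
#include <zlib.h>

/* A plausible continuation of the quoted helper (assumption, not the
 * actual patch): compress into a caller-provided buffer, reusing the
 * stream's internal state across calls. */
static int compress_with_reused_stream(z_stream *stream, uint8_t *dest,
                                       size_t dest_len,
                                       const uint8_t *source,
                                       size_t source_len)
{
    /* Rewind the stream; the internal buffers are kept, nothing is freed. */
    if (deflateReset(stream) != Z_OK) {
        return -1;
    }

    stream->avail_in = source_len;
    stream->next_in = (uint8_t *)source;
    stream->avail_out = dest_len;
    stream->next_out = dest;

    /* Z_FINISH compresses the whole input in a single call;
     * Z_STREAM_END means the output fit into dest. */
    if (deflate(stream, Z_FINISH) != Z_STREAM_END) {
        return -1;
    }

    /* Size of the compressed data. */
    return stream->next_out - dest;
}

With this pattern, the only zlib calls on the hot path are deflateReset()
and deflate(), neither of which allocates or frees memory.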
Meanwhile, is there any performance number for this single patch?
Since I thought the old code is calling compress2(), which contains
deflateInit() and deflateEnd() too, just like what the current patch does?
No, after the patch we only call deflateInit() / deflateEnd() once
(in the _setup() and _cleanup() handlers).
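A minimal sketch of that one-time lifecycle, with hypothetical names for
the per-thread worker and its handlers (not the names used in the patch):

#include <zlib.h>

/* Hypothetical per-thread compression worker. */
typedef struct {
    z_stream stream;
} CompressWorker;

static int compress_worker_setup(CompressWorker *w, int level)
{
    w->stream.zalloc = Z_NULL;
    w->stream.zfree = Z_NULL;
    w->stream.opaque = Z_NULL;
    /* deflateInit() allocates the internal state exactly once. */
    return deflateInit(&w->stream, level) == Z_OK ? 0 : -1;
}

static void compress_worker_cleanup(CompressWorker *w)
{
    /* deflateEnd() frees the state exactly once, at teardown. */
    deflateEnd(&w->stream);
}

Each worker would also keep its own preallocated output buffer, so nothing
on the per-page path has to touch the allocator.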
Yes. This is the perf data from our production environment.
After reverting this patch:
+ 57.88% kqemu [kernel.kallsyms] [k] queued_spin_lock_slowpath
+ 10.55% kqemu [kernel.kallsyms] [k] __lock_acquire
+ 4.83% kqemu [kernel.kallsyms] [k] flush_tlb_func_common
- 1.16% kqemu [kernel.kallsyms] [k] lock_acquire
- lock_acquire
- 15.68% _raw_spin_lock
+ 29.42% __schedule
+ 29.14% perf_event_context_sched_out
+ 23.60% tdp_page_fault
+ 10.54% do_anonymous_page
+ 2.07% kvm_mmu_notifier_invalidate_range_start
+ 1.83% zap_pte_range
+ 1.44% kvm_mmu_notifier_invalidate_range_end
After applying our work:
+ 51.92% kqemu [kernel.kallsyms] [k] queued_spin_lock_slowpath
+ 14.82% kqemu [kernel.kallsyms] [k] __lock_acquire
+ 1.47% kqemu [kernel.kallsyms] [k] mark_lock.clone.0
+ 1.46% kqemu [kernel.kallsyms] [k] native_sched_clock
+ 1.31% kqemu [kernel.kallsyms] [k] lock_acquire
+ 1.24% kqemu libc-2.12.so [.] __memset_sse2
- 14.82% kqemu [kernel.kallsyms] [k] __lock_acquire
- __lock_acquire
- 99.75% lock_acquire
- 18.38% _raw_spin_lock
+ 39.62% tdp_page_fault
+ 31.32% __schedule
+ 27.53% perf_event_context_sched_out
+ 0.58% hrtimer_interrupt
You can see that the TLB flushes and the mmu-lock contention are gone after this patch.
It would also be nice if we could split the patch into two (decode,
encode), but that's optional.
That's good to me, thank you, Peter.