On Thu, Mar 22, 2018 at 07:57:54PM +0800, Xiao Guangrong wrote: > > > On 03/21/2018 05:06 PM, Peter Xu wrote: > > On Tue, Mar 13, 2018 at 03:57:33PM +0800, guangrong.xiao@xxxxxxxxx wrote: > > > From: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxx> > > > > > > Current code uses compress2()/uncompress() to compress/decompress > > > memory, these two function manager memory allocation and release > > > internally, that causes huge memory is allocated and freed very > > > frequently > > > > > > More worse, frequently returning memory to kernel will flush TLBs > > > and trigger invalidation callbacks on mmu-notification which > > > interacts with KVM MMU, that dramatically reduce the performance > > > of VM > > > > > > So, we maintain the memory by ourselves and reuse it for each > > > compression and decompression > > > > > > Signed-off-by: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxx> > > > --- > > > migration/qemu-file.c | 34 ++++++++++-- > > > migration/qemu-file.h | 6 ++- > > > migration/ram.c | 142 +++++++++++++++++++++++++++++++++++++------------- > > > 3 files changed, 140 insertions(+), 42 deletions(-) > > > > > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c > > > index 2ab2bf362d..1ff33a1ffb 100644 > > > --- a/migration/qemu-file.c > > > +++ b/migration/qemu-file.c > > > @@ -658,6 +658,30 @@ uint64_t qemu_get_be64(QEMUFile *f) > > > return v; > > > } > > > +/* return the size after compression, or negative value on error */ > > > +static int qemu_compress_data(z_stream *stream, uint8_t *dest, size_t dest_len, > > > + const uint8_t *source, size_t source_len) > > > +{ > > > + int err; > > > + > > > + err = deflateReset(stream); > > > > I'm not familiar with zlib, but I saw this in manual: > > > > https://www.zlib.net/manual.html > > > > This function is equivalent to deflateEnd followed by deflateInit, > > but does not free and reallocate the internal compression state. The > > stream will leave the compression level and any other attributes that > > may have been set unchanged. > > > > I thought it was deflateInit() who is slow? Can we avoid the reset as > > deflateEnd() is worse as it frees memory to kernel which triggers > TLB flush and mmu-notifier. > > > long as we make sure to deflateInit() before doing anything else? > > Actually, deflateReset() is cheap... :) > > > > > Meanwhile, is there any performance number for this single patch? > > Since I thought the old code is calling compress2() which contains > > deflateInit() and deflateEnd() too, just like what current patch do? > > No, after the patch, we just call deflateInit() / deflateEnd() one > time (in _setup() handler and _cleanup handler). > > Yes. This is the perf data from our production, > after revert this patch: > + 57.88% kqemu [kernel.kallsyms] [k] queued_spin_lock_slowpath > + 10.55% kqemu [kernel.kallsyms] [k] __lock_acquire > + 4.83% kqemu [kernel.kallsyms] [k] flush_tlb_func_common > > - 1.16% kqemu [kernel.kallsyms] [k] lock_acquire ▒ > - lock_acquire ▒ > - 15.68% _raw_spin_lock ▒ > + 29.42% __schedule ▒ > + 29.14% perf_event_context_sched_out ▒ > + 23.60% tdp_page_fault ▒ > + 10.54% do_anonymous_page ▒ > + 2.07% kvm_mmu_notifier_invalidate_range_start ▒ > + 1.83% zap_pte_range ▒ > + 1.44% kvm_mmu_notifier_invalidate_range_end > > > apply our work: > + 51.92% kqemu [kernel.kallsyms] [k] queued_spin_lock_slowpath > + 14.82% kqemu [kernel.kallsyms] [k] __lock_acquire > + 1.47% kqemu [kernel.kallsyms] [k] mark_lock.clone.0 > + 1.46% kqemu [kernel.kallsyms] [k] native_sched_clock > + 1.31% kqemu [kernel.kallsyms] [k] lock_acquire > + 1.24% kqemu libc-2.12.so [.] __memset_sse2 > > - 14.82% kqemu [kernel.kallsyms] [k] __lock_acquire ▒ > - __lock_acquire ▒ > - 99.75% lock_acquire ▒ > - 18.38% _raw_spin_lock ▒ > + 39.62% tdp_page_fault ▒ > + 31.32% __schedule ▒ > + 27.53% perf_event_context_sched_out ▒ > + 0.58% hrtimer_interrupt > > > You can see the TLB flush and mmu-lock contention have gone after this patch. Yes. Obviously I misunderstood the documentation for deflateReset(). It's not really a combined "End+Init", a quick glance in zlib code shows that deflateInit() will do the mallocs, then call deflateReset() at last. So the buffers should be kept for reset(), as you explained. > > > > > It would be nice too if we can split the patch into two (decode, > > encode) if you want, but that's optional. > > That's good to me, thank you, Peter. Thanks for explaining. -- Peter Xu