On Tue, Oct 30, 2012 at 06:53:31PM +0000, Benoit Hudzia wrote:
> Hi Isaku,
>
> Are you going to be at the KVM Forum (I think you have a presentation
> there)? It would be nice if we could meet in order to see whether we
> can sync our efforts.

Yes, definitely.

> As you know, we have been developing an RDMA-based solution for
> post-copy migration, and we demonstrated the initial proof of concept
> in December 2011 (we published some findings in VHPC 2012 and are
> working with Petter Svard from Umea on a journal paper with a more
> detailed performance review).

Do you have any pointers to available papers/slides?
I can't find any at http://vhpc.org/

> While RDMA post-copy live migration is just a by-product of our
> long-term effort (I will present the project in my talk at the KVM
> Forum), we grabbed the opportunity to address problems we were facing
> with the live migration of enterprise workloads, namely how to migrate
> an in-memory database such as HANA under load.
>
> We quickly discovered that pre-copy (even with optimization) didn't
> work with such workloads. We also tried your code; however, the
> performance was far from satisfying with large VMs under load, due to
> the heavy cost of transferring memory between user space and the
> kernel multiple times (actually, it often failed).

If possible, I'd like to see the details.

> We then tested a pure RDMA solution we developed (we support both
> hardware and software RDMA), and it worked fine with all the workloads
> we tested (we migrated VMs with 20+ GB running SAP HANA under a
> workload similar to TPC-H). We hope to test with bigger configurations
> soon (1/2+ TB of memory).
>
> However, the integration of our code with the QEMU code base is not as
> advanced and polished as what you currently have, and I would like to
> know whether you would be interested in joining our effort or
> collaborating on merging our solutions. Or maybe allowing us to
> piggyback on your effort.

Yeah, we can unite our efforts for the upstream. In particular, a clean
interface that covers both the non-RDMA/RDMA and the qemu-internal/
qemu-kernel cases is important. At the moment I have no clear picture
of the requirements of RDMA postcopy or of your implementation;
"transparently integrating with the MMU at the OS level" sounds
interesting.

thanks,

> Would you be free to meet at any time next week (Tuesday to Friday)?
>
> PS: we will be open-sourcing our project by the end of November, and
> the post-copy is only a small part of the technology developed.
>
> Regards
> Benoit
>
> On 30 October 2012 08:32, Isaku Yamahata <yamahata@xxxxxxxxxxxxx> wrote:
> This is the v3 patch series of postcopy migration.
>
> The trees are available at
> git://github.com/yamahata/qemu.git qemu-postcopy-oct-30-2012
> git://github.com/yamahata/linux-umem.git linux-umem-oct-29-2012
>
> Major changes v2 -> v3:
> - implemented the pre+post optimization
> - automatic detection of postcopy by the incoming side
> - use threads on the destination instead of fork
> - use blocking I/O instead of a select + non-blocking I/O loop
> - less memory overhead
> - various improvements and code simplification
> - kernel module renamed umem -> uvmem to avoid a name conflict
>
> Patch organization:
> 1-2: trivial fixes
> 3-5: preparation for threading,
>      cherry-picked from the migration tree
> 6-18: refactoring of existing code and preparation
> 19-25: postcopy live migration itself (the essential part)
> 26-35: optimizations/heuristics for postcopy
>
> Usage
> =====
> You need to load the uvmem character device on the host before
> starting migration. Postcopy can be used with both the tcg and kvm
> accelerators. The implementation depends only on the Linux uvmem
> character device, but the driver-dependent code is split out into
> its own file.
> I tested only the host page size == guest page size case, but the
> implementation allows the host page size != guest page size case.
>
> The following options are added with this patch series.
> - incoming part
>   Use -incoming as usual. Postcopy is automatically detected.
>   example:
>   qemu -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
>
> - outgoing part
>   options for the migrate command:
>   migrate [-p [-n] [-m]] URI
>           [<precopy count> [<prefault forward> [<prefault backward>]]]
>
>   Newly added options/arguments:
>   -p: use postcopy migration
>   -n: disable background page transfer (for benchmarking/debugging)
>   -m: move background transfer of postcopy mode
>   <precopy count>: the number of precopy RAM scans before postcopy;
>                    default 0 (0 means no precopy)
>   <prefault forward>: the number of forward pages sent together with
>                       each on-demand page
>   <prefault backward>: the number of backward pages sent together with
>                        each on-demand page
>
>   examples:
>   migrate -p -n tcp:<dest ip address>:4444
>   migrate -p -n -m tcp:<dest ip address>:4444 42 42 0
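For concreteness, below is a minimal, hypothetical C sketch of the
incoming-side setup that the work flow section later in this mail
describes (open /dev/uvmem, UVMEM_INIT, then mapping guest RAM and the
shmem backing file). The layout of struct uvmem_init and the ioctl
number are assumptions made for illustration only; the real definitions
live in linux/uvmem.h from the series.

/*
 * Hypothetical sketch of the destination-side uvmem setup.  The struct
 * layout and ioctl number below are assumed for illustration; see
 * linux/uvmem.h in the series for the real interface.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

struct uvmem_init {              /* assumed layout */
    unsigned long size;          /* in: size of the guest RAM region */
    int shmem_fd;                /* out: fd of the backing shmem file */
};
#define UVMEM_INIT _IOWR('u', 0, struct uvmem_init)  /* assumed number */

int main(void)
{
    size_t ram_size = 1UL << 30;   /* e.g. 1 GiB of guest RAM */

    int uvmem_fd = open("/dev/uvmem", O_RDWR);
    if (uvmem_fd < 0) {
        perror("open /dev/uvmem");
        return 1;
    }

    struct uvmem_init init = { .size = ram_size };
    if (ioctl(uvmem_fd, UVMEM_INIT, &init) < 0) {
        perror("UVMEM_INIT");
        return 1;
    }
    /* We now hold the two file descriptors from the work flow:
     * the uvmem device and the shmem file. */

    /* qemu maps the uvmem device as guest RAM; a fault on a page that
     * has not been served yet blocks until a umem thread resolves it. */
    void *guest_ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, uvmem_fd, 0);
    if (guest_ram == MAP_FAILED) {
        perror("mmap uvmem");
        return 1;
    }

    /* The umem threads map the shmem file as the page cache and then
     * close their copy of the fd, as in the diagram below. */
    void *shmem = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, init.shmem_fd, 0);
    if (shmem == MAP_FAILED) {
        perror("mmap shmem");
        return 1;
    }
    close(init.shmem_fd);

    /* ... create the umem threads and hand over the migration socket ... */
    return 0;
}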
> TODO
> ====
> - benchmark/evaluation
> - improve/optimize
>   At the moment, at least what I'm aware of is:
>   - pre+post case
>     On the destination side, reading the dirty bitmap would cause long
>     latency; create a thread for that.
> - consider the FUSE/CUSE possibility
>
> basic postcopy work flow
> ========================
> qemu on the destination
>       |
>       V
> open(/dev/uvmem)
>       |
>       V
> UVMEM_INIT
>       |
>       V
> Here we have two file descriptors to
> the uvmem device and the shmem file
>       |
>       |                                 umem threads
>       |                                 on the destination
>       |
>       V   create pipe to communicate
> create threads------------------------------,
>       |                                     |
>       V                                mmap(shmem file)
> mmap(uvmem device) for guest RAM       close(shmem file)
>       |                                     |
>       V                                     |
> wait for ready from daemon <----pipe---send ready message
>       |                                     |
>       |                           Here the daemon takes over
> send ok------------pipe-------------> the ownership of the socket
>       |                                to the source
>       V                                     |
> enter postcopy stage                        |
> start guest execution                       |
>       |                                     |
>       V                                     V
> access guest RAM                       read() to get faulted pages
>       |                                     |
>       V                                     V
> page fault --------------------------->page offset is returned
> block                                       |
>                                             V
>                                        pull page from the source
>                                        write the page contents
>                                        to the shmem
>                                             |
>                                             V
> unblock <---------------------------write() to tell served pages
> the fault handler returns the page          |
> page fault is resolved                      |
>       |                                     V
>       |                                touch guest RAM pages
>       |                                     |
>       |                                     V
>       |                                release the cached page
>       |                                madvise(MADV_REMOVE)
>       |                                     |
>       |                                     |
>       |                                pages can also be sent
>       |                                in the background
>       |                                     |
>       |                                     V
>       |                                mark page as cached, so
>       |                                future page faults are
>       |                                avoided
>       |                                     |
>       |                                     V
>       |                                touch guest RAM pages
>       |                                     |
>       |                                     V
>       |                                release the cached page
>       |                                madvise(MADV_REMOVE)
>       |                                     |
>       V                                     V
>
> all the pages are pulled from the source
>
>       |                                     |
>       V                                     V
> migration completes                      exit()
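To illustrate the right-hand column of the diagram, here is a similarly
hedged sketch of one umem serving thread. Only the read()/write()
protocol on the uvmem fd and the madvise(MADV_REMOVE) release step are
taken from the work flow above; pull_page_from_source() and the
page-offset encoding are placeholders, not the series' actual code.

#define _GNU_SOURCE   /* for MADV_REMOVE on older glibc */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_FAULTS 32

/* Placeholder for the real transport: the series pulls the page from
 * the outgoing qemu over the migration socket. */
static void pull_page_from_source(uint64_t pgoff, void *buf, size_t page_size)
{
    (void)pgoff;
    memset(buf, 0, page_size);
}

void serve_faults(int uvmem_fd, char *shmem, size_t page_size)
{
    uint64_t pgoffs[MAX_FAULTS];

    for (;;) {
        /* Blocks until a vcpu faults on a not-yet-served page; the
         * driver reports the offsets of the faulted pages. */
        ssize_t n = read(uvmem_fd, pgoffs, sizeof(pgoffs));
        if (n <= 0) {
            break;   /* all pages pulled from the source: thread exits */
        }

        size_t nr = (size_t)n / sizeof(pgoffs[0]);
        for (size_t i = 0; i < nr; i++) {
            /* Pull the page from the source and write its contents
             * into the shmem file backing the guest RAM. */
            pull_page_from_source(pgoffs[i],
                                  shmem + pgoffs[i] * page_size, page_size);
        }

        /* Tell the driver which pages were served; this unblocks the
         * fault handler so the guest's page fault is resolved. */
        if (write(uvmem_fd, pgoffs, nr * sizeof(pgoffs[0])) < 0) {
            break;
        }

        /* Once the guest mapping has the page, the cached copy in the
         * shmem file can be released to cap memory overhead. */
        for (size_t i = 0; i < nr; i++) {
            madvise(shmem + pgoffs[i] * page_size, page_size, MADV_REMOVE);
        }
    }
}

Per the diagram, the same path also handles pages sent in the
background, marking them as cached so that future page faults on them
are avoided.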
>
> Isaku Yamahata (32):
>   migration.c: remove redundant line in migrate_init()
>   arch_init: DPRINTF format error and typo
>   osdep: add qemu_read_full() to read interrupt-safely
>   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip,
>     qemu_fflush
>   savevm/QEMUFile: consolidate QEMUFile functions a bit
>   savevm/QEMUFile: introduce qemu_fopen_fd
>   savevm/QEMUFile: add read/write QEMUFile on memory buffer
>   savevm, buffered_file: introduce method to drain buffer of buffered
>     file
>   arch_init: export RAM_SAVE_xxx flags for postcopy
>   arch_init/ram_save: introduce constant for ram save version = 4
>   arch_init: refactor ram_save_block() and export ram_save_block()
>   arch_init/ram_save_setup: factor out bitmap alloc/free
>   arch_init/ram_load: refactor ram_load
>   arch_init: factor out logic to find ram block with id string
>   migration: export migrate_fd_completed() and migrate_fd_cleanup()
>   uvmem.h: import Linux uvmem.h and teach update-linux-headers.sh
>   osdep: add QEMU_MADV_REMOVE and trivial fix
>   postcopy: introduce helper functions for postcopy
>   savevm: add new section that is used by postcopy
>   postcopy: implement incoming part of postcopy live migration
>   postcopy outgoing: add -p option to migrate command
>   postcopy: implement outgoing part of postcopy live migration
>   postcopy/outgoing: add -n option to disable background transfer
>   postcopy/outgoing: implement forward/backward prefault
>   arch_init: factor out setting last_block, last_offset
>   postcopy/outgoing: add movebg mode (-m) to migration command
>   arch_init: factor out ram_load
>   arch_init: export ram_save_iterate()
>   postcopy: pre+post optimization incoming side
>   arch_init: export migration_bitmap_sync and helper method to get
>     bitmap
>   postcopy/outgoing: introduce precopy_count parameter
>   postcopy: pre+post optimization outgoing side
>
> Paolo Bonzini (1):
>   split MRU ram list
>
> Umesh Deshpande (2):
>   add a version number to ram_list
>   protect the ramlist with a separate mutex
>
>  Makefile.target                 |    2 +
>  arch_init.c                     |  391 +++++---
>  arch_init.h                     |   24 +
>  buffered_file.c                 |   59 +-
>  buffered_file.h                 |    1 +
>  cpu-all.h                       |   16 +-
>  exec.c                          |   62 +-
>  hmp-commands.hx                 |   21 +-
>  hmp.c                           |   12 +-
>  linux-headers/linux/uvmem.h     |   41 +
>  migration-exec.c                |    8 +-
>  migration-fd.c                  |   23 +-
>  migration-postcopy.c            | 2019 +++++++++++++++++++++++++++++++++++++++
>  migration-tcp.c                 |   16 +-
>  migration-unix.c                |   36 +-
>  migration.c                     |   65 +-
>  migration.h                     |   42 +
>  osdep.c                         |   24 +
>  osdep.h                         |   13 +-
>  qapi-schema.json                |    6 +-
>  qemu-common.h                   |    2 +
>  qemu-file.h                     |   12 +-
>  qmp-commands.hx                 |    4 +-
>  savevm.c                        |  223 ++++-
>  scripts/update-linux-headers.sh |    2 +-
>  sysemu.h                        |    2 +-
>  umem.c                          |  291 ++++++
>  umem.h                          |   88 ++
>  vl.c                            |    5 +-
>  29 files changed, 3265 insertions(+), 245 deletions(-)
>  create mode 100644 linux-headers/linux/uvmem.h
>  create mode 100644 migration-postcopy.c
>  create mode 100644 umem.c
>  create mode 100644 umem.h
>
> --
> 1.7.10.4

> --
> "The production of too many useful things results in too many useless
> people"

--
yamahata