On 08/08/2011 06:24 AM, Isaku Yamahata wrote:
This mail is about "Yabusame: Postcopy Live Migration for Qemu/KVM", on which we'll give a talk at KVM Forum. The purpose of this mail is to let developers know about it in advance, so that we can get better feedback on its design/implementation approach early, before we start implementing it.

Background
==========
* What is postcopy live migration?
It is yet another live migration mechanism for Qemu/KVM, which implements the migration technique known as "postcopy" or "lazy" migration. Just after the "migrate" command is invoked, the execution host of a VM is instantaneously switched to a destination host.

The benefit is that total migration time is shorter, because each page is transferred only once. Precopy, on the other hand, may send the same pages again and again because they can be dirtied. The switching time from the source to the destination is several hundred milliseconds, which enables quick load balancing. For details, please refer to the papers.

We believe this is useful for others, so we'd like to merge this feature into the upstream qemu/kvm. The existing implementation that we have right now is very ad hoc because it was written for academic research. For the upstream merge, we're starting to re-design/implement it and we'd like to get feedback early. Although many improvements/optimizations are possible, we should implement/merge a simple and clean, but extensible, version first and then improve/optimize it later.

Postcopy live migration will be introduced as an optional feature. The existing precopy live migration remains the default behavior.

* Related links:
Project page
http://sites.google.com/site/grivonhome/quick-kvm-migration

Enabling Instantaneous Relocation of Virtual Machines with a Lightweight VMM Extension
(proof-of-concept, ad-hoc prototype; not a new design)
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf

Reactive consolidation of virtual machines enabled by postcopy live migration
(advantages for VM consolidation)
http://portal.acm.org/citation.cfm?id=1996125
http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf

Qemu wiki
http://wiki.qemu.org/Features/PostCopyLiveMigration

Design/Implementation
=====================
The basic idea of postcopy live migration is to use a sort of distributed shared memory between the migration source and destination.

The migration procedure looks like:
- start migration
  Stop the guest VM on the source and send the machine state, except guest RAM, to the destination.
- resume the guest VM on the destination without guest RAM contents
- hook guest access to pages and pull page contents from the source
  This continues until all the pages have been pulled to the destination.

The big picture is depicted at
http://wiki.qemu.org/File:Postcopy-livemigration.png
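The mail doesn't specify the wire format for the page pulls, so as a rough illustration only, here is a minimal sketch of what the source side of such a pull protocol could look like. The page_req/page_reply layouts and the guest_ram base pointer are invented for this sketch and are not existing qemu interfaces.

/* Hypothetical source-side loop serving postcopy page pulls over the
 * re-used migration connection.  Request/reply layouts are illustrative
 * placeholders, not part of qemu's existing migration protocol. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096

struct page_req   { uint64_t gfn; };                          /* guest page wanted  */
struct page_reply { uint64_t gfn; uint8_t data[PAGE_SIZE]; }; /* page contents back */

/* guest_ram is the source's copy of guest RAM, indexed by guest frame number. */
void serve_page_pulls(int migration_fd, const uint8_t *guest_ram)
{
    struct page_req req;
    struct page_reply reply;

    /* Real code would handle short reads/writes; omitted for brevity. */
    while (read(migration_fd, &req, sizeof(req)) == (ssize_t)sizeof(req)) {
        reply.gfn = req.gfn;
        memcpy(reply.data, guest_ram + req.gfn * PAGE_SIZE, PAGE_SIZE);
        write(migration_fd, &reply, sizeof(reply));
        /* Each page is requested at most once, so the source could also
         * release its copy here to save RAM. */
    }
}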
That's terrific (nice video also)! Orit and I had the exact same idea too (now we can't patent it..).

Advantages:
- No downtime due to memory copying.
- Efficient: reduces the required traffic since there is no need to re-send pages.
- Reduces overall RAM consumption of the source and destination, as opposed to current live migration (where both the source and the destination allocate the memory until the live migration completes). We can free copied memory once the destination guest has received it and save RAM.
- Increased parallelism for SMP guests: multiple virtual CPUs can handle their demand paging. Less time holding a global lock, less thread contention.
- Virtual machines are using more and more memory resources; for a virtual machine with a very large working set, doing live migration with reasonable downtime is impossible today.

Disadvantages:
- During the live migration the guest will run slower than in today's live migration. We need to remember that even today guests suffer a performance penalty on the source during the COW stage (memory copy).
- Failure of the source, the destination, or the network will cause us to lose the running virtual machine. Those failures are very rare. In case there is shared storage we can store a copy of the memory there, which can be recovered in case of such a failure.

Overall, it looks like a better approach for the vast majority of cases. Hope it will get merged into kvm and become the default way.
There are several design points.

- Who takes care of pulling page contents: an independent daemon vs a thread in qemu
  The daemon approach is preferable because an independent daemon would make it easy to debug the postcopy memory mechanism without qemu. If required, it wouldn't be difficult to convert the daemon into a thread in qemu.

- Connection between the source and the destination
  The connection for live migration can be re-used after sending the machine state.

- Transfer protocol
  The existing protocol that exists today can be extended.

- Hooking guest RAM access
  Introduce a character device to handle page faults. When a page fault occurs, it queues a page request up to the user-space daemon at the destination. The daemon pulls the page contents from the source and serves them into the character device. Then the page fault is resolved. (A minimal sketch of such a daemon loop follows below.)
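To make the character-device idea more concrete, here is a minimal sketch of what the destination-side daemon loop could look like. The device path "/dev/postcopy" and the record layouts are invented for illustration, since no such driver exists yet; the migration_fd is assumed to be the re-used live-migration connection.

/* Hypothetical destination-side daemon: take page-fault requests from the
 * (not yet written) postcopy character device, pull the page from the
 * source over the re-used migration connection, and push it back into
 * the device to resolve the fault. */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PAGE_SIZE 4096

struct fault_req  { uint64_t gfn; };                          /* from the driver */
struct page_req   { uint64_t gfn; };                          /* to the source   */
struct page_reply { uint64_t gfn; uint8_t data[PAGE_SIZE]; }; /* from the source */

void postcopy_daemon(int migration_fd)
{
    int dev_fd = open("/dev/postcopy", O_RDWR);   /* hypothetical device */
    struct fault_req fault;
    struct page_req req;
    struct page_reply reply;

    if (dev_fd < 0)
        return;

    /* Loop until the driver reports that every page has been pulled
     * (signalled here simply by EOF on the device). */
    while (read(dev_fd, &fault, sizeof(fault)) == (ssize_t)sizeof(fault)) {
        req.gfn = fault.gfn;
        write(migration_fd, &req, sizeof(req));    /* ask the source */
        read(migration_fd, &reply, sizeof(reply)); /* get the page   */
        write(dev_fd, &reply, sizeof(reply));      /* resolve fault  */
    }
    close(dev_fd);
}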
Isn't there a simpler way, using the madvise verb, to mark that the destination guest RAM will need paging?
Cheers, and looking forward to the presentation at the KVM Forum, Dor
* More on hooking guest RAM access
There are several candidates for the implementation. Our preference is the character device approach.

- Inserting hooks everywhere in qemu/kvm
  This is impractical.

- Backing store for guest RAM
  A block device or a file can be used to back guest RAM, and thus hook guest RAM access.
  pros
  - no new device driver is needed
  cons
  - future improvement would be difficult
  - some KVM host features (KSM, THP) wouldn't work

- Character device
  qemu mmap()s the dedicated character device, and page faults are then hooked in the driver. (A rough qemu-side sketch appears at the end of this mail.)
  pros
  - straightforward approach
  - future improvement would be easy
  cons
  - a new driver is needed
  - some KVM host features (KSM, THP) wouldn't work, because they check whether a given VMA is anonymous. This can be fixed.

- Swap device
  When creating the guest, it is set up as if all the guest RAM were swapped out to a dedicated swap device, which may be an nbd disk (or some kind of user-space block device, BUSE?). When the VM tries to access memory, swap-in is triggered and IO to the swap device is issued. The IO to swap is then routed to the daemon in user space with the nbd protocol (or BUSE, AOE, iSCSI...). The daemon pulls pages from the migration source and services the IO request.
  pros
  - after the page transfer is complete, everything is the same as the normal case
  - no new device driver is needed
  cons
  - future improvement would be difficult
  - administration: setting up nbd and the swap device

Thanks in advance
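For the preferred character-device option, the qemu side would mainly need to back guest RAM with an mmap() of that device instead of an anonymous mapping. A rough sketch under the same assumption of a hypothetical "/dev/postcopy" driver:

/* Hypothetical qemu-side allocation of guest RAM backed by the postcopy
 * character device.  Any access to a page the driver has not yet seen
 * would fault into the driver, which then queues a request for the
 * user-space daemon.  "/dev/postcopy" is an invented name. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

void *postcopy_ram_alloc(size_t ram_size)
{
    int fd = open("/dev/postcopy", O_RDWR);      /* hypothetical driver */
    if (fd < 0)
        return NULL;

    /* A file-backed (non-anonymous) mapping is what breaks KSM/THP today,
     * as noted above. */
    void *ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);                                   /* mapping keeps the device open */
    return ram == MAP_FAILED ? NULL : ram;
}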