Re: [Qemu-devel] [RFC] postcopy livemigration proposal

Anthony Liguori <anthony@xxxxxxxxxxxxx> · Mon, 08 Aug 2011 16:42:33 -0500

On 08/08/2011 04:40 AM, Yaniv Kaul wrote:
On 08/08/2011 12:20, Dor Laor wrote:
On 08/08/2011 06:24 AM, Isaku Yamahata wrote:

Design/Implementation
=====================
The basic idea of postcopy livemigration is to use a sort of distributed
shared memory between the migration source and destination.

The migration procedure looks like:
- start migration
  stop the guest VM on the source and send the machine states except
  guest RAM to the destination
- resume the guest VM on the destination without guest RAM contents
- Hook guest access to pages, and pull page contents from the source
  This continues until all the pages are pulled to the destination
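
The procedure above can be sketched as a toy model in Python. This is only
an illustration under stated assumptions, not QEMU code: the Source and
Destination classes and all their methods are hypothetical names, a
dictionary miss stands in for a trapped guest page access, and the
pull_remaining() background pass is an assumed way of making sure every
page eventually reaches the destination.

```python
# Toy model of postcopy migration: pages are pulled on demand from the source.
# All class and method names are hypothetical; none of this is a QEMU API.

class Source:
    def __init__(self, ram):
        self.ram = dict(ram)          # page number -> page contents

    def fetch(self, page):
        # Hand the page over and drop the source's copy of it.
        return self.ram.pop(page)


class Destination:
    def __init__(self, source, num_pages):
        self.source = source
        self.num_pages = num_pages
        self.ram = {}                 # pages that have arrived so far
        self.faults = 0               # number of on-demand pulls

    def read(self, page):
        # "Hook guest access": a missing page stands in for a page fault.
        if page not in self.ram:
            self.faults += 1
            self.ram[page] = self.source.fetch(page)
        return self.ram[page]

    def pull_remaining(self):
        # Assumed background pass: pull every page the guest has not touched.
        for page in range(self.num_pages):
            if page not in self.ram:
                self.ram[page] = self.source.fetch(page)

    def done(self):
        return len(self.ram) == self.num_pages


src = Source({n: f"data-{n}" for n in range(8)})
dst = Destination(src, num_pages=8)   # guest resumes with no RAM contents

assert dst.read(3) == "data-3"        # first touch pulls the page
assert dst.read(3) == "data-3"        # second touch is served locally
dst.pull_remaining()                  # continue until all pages are pulled
```

In this sketch each page crosses the wire exactly once, either because the
guest touched it or because the background pass reached it first.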

The big picture is depicted at
http://wiki.qemu.org/File:Postcopy-livemigration.png

That's terrific (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).

Advantages:
- No down time due to memory copying.
- Efficient: reduces the needed traffic, since there is no need to
  re-send pages.
- Reduces overall RAM consumption of the source and destination
  compared with current live migration (where both the source and the
  destination keep the memory allocated until the live migration
  completes). We can free copied memory once the destination guest
  has received it and save RAM.
- Increases parallelism for SMP guests: multiple virtual CPUs can
  handle their demand paging concurrently. Less time holding a
  global lock, less thread contention.
- Virtual machines are using more and more memory resources;
  for a virtual machine with a very large working set, live
  migration with reasonable down time is impossible today.
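
To make the traffic point concrete, here is a rough back-of-the-envelope
model in Python. It is a sketch, not a measurement: the function names are
hypothetical, and the assumption that a fixed fraction (dirty_ratio) of the
pages sent in one pre-copy round is dirtied again before the next round is
a deliberate simplification.

```python
def precopy_pages_sent(total, dirty_ratio, threshold):
    """Pages sent by an idealized iterative pre-copy migration.

    Assumption: after each round, dirty_ratio of the pages just sent have
    been dirtied again and must be re-sent. Once at most `threshold` pages
    remain, the guest is stopped and the remainder is sent in one go.
    """
    if not 0 <= dirty_ratio < 1:
        raise ValueError("dirty_ratio must be in [0, 1) for pre-copy to converge")
    sent = 0
    pending = total
    while pending > threshold:
        sent += pending
        pending = int(pending * dirty_ratio)
    return sent + pending


def postcopy_pages_sent(total):
    # Each page is pulled once, so nothing is ever re-sent.
    return total


print(precopy_pages_sent(1000, 0.5, 10))   # 1990
print(postcopy_pages_sent(1000))           # 1000
```

Under these assumed numbers pre-copy sends roughly twice as many pages as
postcopy; a higher dirty ratio widens the gap.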

Disadvantages:
- During the live migration the guest will run slower than in
  today's live migration. We need to remember that even today
  guests suffer a performance penalty on the source during the
  COW stage (memory copy).
- Failure of the source or destination or the network will cause
  us to lose the running virtual machine. Those failures are very
  rare.

I highly doubt that's acceptable in enterprise deployments.

I don't think you can make blanket statements about enterprise deployments.

A lot of enterprises are increasingly building fault tolerance into 
their applications expecting that the underlying hardware will fail. 
With cloud environments like EC2 that experience failure on a pretty 
regular basis, this is just becoming all the more common.

So I really don't view this as a critical issue.  It certainly would be 
if it were the only mechanism available but as long as we can also 
support pre-copy migration it would be fine.

Regards,

Anthony Liguori