Re: [PATCH 4/6] Added domainMigrateStartPostCopy in qemu driver

Cristian KLEIN <cristian.klein@xxxxxxxxx> · Thu, 25 Sep 2014 14:56:20 +0200

On 2014-09-25 14:20, Daniel P. Berrange wrote:
On Thu, Sep 25, 2014 at 02:12:24PM +0200, Jiri Denemark wrote:
On Thu, Sep 25, 2014 at 12:00:41 +0200, Cristian KLEIN wrote:
On 2014-09-24 15:06, Jiri Denemark wrote:
This mostly looks good in isolation but I think this is not going to
work. When post-copy is started, QEMU on the destination host will be
resumed (I'm not sure if that happens automatically or we have to do
it), which basically means we need to jump out of the Perform state and
call Finish and once it returns, we should keep waiting for the
post-copy migration to finish in Confirm state and kill the domain at
the end. It's certainly possible the steps we need to do are a bit
different since I'm not familiar with all the details about post-copy
migration, but I believe we need to do something. And just running a
single QEMU command is not enough to start post-copy in libvirt.

I'm not sure I follow. I tested the patch and it worked well: a VM that
was "unmigratable" with pre-copy was successfully migrated through
post-copy. Through the migration protocol, once we start post-copy on
the source qemu, the following will happen:

- source qemu suspends the VM and transfers CPU state;
- destination qemu resumes the VM.

Hmm, that's a bit unfortunate. I think we will need a way to tell QEMU
not to resume the CPU automatically. The process should flow as follows:

- libvirt sends migrate-start-postcopy command to QEMU
- QEMU suspends the VM and transfers CPU state
- QEMU tells us we can resume the destination
- libvirt tells the destination QEMU to resume the VM
- libvirt waits until migration is done
- libvirt kills the source QEMU

Perhaps we could tell the destination QEMU to resume the VM while the
source is still transferring CPU state, if QEMU allows that, to minimize
downtime.
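
To make the ordering above concrete, here is a rough sketch in Python. It
is only a model of the sequence: FakeQemu and postcopy_handover are
made-up names for illustration, not libvirt or QEMU APIs, and nothing here
talks to a real monitor.

```python
class FakeQemu:
    """Hypothetical stand-in for one QEMU process; not a libvirt or QEMU API."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def do(self, action):
        # Record the action instead of talking to a real monitor.
        self.log.append((self.name, action))


def postcopy_handover(source, dest):
    """Walk through the proposed hand-over in order."""
    source.do("receive migrate-start-postcopy")
    source.do("suspend VM and transfer CPU state")
    source.do("signal that the destination can be resumed")
    dest.do("resume VM on request from libvirt")
    dest.do("run until migration is done")
    source.do("killed by libvirt")


log = []
postcopy_handover(FakeQemu("source", log), FakeQemu("destination", log))
for who, action in log:
    print(f"{who}: {action}")
```

The point of the sketch is that the destination is resumed by an explicit
request from libvirt, not automatically by QEMU.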

Could you tell me why you think it's necessary to jump out of Perform
state? What is libvirt doing when calling Finish that the destination VM
requires to function properly?

The problem is Finish does more than just resuming the VM on the
destination. Before resuming the VM, libvirt needs to transfer locks on
resources from the source to the destination, it needs to enable
networking for the destination QEMU, etc. Without all this, the VM won't
be able to really work on the destination. Not to mention that if
something fails while the VM is already resumed on the destination, the
code in Perform phase would just abort the migration and resume the VM
on the source, which is wrong. We need to kill both ends since neither of
them has the complete state to be able to continue running the VM.
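
As a rough illustration of the two points above, the following Python
sketch models the order of work in Finish and the failure handling. The
names finish_steps and failure_action are hypothetical, the step list is
not exhaustive (Finish does more than this), and the pre-copy branch
assumes the usual abort path where the source still holds the full state.

```python
def finish_steps():
    """Order of work before the destination VM may run (not exhaustive)."""
    return ["transfer resource locks", "enable networking", "resume VM"]


def failure_action(destination_resumed):
    """Return what to do with each end when the migration fails."""
    if destination_resumed:
        # State is split across both hosts, so neither side can continue.
        return {"source": "kill", "destination": "kill"}
    # Assumption: plain pre-copy abort, where the source still has full state.
    return {"source": "resume", "destination": "kill"}


print(finish_steps())
print(failure_action(destination_resumed=True))
print(failure_action(destination_resumed=False))
```

Resuming the source is only safe in the second case; once the destination
has run, both ends have to go.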

BTW, it's going to work in simple cases, when there's no lock daemon in
use, only basic Linux bridge support is used, etc., which is why it
works just fine for you. But we need to account for all the non-simple
cases too.

Yes, having this work correctly with virtlockd and sanlock is really
mandatory for including the code.

Thanks for pointing this out. (I had a feeling I was missing something.) 
I'll study the libvirt code and see how this could be nicely integrated.

--
Cristian Klein, PhD
Post-doc @ Umeå Universitet
http://www8.cs.umu.se/~cklein
