On Thu, Sep 25, 2014 at 12:00:41 +0200, Cristian KLEIN wrote:
> On 2014-09-24 15:06, Jiri Denemark wrote:
> > This mostly looks good in isolation but I think this is not going to
> > work. When post-copy is started, QEMU on the destination host will be
> > resumed (I'm not sure if that happens automatically or we have to do
> > it), which basically means we need to jump out of the Perform state and
> > call Finish and once it returns, we should keep waiting for the
> > post-copy migration to finish in Confirm state and kill the domain at
> > the end. It's certainly possible the steps we need to do are a bit
> > different since I'm not familiar with all the details about post-copy
> > migration, but I believe we need to do something. And just running a
> > single QEMU command is not enough to start post-copy in libvirt.
>
> I'm not sure to follow. I tested the patch and it worked well: A VM that
> was "unmigratable" with pre-copy was successfully migrated through
> post-copy. Through the migration protocol, once we start post-copy on
> the source qemu, the following will happen:
>
> - source qemu suspends VM and transfer CPU state;
> - destination qemu resumes the VM.

Hmm, that's a bit unfortunate. I think we will need a way to tell QEMU
not to resume the CPU automatically. The process should flow as follows:

- libvirt sends migrate-start-postcopy command to QEMU
- QEMU suspends the VM and transfers CPU state
- QEMU tells us we can resume the destination
- libvirt tells the destination QEMU to resume the VM
- libvirt waits until migration is done
- libvirt kills the source QEMU

Perhaps, we could tell the destination QEMU to resume the VM while the
source is transferring CPU state if that's allowed by QEMU to minimize
downtime.

> Could you tell me why you think it's necessary to jump out of Perform
> state? What is libvirt doing when calling Finish that the destination VM
> requires to function properly?

The problem is Finish does more than just resuming the VM on the
destination. Before resuming the VM, libvirt needs to transfer locks on
resources from the source to the destination, it needs to enable
networking for the destination QEMU, etc. Without all this, the VM won't
be able to really work on the destination. Not to mention that if
something fails while the VM is already resumed on the destination, the
code in Perform phase would just abort the migration and resume the VM
on the source, which is wrong. We need to kill both ends since neither
of them has the complete state to be able to continue running the VM.

BTW, it's going to work in simple cases, when there's no lock daemon in
use, only basic Linux bridge support is used, etc., which is why it
works just fine for you. But we need to account for all the non-simple
cases too.

Jirka

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list
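
To make the ordering and the failure handling discussed in this mail
concrete, here is a rough sketch written as a toy simulation. It is not
libvirt code and calls no libvirt or QEMU API: the function name, the
step names and the fail_at knob are all hypothetical and exist only for
illustration. It assumes the destination is resumed only after the lock
transfer and network setup that Finish is responsible for, and that a
failure after the destination has been resumed leaves no end able to
continue.

```python
from enum import Enum, auto


class Phase(Enum):
    PRE_COPY = auto()   # source still holds the complete VM state
    POST_COPY = auto()  # destination resumed; state is split across hosts


class MigrationFailed(Exception):
    pass


def migrate_postcopy(fail_at=None):
    """Simulate the proposed flow and return (log, state).

    fail_at (hypothetical knob) names the step that should fail.
    """
    log = []
    state = {"source": "running", "dest": "paused"}
    phase = Phase.PRE_COPY

    def step(name):
        if name == fail_at:
            raise MigrationFailed(name)
        log.append(name)

    try:
        step("send_migrate_start_postcopy")
        step("source_suspends_and_sends_cpu_state")
        state["source"] = "paused"
        step("qemu_says_destination_may_resume")
        # Work attributed to Finish in this mail, done before resuming.
        step("transfer_locks")
        step("enable_destination_networking")
        step("resume_destination")
        state["dest"] = "running"
        phase = Phase.POST_COPY
        step("wait_for_completion")
        step("kill_source")
        state["source"] = "killed"
    except MigrationFailed as exc:
        log.append("failed:" + str(exc))
        if phase is Phase.PRE_COPY:
            # Source still has everything: abort and resume it there.
            state["source"] = "running"
            state["dest"] = "killed"
        else:
            # Neither end has the complete state any more.
            state["source"] = "killed"
            state["dest"] = "killed"
    return log, state


if __name__ == "__main__":
    for failing_step in (None, "transfer_locks", "wait_for_completion"):
        _, final = migrate_postcopy(fail_at=failing_step)
        print(failing_step, final)
```

The point of the sketch is only the branch in the error path: the same
failure is recoverable before the destination is resumed and fatal for
both ends afterwards.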