* Paolo Bonzini (pbonzini@xxxxxxxxxx) wrote:
> On 28/03/2014 10:18, Gonglei (Arei) wrote:
> >>> > Can you please give more details at how you are triggering the problem
> >>> > with libvirt? I think Paolo is probably right - the bug is more likely
> >>> > to be in libvirt not expecting the race and not recovering correctly
> >>> > when the race occurs, than it is to be in changing qemu's state algorithm.
> >>> >
> >>When the migration progress reaches 100%, and the migration status becomes
> >>MIG_STATE_COMPLETED in Qemu.
> >>It will take some time which from MIG_STATE_COMPLETED to the migration
> >>thread resources are recovered.
> >>If we cancel the migration at this moment, the migrate_fd_cancel function will
> >>break directly without reporting
> >>error code. Then, libvirt considers the cancle operation a success, contrary
> >>facts.
>
> There is no error, once migration is completed you can still
> shutdown on the destination and continue on the source. Libvirt
> should either:

(I've rewritten my reply below about 4 times - swinging between different
answers, this stuff really isn't obvious, and certainly not documented)

I think I agree that it's not an error; but I think migrate_fd_cancel knows
what the outcome will be.

If it was MIG_STATE_ERROR on entry to migrate_fd_cancel, then yes it could
tell you that the cancel failed because you were already in error.

If it was MIG_STATE_COMPLETED on entry to migrate_fd_cancel, then yes it could
tell you that the cancel failed because you already finished.

If it was MIG_STATE_ACTIVE on entry to migrate_fd_cancel - it will go to
MIG_STATE_CANCELLING and I believe eventually to MIG_STATE_CANCELLED; I don't
believe it can get to MIG_STATE_ERROR from that point, since all of the places
in the migrate_thread that transition to error do explicit ACTIVE->ERROR
transitions. I don't believe it can get to MIG_STATE_COMPLETED for the
same reason.

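To make that case analysis concrete, here is a minimal sketch of it. It is
Python purely for illustration and is not QEMU code: the names `MigState`,
`predict_cancel_outcome` and `cancel_took_effect` are hypothetical, and the
ACTIVE/CANCELLING -> CANCELLED rule encodes the belief stated above rather
than verified behaviour.

```python
from enum import Enum


class MigState(Enum):
    # Only the states named in this thread; the real state list may be longer.
    ACTIVE = "active"
    CANCELLING = "cancelling"
    CANCELLED = "cancelled"
    COMPLETED = "completed"
    ERROR = "error"


# Terminal states: once one is reached, a cancel request cannot change it.
TERMINAL = {MigState.CANCELLED, MigState.COMPLETED, MigState.ERROR}


def predict_cancel_outcome(state_on_entry: MigState) -> MigState:
    """Predict the state a migration ends in, given the state seen on
    entry to the (hypothetical) cancel handler."""
    if state_on_entry in TERMINAL:
        # Already failed, finished or cancelled: the cancel is a no-op.
        return state_on_entry
    # ACTIVE -> CANCELLING -> CANCELLED. Assumption from the text above:
    # every transition to ERROR or COMPLETED is an explicit transition out
    # of ACTIVE, so it can no longer fire once the state is CANCELLING.
    return MigState.CANCELLED


def cancel_took_effect(state_on_entry: MigState) -> bool:
    """True if the migration will end up cancelled rather than
    completed or failed."""
    return predict_cancel_outcome(state_on_entry) is MigState.CANCELLED
```

With something of this shape, a cancel issued when the state is already
`COMPLETED` would be reported back as "not cancelled, already completed"
instead of returning silently.
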
So migrate_fd_cancel knows that the eventual outcome will be Error or
Cancelled or Completed, even if the state isn't there yet, and it could
reply to say that.

> 1) poll with "query-migrate" after migrate_cancel, and report an
> error there if it's the desired semantics;
>
> 2) toggle a "cancelled" flag before asking QEMU to cancel migration,
> check it in the migration functions after "query-migrate" reported
> completion; if it is true, do not resume on the destination.

I think you're right that you have to poll with query-migrate until you
get one of cancelled/failed/completed.

However it's a bit odd; prior to the introduction of 'CANCELLING', the
state that you would get by a query-migrate after migrate_fd_cancel
returned would in principle be the state you ended up in - i.e.
cancelled/failed/completed. With cancelling added, query-migrate might
lie to you and say 'active' (when it's really hiding the fact that
cancelling is happening).

So while 'cancelling' apparently didn't alter the API, it did, in that
query-migrate after a cancel can now return active where it couldn't before.

> Another reason for doing it in libvirt is that the serialization
> between cancellation and completion of migration ultimately is
> controlled by libvirt's lock. Doing this in QEMU makes it harder to
> reason about concurrency.

I think you have to be careful when you talk about 'cancellation and
completion of migration'. In that paragraph I don't think you mean the
same thing as MIG_STATE_CANCELLED and MIG_STATE_COMPLETED; I think you're
talking about the larger-scale idea of completion, once you take into
account that the VM might be paused after qemu has gone to
MIG_STATE_COMPLETED, and that libvirt might still decide it wants to give
up and use the version on the source that's still paused.

Dave
--
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list