Re: [PATCH] failover: allow to pause the VM during the migration

On 30/09/2021 22:17, Laine Stump wrote:
On 9/30/21 1:09 PM, Laurent Vivier wrote:
If we want to save a snapshot of a VM to a file, we used to follow
these steps:

1- stop the VM:
    (qemu) stop

2- migrate the VM to a file:
    (qemu) migrate "exec:cat > snapshot"

3- resume the VM:
    (qemu) cont

After that we can restore the snapshot with:
   qemu-system-x86_64 ... -incoming "exec:cat snapshot"
   (qemu) cont
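
For management code that talks QMP rather than HMP, the same three steps can be sketched as below. This is only an illustration: the helper name snapshot_commands is hypothetical, and the sketch assumes the standard QMP commands "stop", "migrate" (with a "uri" argument) and "cont". Unlike the HMP "migrate" command used above, QMP "migrate" returns immediately, so a real client would have to poll "query-migrate" until the status is "completed" before sending "cont".

```python
import json


def snapshot_commands(path):
    """Build the QMP commands for: stop, migrate to a file, cont.

    Hypothetical helper for illustration only. It does not talk to QEMU;
    it just returns the messages a client would send, in order.
    """
    return [
        # 1- stop the VM
        {"execute": "stop"},
        # 2- migrate the VM to a file through the exec: transport
        {"execute": "migrate", "arguments": {"uri": f"exec:cat > {path}"}},
        # (a real client polls query-migrate here until "completed")
        # 3- resume the VM
        {"execute": "cont"},
    ]


if __name__ == "__main__":
    for cmd in snapshot_commands("snapshot"):
        print(json.dumps(cmd))
```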

These are the basics of what libvirt does for a snapshot, and steps 1+2 are what it does for a "managedsave" (where it saves the snapshot to disk and then terminates the qemu process, for later re-animation).

In those cases, it seems like this new parameter could work for us: instead of explicitly pausing the guest prior to migrating it to disk, we would set this new parameter to on, then directly migrate-to-disk (relying on qemu to do the pause). Care will need to be taken to ensure that error recovery behaves the same, though.

In case of error, the VM is restarted, as is done for a standard migration. I can change that if you need.

Another point: the VM state sent in the migration stream is "paused", which means the machine needs to be resumed after the stream is loaded (from the file, or on the destination in the case of a real migration). This can also be changed to "running", so that the machine is resumed automatically once the file is loaded (or the real migration completes).

There are a couple of cases when libvirt apparently *doesn't* pause the guest during the migrate-to-disk, both having to do with saving a coredump of the guest. Since I really have no idea of how common/important that is (or even if my assessment of the code is correct), I'm Cc'ing this patch to libvir-list to make sure it catches the attention of someone who knows the answers and implications.

It's an interesting point I need to test and think about: in the case of a coredump, I guess the machine has crashed and does not respond to the unplug request, so the failover unplug cannot be done. For the moment the migration will hang until it is canceled. It can be annoying if we want to debug the cause of the crash...
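
As a rough illustration of how a management layer could avoid waiting forever in that situation, here is a small decision helper. It is a sketch under assumptions, not part of the patch: the function name next_action and the timeout policy are hypothetical, while "wait-unplug", "completed", "failed" and "cancelled" are status values reported by "query-migrate", and "migrate_cancel" is the existing QMP command for aborting a migration.

```python
def next_action(status, elapsed_s, timeout_s):
    """Decide what a client polling query-migrate should do next.

    Hypothetical policy: if the migration is still waiting for the guest
    to acknowledge the failover unplug after timeout_s seconds, give up
    and cancel it. Returns "done", "migrate_cancel" or "wait".
    """
    # Terminal states: nothing more to do.
    if status in ("completed", "failed", "cancelled"):
        return "done"
    # A crashed guest never answers the unplug request, so the status
    # stays at "wait-unplug"; cancel once the deadline has passed.
    if status == "wait-unplug" and elapsed_s >= timeout_s:
        return "migrate_cancel"
    # Otherwise keep polling.
    return "wait"


if __name__ == "__main__":
    print(next_action("wait-unplug", 45.0, 30.0))
```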


But when failover is configured, it doesn't work anymore.

As failover needs to ask the guest OS to unplug the card,
the machine cannot be paused.

This patch introduces a new migration parameter, "pause-vm", that
asks the migration to pause the VM during the migration startup
phase, after the card is unplugged.

Once the migration is done, we only need to resume the VM with
"cont" and the card is plugged back:

1- set the parameter:
    (qemu) migrate_set_parameter pause-vm on

2- migrate the VM to a file:
    (qemu) migrate "exec:cat > snapshot"

    The primary failover card (VFIO) is unplugged and the VM is paused.

3- resume the VM:
    (qemu) cont

    The VM restarts and the primary failover card is plugged back

The VM state sent in the migration stream is "paused", which means
that when the snapshot is loaded, or if the stream is sent to a
destination QEMU, the VM needs to be resumed manually.
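
For reference, the failover sequence above might look as follows over QMP. This is a sketch only: the helper name failover_snapshot_commands is hypothetical, and it assumes that "pause-vm" is exposed through "migrate-set-parameters" as a boolean, mirroring the HMP "migrate_set_parameter pause-vm on" shown above. QMP "migrate" returns immediately, so a real client would poll "query-migrate" before sending "cont".

```python
import json


def failover_snapshot_commands(path):
    """Build the QMP commands for a snapshot with failover configured.

    Hypothetical helper for illustration only. Assumes the proposed
    "pause-vm" parameter is a boolean migration parameter in QMP.
    """
    return [
        # 1- ask the migration to pause the VM once the card is unplugged
        {"execute": "migrate-set-parameters",
         "arguments": {"pause-vm": True}},
        # 2- migrate to a file; no explicit "stop" is sent beforehand
        {"execute": "migrate", "arguments": {"uri": f"exec:cat > {path}"}},
        # (a real client polls query-migrate here until "completed")
        # 3- resume the VM; the primary failover card is plugged back
        {"execute": "cont"},
    ]


if __name__ == "__main__":
    for cmd in failover_snapshot_commands("snapshot"):
        print(json.dumps(cmd))
```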

Signed-off-by: Laurent Vivier <lvivier@xxxxxxxxxx>
---
  qapi/migration.json            | 20 +++++++++++++++---
  include/hw/virtio/virtio-net.h |  1 +
  hw/net/virtio-net.c            | 33 ++++++++++++++++++++++++++++++
  migration/migration.c          | 37 +++++++++++++++++++++++++++++++++-
  monitor/hmp-cmds.c             |  8 ++++++++
  5 files changed, 95 insertions(+), 4 deletions(-)

...

Thanks,
Laurent



