On Thu, Sep 30, 2021 at 16:17:44 -0400, Laine Stump wrote:
> On 9/30/21 1:09 PM, Laurent Vivier wrote:
> > If we want to save a snapshot of a VM to a file, we used to follow the
> > following steps:
> >
> > 1- stop the VM:
> >    (qemu) stop
> >
> > 2- migrate the VM to a file:
> >    (qemu) migrate "exec:cat > snapshot"
> >
> > 3- resume the VM:
> >    (qemu) cont
> >
> > After that we can restore the snapshot with:
> >   qemu-system-x86_64 ... -incoming "exec:cat snapshot"
> >   (qemu) cont
>
> This is the basics of what libvirt does for a snapshot, and steps 1+2 are
> what it does for a "managedsave" (where it saves the snapshot to disk and
> then terminates the qemu process, for later re-animation).
>
> In those cases, it seems like this new parameter could work for us - instead
> of explicitly pausing the guest prior to migrating it to disk, we would set
> this new parameter to on, then directly migrate-to-disk (relying on qemu to
> do the pause). Care will need to be taken to assure that error recovery
> behaves the same though.

Yup, see below ...

> There are a couple of cases when libvirt apparently *doesn't* pause the
> guest during the migrate-to-disk, both having to do with saving a coredump
> of the guest. Since I really have no idea of how common/important that is

In most cases when doing a coredump the guest is paused because of an
emulation/guest error.

One example where the VM is not paused is a 'live' snapshot. It wastes disk
space and is not commonly used though.

Where it might become interesting is with the 'background-snapshot'
migration flag. Ideally failover will be fixed to properly work with that
one too. In such a case we don't want to pause the VM (but we have to: AFAIK,
the background-snapshot migration can't be done as part of a 'transaction'
yet, so we need to pause the VM to kick off the migration (memory snapshot)
and then snapshot the disks).

> (or even if my assessment of the code is correct), I'm Cc'ing this patch to
> libvir-list to make sure it catches the attention of someone who knows the
> answers and implications.

Well, cc-ing relevant patches to libvirt is always good, especially if we'll
need to adapt the code to support the new feature.

> > But when failover is configured, it doesn't work anymore.
> >
> > As the failover needs to ask the guest OS to unplug the card
> > the machine cannot be paused.
> >
> > This patch introduces a new migration parameter, "pause-vm", that
> > asks the migration to pause the VM during the migration startup
> > phase after the card is unplugged.

Is there a time limit to this? If guest interaction is required, it might
take unbounded time. In the case of snapshots, the user expects the state
capture to happen "reasonably" immediately after issuing the command.

If we introduce a possibly unbounded wait time, it will need a re-imagining
of the snapshot workflow and the feature will need to be opt-in.

> >
> > Once the migration is done, we only need to resume the VM with
> > "cont" and the card is plugged back:
> >
> > 1- set the parameter:
> >    (qemu) migrate_set_parameter pause-vm on
> >
> > 2- migrate the VM to a file:
> >    (qemu) migrate "exec:cat > snapshot"
> >
> >    The primary failover card (VFIO) is unplugged and the VM is paused.
> >
> > 3- resume the VM:
> >    (qemu) cont
> >
> >    The VM restarts and the primary failover card is plugged back
> >
> > The VM state sent in the migration stream is "paused", it means
> > when the snapshot is loaded or if the stream is sent to a destination
> > QEMU, the VM needs to be resumed manually.

This is not a problem; libvirt is already dealing with this internally
anyway.
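
For reference, here is a minimal Python sketch of the two command sequences
discussed above, written as QMP messages rather than HMP monitor lines. The
helper name snapshot_commands is hypothetical, and "pause-vm" is only the
parameter proposed in this patch, so its QMP spelling and boolean type are
assumptions. The sketch only builds the messages; a real client would have to
wait for the migration to complete before sending "cont".

```python
import json


def snapshot_commands(snapshot_path, failover=False):
    """Build the QMP messages for saving a VM snapshot to a file.

    Hypothetical helper for illustration only. It does not talk to QEMU.
    """
    uri = f"exec:cat > {snapshot_path}"
    if failover:
        # Assumption: the proposed "pause-vm" parameter is exposed through
        # migrate-set-parameters as a boolean. QEMU pauses the VM itself
        # once the primary failover card has been unplugged.
        commands = [
            {"execute": "migrate-set-parameters",
             "arguments": {"pause-vm": True}},
        ]
    else:
        # Classic workflow: pause the guest explicitly before migrating.
        commands = [{"execute": "stop"}]
    commands.append({"execute": "migrate", "arguments": {"uri": uri}})
    # Resume only after the migration has finished (waiting is not shown).
    commands.append({"execute": "cont"})
    return commands


def to_wire(commands):
    """Serialize each message as one JSON line, as sent on a QMP socket."""
    return [json.dumps(command) for command in commands]
```

Calling snapshot_commands("snapshot") gives the stop/migrate/cont sequence,
while snapshot_commands("snapshot", failover=True) replaces the explicit
"stop" with the parameter setting, which is the change a management
application would have to make.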