Re: Snapshot operation aborted and volume usage

Liran Rotenberg <lrotenbe@xxxxxxxxxx> · Thu, 11 Mar 2021 17:53:11 +0200

On Thu, Mar 11, 2021 at 3:24 PM Peter Krempa <pkrempa@xxxxxxxxxx> wrote:
>
> On Thu, Mar 11, 2021 at 10:51:13 +0200, Liran Rotenberg wrote:
> > We recently had this bug[1]. It raised the question of how to handle the
> > error code returned by virDomainSnapshotCreateXML; we encountered
> > VIR_ERR_OPERATION_ABORTED (78).
>
> VIR_ERR_OPERATION_ABORTED is an error code which is emitted by the
> migration code only. That means that the error comes from the failure to
> take a memory image/snapshot of the VM.
>
> A quick skim through the bug report suggests a timeout, so your code
> probably aborted the snapshot when it was taking too long.
>
> > Apparently, the new volume is in use. Are there cases where this will
> > happen and the new volume won't appear in the volumes chain? Can we detect
> > / know when?
>
> In the vast majority of cases if virDomainSnapshotCreateXML returns
> failure the new disk volumes are NOT used at that point.
>
> Libvirt tries very hard to ensure that everything is atomic. The memory
> snapshot is taken before installing volumes into the backing chain, so
> if that one fails we don't even attempt to do anything with the disks.
>
> There are three extremely unlikely cases where the snapshot API returns
> failure and new images were already installed into the backing chain:
>
> 1) resuming of the VM failed after snapshot
> 2) thawing (domfsthaw) of filesystems has failed
>     (easily avoided by not using the _QUIESCE flag, but freezing
>     manually)
> 3) saving of the internal VM state XML failed
>
> Any error except those above can happen only if the images weren't
> installed or the VM died while installing the images.
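
To make sure I follow case 2, here is a rough sketch of freezing manually
instead of passing the _QUIESCE flag. It assumes `dom` behaves like a
libvirt-python virDomain (fsFreeze, snapshotCreateXML, fsThaw); the function
name and the returned tuple are my own invention, not anything VDSM or
libvirt provides.

```python
def snapshot_with_manual_freeze(dom, snapshot_xml, flags=0):
    """Freeze guest filesystems, take the snapshot, then always thaw.

    Assumption: `dom` behaves like a libvirt-python virDomain, and `flags`
    does not include the _QUIESCE flag because freezing is done here.
    Returns (snapshot, thaw_error); thaw_error is None if thawing worked.
    """
    dom.fsFreeze()
    thaw_error = None
    try:
        snapshot = dom.snapshotCreateXML(snapshot_xml, flags)
    finally:
        # Thaw whether or not the snapshot succeeded. A thaw failure is
        # recorded instead of raised, so it is never mistaken for a
        # snapshot failure.
        try:
            dom.fsThaw()
        except Exception as exc:
            thaw_error = exc
    return snapshot, thaw_error
```

With this shape, an exception from snapshotCreateXML still propagates to the
caller, while a thaw failure after a successful snapshot comes back as the
second element of the tuple and can be handled separately.
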
>
> In addition, if resuming the CPUs after the snapshot fails, the CPUs
> didn't run, so the guest couldn't have written anything to the image.
> Since the snapshot is supposed to flush qemu caches, if you destroy the
> VM without running the vCPUs it's safe to discard the overlays, as the
> guest didn't write anything into them yet.
>
> > Thinking aloud, if we can detect such cases we can prevent rolling back by
> > reporting it back from VDSM to ovirt. Or, if they can't be detected, we can
> > err on the safe side and prevent the rollback as well, in order to avoid
> > data corruption.
>
> In general, except for the case when saving of the guest XML has failed,
> the new disk images will not be used by the VM so it's safe to delete
> them.
>
> > Currently, in ovirt, if the job is aborted, we will look into the chain to
> > decide whether to rollback or not.
>
> This is okay; we update the XML only if qemu successfully installed the
> overlays.
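
For reference, the chain check I have in mind is roughly the following
sketch. It assumes the XML string is what dom.XMLDesc() returns, and that
disks are file or block based (network sources are not handled). The helper
names are hypothetical and this is not the actual VDSM code.

```python
import xml.etree.ElementTree as ET


def chain_paths(disk):
    """Return the source paths of one <disk>, active layer first."""
    paths = []
    node = disk
    while node is not None:
        source = node.find("source")
        if source is not None:
            # File-backed disks use 'file', block devices use 'dev'.
            paths.append(source.get("file") or source.get("dev"))
        # The chain ends at an empty <backingStore/> or a missing one.
        node = node.find("backingStore")
    return paths


def overlay_in_chain(domain_xml, overlay_path):
    """True if overlay_path appears in any disk's backing chain."""
    root = ET.fromstring(domain_xml)
    return any(
        overlay_path in chain_paths(disk)
        for disk in root.findall("./devices/disk")
    )
```

If overlay_in_chain() is false for the new volume after the failure, we
treat the volume as unused and roll back; otherwise we keep it.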

Thanks Peter!