Re: Authoritative info on backup-begin versus snapshots/other state capture

Peter Krempa <pkrempa@xxxxxxxxxx> · Fri, 17 Jan 2025 12:28:30 +0100

On Fri, Jan 17, 2025 at 00:29:55 -0000, camccuk--- via Users wrote:
> This is really helpful, thanks.
> 
> > The disk quiescing is not part of the backup operation and needs to be
> > done manually via 'virsh domfsfreeze' if required. The original
> 
> I assume quiescing *would* be necessary for workloads like databases

Normally quiescing blocks writes to the device and flushes the
filesystem caches inside the guest OS.

In addition, the guest agent should allow you to register scripts which
are executed before the FS is quiesced. These can e.g. flush a
database's in-memory state to disk, so that the application data is
consistent as well.
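
A rough sketch of such a hook script, assuming the agent's fsfreeze hook
mechanism runs it with the single argument 'freeze' before quiescing and
'thaw' afterwards. Where the script has to be installed differs between
distributions, and DB_FLUSH_CMD / DB_RESUME_CMD are hypothetical
placeholders for whatever your database provides:

```shell
#!/bin/sh
# Sketch of a guest-side fsfreeze hook (assumption: the agent calls it
# with one argument, "freeze" or "thaw").
# DB_FLUSH_CMD and DB_RESUME_CMD are hypothetical placeholders; they
# default to a no-op.
DB_FLUSH_CMD="${DB_FLUSH_CMD:-true}"
DB_RESUME_CMD="${DB_RESUME_CMD:-true}"

fsfreeze_hook() {
    case "$1" in
        freeze) $DB_FLUSH_CMD ;;    # flush application state before the FS freezes
        thaw)   $DB_RESUME_CMD ;;   # let the application carry on after the thaw
        *)      echo "usage: fsfreeze_hook freeze|thaw" >&2; return 1 ;;
    esac
}

# Dispatch only when called with an argument, so the file can be sourced too.
if [ $# -gt 0 ]; then
    fsfreeze_hook "$1"
fi
```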

> and if we can live with a crash-consistent backup then we can bypass

The application consistency mentioned above is especially important when
using the backup API or disk-only snapshots, because restoring from the
saved disk state is equivalent to pulling the power plug out of a real
machine.

> this, but if I was to include this, the sequence would be:
> 
> virsh domfsfreeze <domain-name>
> virsh backup-begin <domain-name>
> virsh domfsthaw <domain-name>
> 
> Again, I assume the qemu-agent would need to be running on the guest to allow freeze/thaw.

Yes, the guest agent is needed, as this operation actually happens
inside the guest OS.
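
A minimal sketch of that sequence as a shell function. It thaws the
guest even when 'backup-begin' fails, so the filesystems are not left
frozen. The function name and the VIRSH variable are illustrative only;
VIRSH="echo virsh" gives a dry run:

```shell
#!/bin/sh
# VIRSH can be overridden, e.g. VIRSH="echo virsh" for a dry run.
VIRSH="${VIRSH:-virsh}"

# quiesced_backup DOMAIN
quiesced_backup() {
    dom="$1"
    # Needs the guest agent running inside the guest.
    $VIRSH domfsfreeze "$dom" || return 1
    # Only starts the backup job; the copy itself continues afterwards.
    $VIRSH backup-begin "$dom"
    rc=$?
    # Always thaw, whether or not the backup could be started.
    $VIRSH domfsthaw "$dom" || return 1
    return $rc
}
```

The freeze only has to cover the moment the backup is started, since the
backup reflects the disk state at that point.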

> 
> I was about to ask how backup-begin is different from creating a disk-only, no-metadata snapshot but I think it is equivalent - the advantage is that we don't need to deal with merging the overlay file and pivoting afterwards, is that right?

So the basic 'push' mode of doing a full backup is indeed semantically
equivalent to creating a disk-only, no-metadata snapshot, then copying
out the data to a standalone image and then merging the overlay back.
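
For comparison, the manual variant could be sketched roughly as below.
The function name, the VIRSH/COPY variables and all paths are
assumptions for illustration; the overlay file is left behind after the
pivot and has to be removed by hand:

```shell
#!/bin/sh
VIRSH="${VIRSH:-virsh}"
COPY="${COPY:-cp}"

# manual_full_backup DOMAIN DISK BASE_IMAGE OVERLAY BACKUP_COPY
manual_full_backup() {
    dom="$1"; disk="$2"; base="$3"; overlay="$4"; dest="$5"
    # 1. Redirect guest writes into a temporary overlay; the base image
    #    stops changing from here on.
    $VIRSH snapshot-create-as "$dom" --disk-only --no-metadata \
        --diskspec "$disk,snapshot=external,file=$overlay" || return 1
    # 2. Copy out the now-stable base image.
    $COPY "$base" "$dest"
    rc=$?
    # 3. Merge the overlay back into the base and pivot the guest onto it,
    #    even if the copy failed.
    $VIRSH blockcommit "$dom" "$disk" --active --pivot --wait || return 1
    return $rc
}
```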

The backup API though also allows tracking differences since the last
backup and creating an incremental backup which would be a thinner image
of only the differences.
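
A hedged sketch of the XML involved, following my reading of the backup
and checkpoint XML formats. Checkpoint names, the disk name and the
paths are made up; the two helper functions just print XML:

```shell
#!/bin/sh
# checkpoint_xml NAME DISK
# Describes the checkpoint to create together with this backup, so that
# the next backup can be incremental relative to it.
checkpoint_xml() {
    cat <<EOF
<domaincheckpoint>
  <name>$1</name>
  <disks>
    <disk name='$2' checkpoint='bitmap'/>
  </disks>
</domaincheckpoint>
EOF
}

# push_backup_xml DISK TARGET [SINCE_CHECKPOINT]
# Without SINCE_CHECKPOINT this describes a full backup.
push_backup_xml() {
    echo "<domainbackup mode='push'>"
    if [ -n "${3:-}" ]; then
        # Only copy blocks changed since the named checkpoint.
        echo "  <incremental>$3</incremental>"
    fi
    cat <<EOF
  <disks>
    <disk name='$1' backup='yes' type='file'>
      <target file='$2'/>
      <driver type='qcow2'/>
    </disk>
  </disks>
</domainbackup>
EOF
}
```

Both files are then passed along, e.g.
'virsh backup-begin <domain-name> backup.xml checkpoint.xml'.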

The backup API also offers a 'pull' mode, in which the disk state is
exposed over an NBD connection and the application doing the backup
reads the actual blocks itself.
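
As an illustration, a pull-mode backup description might look like the
output of the helper below. This is a sketch based on my reading of the
backup XML format; the host, port and scratch file are assumptions:

```shell
#!/bin/sh
# pull_backup_xml DISK SCRATCH_FILE [HOST] [PORT]
# The scratch file holds the original contents of blocks the guest
# overwrites while the backup client is still reading.
pull_backup_xml() {
    cat <<EOF
<domainbackup mode='pull'>
  <server transport='tcp' name='${3:-localhost}' port='${4:-10809}'/>
  <disks>
    <disk name='$1' backup='yes' type='file'>
      <scratch file='$2'/>
    </disk>
  </disks>
</domainbackup>
EOF
}
```

The backup client then connects to that NBD server and reads the blocks
it wants.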

> 
> I also realised this is very like the sequence described at the bottom of that domainstatecapture page comparing 'direct backup' and 'Backup via temporary snapshot' - what confused me there and which I still don't understand are the two references to events. For direct backup, this step is:
> - wait for push mode event, or pull data over NBD # most time spent here
> 
> Can you expand this any? I am assuming direct backup is a 'push' mode backup as per the description at https://libvirt.org/kbase/live_full_disk_backup.html - what is this push mode event? 

So the backup operation is potentially long-running if you're backing up
a huge disk. 'virsh backup-begin' kicks off the operation and returns
right away, while the backup progresses in the background.

In push mode qemu writes the backup image itself, and the job keeps
running while the data is being written. Once it finishes, an event is
fired to the clients listening for it, notifying them that the job is
complete and the output images are finished.
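
From the command line, waiting for that could be sketched as follows,
assuming the completion event is the one virsh calls 'job-completed'.
The function name, the default timeout and the one-second pause are
illustrative:

```shell
#!/bin/sh
VIRSH="${VIRSH:-virsh}"

# push_backup_and_wait DOMAIN [TIMEOUT_SECONDS]
push_backup_and_wait() {
    dom="$1"; timeout="${2:-3600}"
    # Start listening first, so a quick job cannot finish unnoticed.
    $VIRSH event "$dom" --event job-completed --timeout "$timeout" &
    listener=$!
    sleep 1   # crude: give the listener a moment to register
    $VIRSH backup-begin "$dom" || { kill "$listener" 2>/dev/null; return 1; }
    # Blocks until the event arrives or the timeout expires.
    wait "$listener"
    # An ended job is not necessarily a successful one; inspect the result.
    $VIRSH domjobinfo "$dom" --completed
}
```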

Note that the state of the backup will still correspond to the point in
time when the operation was *started*, even when the guest OS overwrites
any blocks subsequently.

For a pull mode backup the client doing the backup knows when it's
ready so the job is not auto-finished (which would fire the event) but
rather needs to be terminated manually.
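
A sketch of such a pull-mode session, where the job is ended with
'virsh domjobabort' once the client is done. The function name is made
up, and the client command is whatever tool reads the NBD export:

```shell
#!/bin/sh
VIRSH="${VIRSH:-virsh}"

# pull_backup_session DOMAIN BACKUP_XML CLIENT_COMMAND [ARGS...]
pull_backup_session() {
    dom="$1"; xml="$2"; shift 2
    # Starts the job and the NBD export, then returns right away.
    $VIRSH backup-begin "$dom" "$xml" || return 1
    # Run the client that pulls the data over NBD.
    "$@"
    rc=$?
    # The job does not finish by itself in pull mode; end it explicitly.
    $VIRSH domjobabort "$dom" || return 1
    return $rc
}
```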

> > By default a full backup creates a stand-alone image. If you'd use
> > incremental backups, then it is actually creating images that depend on
> > each other.
> 
> OK, and that would be by populating an appropriate xml as per https://libvirt.org/formatbackup.html - which I think you answered on this list a year or two ago.
> 
> > Yes it is. Note though that since the VM was likely running at the point
> > when you took the backup the 'restore' operation will look like a
> > cold-boot after a power failure at the exact time when the backup was
> > taken.
> > 
> > 
> > Snapshots also allow you to capture memory state and also pre-date
> > backups thus they are documented a bit more in depth.
> 
> OK - just to make this explicit - if we want to capture memory state as well as disk then we *must* use snapshots, either internal or external?

Yes exactly, currently only snapshots allow memory state capture
synchronized with disk state capture.
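
For example, an external snapshot capturing both could be taken roughly
like this. The function name, snapshot name and memory file path are
illustrative:

```shell
#!/bin/sh
VIRSH="${VIRSH:-virsh}"

# snapshot_with_memory DOMAIN SNAPSHOT_NAME MEMORY_FILE DISK
snapshot_with_memory() {
    # --memspec stores the memory state in an external file;
    # --diskspec puts an external overlay on the disk at the same moment.
    $VIRSH snapshot-create-as "$1" "$2" \
        --memspec "file=$3,snapshot=external" \
        --diskspec "$4,snapshot=external"
}
```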

> 
> And - last question! - while we are covering the bases... managedsave sounds like it is designed for preserving a one-off recovery position for a potentially relatively long outage such as a hypervisor restart. VM restart will pick up just this latest saved image, but it *will* capture memory also?

A (managed)-save saves only the memory state to an image, disk images
are kept as they are. No preservation points for the disks are created.

Resuming from the (managed)-save will continue using and modifying the
disk image, with no way of going back to an earlier disk state.

It is indeed meant to e.g. preserve the state of VM while the host OS
reboots.

'managed' is in brackets as there is also a non-managed save.
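
The two variants side by side, as a small sketch (function names and the
state file path are illustrative):

```shell
#!/bin/sh
VIRSH="${VIRSH:-virsh}"

# Managed: libvirt keeps track of the state file itself; a later
# 'virsh start DOMAIN' resumes from it.
park_managed() {
    $VIRSH managedsave "$1"
}

# Non-managed: the caller picks the state file and resumes later with
# 'virsh restore STATE_FILE'.
park_to_file() {
    $VIRSH save "$1" "$2"
}
```

In both cases only memory state is written; the disk images are left
untouched.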

> 
> Once again thanks for your clarifications - it's clearing up a lot of confusion for me.
>