Re: [PATCH 2/8] backup: Document nuances between different state capture APIs

Eric Blake <eblake@xxxxxxxxxx> · Tue, 26 Jun 2018 21:44:27 -0500

On 06/26/2018 11:36 AM, Nir Soffer wrote:
On Wed, Jun 13, 2018 at 7:42 PM Eric Blake <eblake@xxxxxxxxxx> wrote:

Upcoming patches will add support for incremental backups via
a new API; but first, we need a landing page that gives an
overview of capturing various pieces of guest state, and which
APIs are best suited to which tasks.

Needs blank line between list items for easier reading of the source.

Sure.

I think we should describe checkpoints before backups, since the
expected flow is:

- user starts backup
- system creates checkpoint using virDomainCheckpointCreateXML
- system queries amount of data pointed to by the previous checkpoint
   bitmaps
- system creates temporary storage for the backup
- system starts backup using virDomainBackupBegin

I actually think it will be more common to create checkpoints via 
virDomainBackupBegin(), and not virDomainCheckpointCreateXML (the latter 
exists because it is easy, and may have a use independent from 
incremental backups, but it is the former that makes chains of 
incremental backups reliable).

That is, your first backup will be a full backup (no checkpoint as its 
start) but will create a checkpoint at the same time; then your second 
backup is an incremental backup (use the checkpoint created at the first 
backup as the start) and also creates a checkpoint in anticipation of a 
third incremental backup.
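
To make the chaining concrete, here is a minimal Python sketch of which checkpoint each backup starts from and which one it creates. The helper name plan_backup_chain is made up for this mail and is not a libvirt API; it only models the bookkeeping described above.

```python
def plan_backup_chain(count):
    """Return (start_checkpoint, new_checkpoint) for each backup, in order.

    Hypothetical helper: it models the naming only and calls no libvirt API.
    """
    plan = []
    previous = None  # the first backup is full, so it has no starting checkpoint
    for index in range(1, count + 1):
        created = f"B{index}"
        plan.append((previous, created))
        previous = created  # the next incremental backup starts here
    return plan


for start, created in plan_backup_chain(3):
    kind = "full" if start is None else f"incremental since {start}"
    print(f"{kind}: creates checkpoint {created}")
```

Each backup both consumes the previous checkpoint and leaves one behind, which is why creating the checkpoint as part of starting the backup keeps the chain consistent.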

You do have an interesting step in there - the ability to query how much 
data is pointed to in the delta between two checkpoints (that is, before 
I actually create a backup, can I pre-guess how much data it will end up 
copying).  On the other hand, the size of the temporary storage for the 
backup is not related to the amount of data tracked in the bitmap. 
Expanding on the examples in my 1/8 reply to you:

At T3, we have:

S1: |AAAA----| <- S2: |---BBB--|
B1: |XXXX----|    B2: |---XXX--|
guest sees: |AAABBB--|

where by T4 we will have:

S1: |AAAA----| <- S2: |D--BBDD-|
B1: |XXXX----|    B2: |---XXX--|
                  B3: |X----XX-|
guest sees: |DAABBDD-|
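
For anyone who wants to replay the diagrams, the following Python sketch reproduces the T3 and T4 states under the assumption that each character is one cluster and '-' means "not allocated in this layer". The function names are illustrative only.

```python
def guest_view(*layers):
    """Overlay images from base to top; '-' falls through to the layer below."""
    view = list(layers[0])
    for layer in layers[1:]:
        for i, cluster in enumerate(layer):
            if cluster != "-":
                view[i] = cluster
    return "".join(view)


def guest_write(active, bitmap, offset, data):
    """Write data into the active image and mark the same clusters dirty."""
    image, dirty = list(active), list(bitmap)
    for i, cluster in enumerate(data, start=offset):
        image[i] = cluster
        dirty[i] = "X"
    return "".join(image), "".join(dirty)


S1 = "AAAA----"
S2_T3 = "---BBB--"

# Between T3 and T4 the guest writes D at cluster 0 and at clusters 5-6,
# while a fresh bitmap B3 records those writes.
s2, b3 = guest_write(S2_T3, "--------", 0, "D")
S2_T4, B3 = guest_write(s2, b3, 5, "DD")

print(guest_view(S1, S2_T3))  # guest view at T3
print(guest_view(S1, S2_T4))  # guest view at T4
print(B3)
```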

Back at T3, using B2 as our dirty bitmap, there are two backup models we 
can pursue to get at the data tracked by that bitmap.

The first is push-model backup (blockdev-backup with "sync":"top" to the 
actual backup file) - qemu directly writes the |---BBB--| sequence into 
the destination file (based on the contents of B2), whether or not S2 is 
modified in the meantime; in this mode, qemu is smart enough to not 
bother copying clusters to the destination that were not in the bitmap. 
So the fact that B2 mentions 3 dirty clusters indeed proves to be the 
right size for the destination file.
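
As a rough illustration of the push model (a Python sketch over the same one-character-per-cluster strings, not what qemu actually executes), copying only the clusters marked in the bitmap looks like this:

```python
def push_backup(view_at_backup, bitmap):
    """Copy only clusters marked 'X' in the bitmap; everything else stays unallocated."""
    return "".join(
        cluster if dirty == "X" else "-"
        for cluster, dirty in zip(view_at_backup, bitmap)
    )


def dirty_clusters(bitmap):
    """Number of clusters the bitmap says must be copied."""
    return bitmap.count("X")


destination = push_backup("AAABBB--", "---XXX--")
print(destination, dirty_clusters("---XXX--"))
```

The destination contains exactly as many allocated clusters as the bitmap has dirty bits, which is the sizing property described above.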

The second is pull-model backup (blockdev-backup with "sync":"none" to a 
temporary file, coupled with a read-only NBD server on the temporary 
file that also exposes bitmap B2 via NBD_CMD_BLOCK_STATUS) - here, if 
qemu can guarantee that the client would read only dirty clusters, then 
it only has to write to the temporary file if the guest changes a 
cluster that was tracked in B2 (so at most the temporary file would 
contain |-----B--| if the NBD client finishes before T4); but more 
likely, qemu will play conservative and write to the temporary file for 
ANY changes whether or not they are to areas covered by B2 (in which 
case the temporary file could contain |A----B0-| for the three writes 
done by T4).  Or put another way, if qemu can guarantee a nice client, 
then the size of B2 probably overestimates the size of the temporary 
file; but if qemu plays conservative by assuming the client will read 
even portions of the file that weren't dirty, then keeping those reads 
constant will require the temporary file to be as large as the guest is 
able to dirty data while the backup continues, which may be far larger 
than the size of B2.  [And maybe this argues that we want a way for an 
NBD export to force EIO read errors for anything outside of the exported 
dirty bitmap, thus making the client play nice, so that the temporary 
file does not have to grow beyond the size of the bitmap - but that's a 
future feature request]
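
The two pull-model policies can be compared with a small Python sketch. This models only the bookkeeping, assuming one character per cluster and writing '0' for a cluster that read as zeroes at backup time; scratch_after_writes is a made-up name.

```python
def scratch_after_writes(view_at_backup, bitmap, written_clusters, conservative):
    """Return the scratch file contents after the guest overwrites some clusters.

    Before a cluster is overwritten, its old contents are saved to the scratch
    file. A "nice client" policy saves only clusters tracked in the bitmap; the
    conservative policy saves every overwritten cluster.
    """
    scratch = ["-"] * len(view_at_backup)
    for i in written_clusters:
        if conservative or bitmap[i] == "X":
            old = view_at_backup[i]
            scratch[i] = "0" if old == "-" else old  # unallocated reads as zero
    return "".join(scratch)


view_t3 = "AAABBB--"
b2 = "---XXX--"
writes_by_t4 = [0, 5, 6]  # the three clusters the guest dirties by T4

print(scratch_after_writes(view_t3, b2, writes_by_t4, conservative=False))
print(scratch_after_writes(view_t3, b2, writes_by_t4, conservative=True))
```

In the conservative case the scratch file size follows the guest's write rate during the backup rather than the size of B2.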

+    <h2><a id="examples">Examples</a></h2>
+    <p>The following two sequences both capture the disk state of a
+      running guest, then complete with the guest running on its
+      original disk image; but with a difference that an unexpected
+      interruption during the first mode leaves a temporary wrapper
+      file that must be accounted for, while interruption of the
+      second mode has no impact to the guest.</p>

This is not clear; I read this several times and I'm not sure what
you mean here.

I'm trying to convey the point that with example 1...

Blank line between paragraphs

+    <p>1. Backup via temporary snapshot
+      <pre>
+virDomainFSFreeze()
+virDomainSnapshotCreateXML(VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)

...if you are interrupted here, your <domain> XML has changed to point 
to the snapshot file...

+virDomainFSThaw()
+third-party copy the backing file to backup storage # most time spent here

+virDomainBlockCommit(VIR_DOMAIN_BLOCK_COMMIT_ACTIVE) per disk
+wait for commit ready event per disk
+virDomainBlockJobAbort() per disk

...and it is not until here that your <domain> XML is back to its 
pre-backup state.  If the backup is interrupted for any reason, you have 
to manually get things back to the pre-backup layout, whether or not you 
were able to salvage the backup data.

+      </pre></p>

I think we should mention virDomainFSFreeze and virDomainFSThaw before
these examples, in the same way we mention the other APIs.

Can do.

+
+    <p>2. Direct backup
+      <pre>
+virDomainFSFreeze()
+virDomainBackupBegin()
+virDomainFSThaw()
+wait for push mode event, or pull data over NBD # most time spent here
+virDomainBackupEnd()

In this example 2, using the new APIs, the <domain> XML is unchanged 
through the entire operation.  If you interrupt things in the middle, 
you may have to scrap the backup data as not being viable, but you don't 
have to do any manual cleanup to get your domain back to the pre-backup 
layout.

+    </pre></p>
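
To summarize the difference between the two examples, here is a Python sketch that records, for each step, whether the <domain> XML differs from its pre-backup layout once that step completes. The step names are plain strings taken from the two sequences; nothing here calls libvirt, and the table is my reading of the examples rather than anything libvirt reports.

```python
# (step, domain XML differs from pre-backup layout after this step)
SNAPSHOT_FLOW = [
    ("virDomainFSFreeze", False),
    ("virDomainSnapshotCreateXML", True),  # domain now points at the snapshot file
    ("virDomainFSThaw", True),
    ("third-party copy", True),
    ("virDomainBlockCommit", True),
    ("wait for commit ready event", True),
    ("virDomainBlockJobAbort", False),  # back to the pre-backup layout
]

BACKUP_FLOW = [
    ("virDomainFSFreeze", False),
    ("virDomainBackupBegin", False),
    ("virDomainFSThaw", False),
    ("wait for push event or pull over NBD", False),
    ("virDomainBackupEnd", False),
]


def needs_manual_cleanup(flow, completed_steps):
    """True if an interruption after completed_steps steps leaves the layout changed."""
    if completed_steps == 0:
        return False
    return flow[completed_steps - 1][1]


risky = [name for i, (name, _) in enumerate(SNAPSHOT_FLOW, start=1)
         if needs_manual_cleanup(SNAPSHOT_FLOW, i)]
print(risky)
```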

This means that virDomainBackupBegin will create a checkpoint, and libvirt
will have to create the temporary storage for the backup (e.g. disk for push
model, or temporary snapshot for the pull model). Libvirt will most likely
use local storage, which may fail if the host does not have enough local
storage.

virDomainBackupBegin() has an optional <disks> XML element - if 
provided, then YOU can control the files (the destination on push model, 
ultimately including a remote network destination, such as via NBD, 
gluster, sheepdog, ...; or the scratch file for pull model, which 
probably only makes sense locally as the file gets thrown away as soon 
as the 3rd-party NBD client finishes).  Libvirt only generates a 
filename if you don't provide that level of detail.  You're right that 
the local storage running out of space can be a concern - but also 
remember that incremental backups are designed to be less invasive than 
full backups, AND that if one backup fails, you can then kick off 
another backup using the same checkpoint as starting point as the one 
that failed (that is, when libvirt is using B1 as its basis for a 
backup, but also created B2 at the same time, then you can use 
virDomainCheckpointDelete to remove B2 by merging the B1/B2 bitmaps back 
into B1, with B1 once again tracking changes from the previous 
successful backup to the current point in time).
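
The recovery step amounts to a bitwise OR of the two bitmaps. A minimal Python sketch, using the same 'X'/'-' notation as the diagrams (merge_bitmaps is an illustrative name, not a libvirt or qemu call):

```python
def merge_bitmaps(older, newer):
    """Fold the newer bitmap into the older one: a cluster is dirty if either says so."""
    return "".join(
        "X" if "X" in (a, b) else "-"
        for a, b in zip(older, newer)
    )


# B1 tracked changes up to the failed backup; B2 tracked changes since then.
merged = merge_bitmaps("XXXX----", "---XXX--")
print(merged)
```

After the merge, B1 again covers everything written since the last successful backup, so a retry starting from B1 misses nothing.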

But this may be good enough for many users, so maybe it is good to
have this.

I think we need to show here the more low level flow that oVirt will use:

Backup using external temporary storage
- virDomainFSFreeze()
- virDomainCheckpointCreateXML()
- virDomainFSThaw()
- Here oVirt will need to query the checkpoints, to understand how much
   temporary storage is needed for the backup. I hope we have an API
  for this (did not read the next patches yet).

I have not exposed one so far, nor do I know if qemu has that easily 
available.  But since it matters to you, we can make it a priority to 
add that (and the API would need to be added to libvirt.so at the same 
time as the other new APIs, whether or not I can make it in time for the 
freeze at the end of this week).

- virDomainBackupBegin()
- third party copy data...
- virDomainBackupEnd()

Again, note that oVirt will probably NOT call
virDomainCheckpointCreateXML() directly, but will instead do:

virDomainFSFreeze();
virDomainBackupBegin(dom, "<domainbackup type='pull'/>", 
"<domaincheckpoint><name>B1</name></domaincheckpoint>", 0);
virDomainFSThaw();
third party copy data
virDomainBackupEnd();

for the first full backup, then for the next incremental backup, do:

virDomainFSFreeze();
virDomainBackupBegin(dom,
    "<domainbackup type='pull'><incremental>B1</incremental></domainbackup>",
    "<domaincheckpoint><name>B2</name></domaincheckpoint>", 0);
virDomainFSThaw();
third party copy data
virDomainBackupEnd();

where you are creating bitmap B2 at the time of the first incremental 
backup (the second backup overall), and that backup consists of the data 
changed since the creation of bitmap B1 at the time of the earlier full 
backup.

Then, as I mentioned earlier, the minimal XML forces libvirt to generate 
filenames (which may or may not match what you want), so you can 
certainly pass in more verbose XML:

<domainbackup type='pull'>
  <incremental>B1</incremental>
  <server transport='unix' socket='/path/to/server'/>
  <disks>
    <disk name='vda' type='block'>
      <scratch dev='/path/to/scratch/dev'/>
    </disk>
  </disks>
</domainbackup>

and of course, we'll eventually want TLS thrown in the mix (my initial 
implementation has completely bypassed that, other than the fact that 
the <server> element is a great place to stick in the information needed 
for telling qemu's server to only accept clients that know the right TLS 
magic).
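
If a management application wants to generate these documents rather than paste strings, a sketch along the following lines would do. It uses only Python's standard xml.etree.ElementTree; the element and attribute names are the ones from the proposed XML in this mail and may still change, and the function names are made up.

```python
import xml.etree.ElementTree as ET


def build_backup_xml(incremental=None, socket=None, scratch=None):
    """Build a pull-mode <domainbackup> document as a string.

    incremental: name of the checkpoint to start from, or None for a full backup
    socket:      optional unix socket path for the NBD server
    scratch:     optional dict mapping disk name to scratch block device
    """
    root = ET.Element("domainbackup", {"type": "pull"})
    if incremental is not None:
        ET.SubElement(root, "incremental").text = incremental
    if socket is not None:
        ET.SubElement(root, "server", {"transport": "unix", "socket": socket})
    if scratch:
        disks = ET.SubElement(root, "disks")
        for name, dev in scratch.items():
            disk = ET.SubElement(disks, "disk", {"name": name, "type": "block"})
            ET.SubElement(disk, "scratch", {"dev": dev})
    return ET.tostring(root, encoding="unicode")


def build_checkpoint_xml(name):
    """Build the matching <domaincheckpoint> document."""
    root = ET.Element("domaincheckpoint")
    ET.SubElement(root, "name").text = name
    return ET.tostring(root, encoding="unicode")


print(build_backup_xml("B1", "/path/to/server", {"vda": "/path/to/scratch/dev"}))
print(build_checkpoint_xml("B2"))
```

Generating the XML through a tree builder also guarantees that every element is properly closed, which is easy to get wrong when typing it by hand.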

If this example helps, I can flesh out the HTML to give these further
insights.

And, if wrapping FSFreeze/Thaw is that common, we'll probably want to 
reach the point where we add VIR_DOMAIN_BACKUP_QUIESCE as a flag 
argument to automatically do it as part of virDomainBackupBegin().

This is great documentation, showing both the APIs and how they are
used together, we need more of this!

Well, and it's also been a great resource for me as I continue to hammer 
out the (LOADS) of code needed to reach a working demo.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
