Re: Overview of libvirt incremental backup API, part 2 (incremental/differential pull mode)

On Fri, Oct 5, 2018 at 7:58 AM Eric Blake <eblake@xxxxxxxxxx> wrote:
On 10/4/18 12:05 AM, Eric Blake wrote:
> The following (long) email describes a portion of the work-flow of how
> my proposed incremental backup APIs will work, along with the backend
> QMP commands that each one executes.  I will reply to this thread with
> further examples (the first example is long enough to be its own email).
> This is an update to a thread last posted here:
> https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html
>

> More to come in part 2.
>

- Second example: a sequence of incremental backups via pull model

In the first example, we did not create a checkpoint at the time of the
full pull. That means we have no way to track a delta of changes since
that point in time.

Why do we want to support backup without creating a checkpoint?

If we don't have any real use case, I suggest always requiring a checkpoint.
 
Let's repeat the full backup (reusing the same
backup.xml from before), but this time, we'll add a new parameter, a
second XML file for describing the checkpoint we want to create.

Actually, it was easy enough to get virsh to write the XML for me
(because it was very similar to existing code in virsh that creates XML
for snapshot creation):

$ $virsh checkpoint-create-as --print-xml $dom check1 testing \
    --diskspec sdc --diskspec sdd | tee check1.xml
<domaincheckpoint>
   <name>check1</name>

We should use an id, not a name, even if the name is also unique, as in
most libvirt APIs.

In RHV we will always use a UUID for this.
 
   <description>testing</description>
   <disks>
     <disk name='sdc'/>
     <disk name='sdd'/>
   </disks>
</domaincheckpoint>

I had to supply two --diskspec arguments to virsh to select just the two
qcow2 disks that I am using in my example (rather than every disk in the
domain, which is the default when <disks> is not present).

So is <disks/> a valid configuration that selects all disks, or does
omitting the <disks> element select all disks?
 
I also picked
a name (mandatory) and description (optional) to be associated with the
checkpoint.
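
For comparison (a sketch, not run here), dropping the --diskspec
arguments would print checkpoint XML without any per-disk selection,
which per the default described above means every disk in the domain,
vda included:

$ $virsh checkpoint-create-as --print-xml $dom check1 testing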

The backup.xml file that we plan to reuse still mentions scratch1.img
and scratch2.img as files needed for staging the pull request. However,
any contents in those files could interfere with our second backup
(after all, every cluster written into that file from the first backup
represents a point in time that was frozen at the first backup; but our
second backup will want to read the data as the guest sees it now rather
than what it was at the first backup), so we MUST regenerate the scratch
files. (Perhaps I should have just deleted them at the end of example 1
in my previous email, had I remembered when typing that mail).

$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img

Now, to begin the full backup and create a checkpoint at the same time.
Also, this time around, it would be nice if the guest had a chance to
freeze I/O to the disks prior to the point chosen as the checkpoint.
Assuming the guest is trusted, and running the qemu guest agent (qga),
we can do that with:

$ $virsh fsfreeze $dom
$ $virsh backup-begin $dom backup.xml check1.xml
Backup id 1 started
backup used description from 'backup.xml'
checkpoint used description from 'check1.xml'
$ $virsh fsthaw $dom

Great, this answers my (unsent) question about freeze/thaw from part 1 :-)

and eventually, we may decide to add a VIR_DOMAIN_BACKUP_BEGIN_QUIESCE
flag to combine those three steps into a single API (matching what we've
done on some other existing API).  In other words, the sequence of QMP
operations performed during virDomainBackupBegin are quick enough that
they won't stall a freeze operation (at least Windows is picky if you
stall a freeze operation longer than 10 seconds).
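
Until such a flag exists, a minimal shell sketch (reusing only the
commands shown above) that keeps the freeze window short and makes sure
the thaw runs even if backup-begin fails could be:

$ $virsh fsfreeze $dom && {
    $virsh backup-begin $dom backup.xml check1.xml || echo 'backup-begin failed' >&2
    $virsh fsthaw $dom
  }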

We use fsFreeze/fsThaw directly in RHV since we need to support external
snapshots (e.g. ceph), so we don't need this functionality, but it sounds
like a good idea to make it work the same way as snapshots.
 

The tweaked $virsh backup-begin now results in a call to:
  virDomainBackupBegin(dom, "<domainbackup ...>",
    "<domaincheckpoint ...", 0)
and in turn libvirt makes a similar sequence of QMP calls as before,
with a slight modification in the middle:
{"execute":"nbd-server-start",...
{"execute":"blockdev-add",...

This does not work yet for network disks like "rbd" and "glusterfs";
does that mean they will not be supported for backup?
 
{"execute":"transaction",
  "arguments":{"actions":[
   {"type":"blockdev-backup", "data":{
    "device":"$node1", "target":"backup-sdc", "sync":"none",
    "job-id":"backup-sdc" }},
   {"type":"blockdev-backup", "data":{
    "device":"$node2", "target":"backup-sdd", "sync":"none",
    "job-id":"backup-sdd" }}
   {"type":"block-dirty-bitmap-add", "data":{
    "node":"$node1", "name":"check1", "persistent":true}},
   {"type":"block-dirty-bitmap-add", "data":{
    "node":"$node2", "name":"check1", "persistent":true}}
  ]}}
{"execute":"nbd-server-add",...


What if this sequence fails in the middle? Will libvirt handle all failures
and roll back to the previous state?

What are the semantics of "execute": "transaction"? Does it mean that qemu
will handle all possible failures in one of the actions?

(Will continue later)
 

The only change was adding more actions to the "transaction" command -
in addition to kicking off the fleece image in the scratch nodes, it
ALSO added a persistent bitmap to each of the original images, to track
all changes made after the point of the transaction.  The bitmaps are
persistent - at this point (well, it's better if you wait until after
backup-end), you could shut the guest down and restart it, and libvirt
will still remember that the checkpoint exists, and qemu will continue to
track guest writes via the bitmap. However, the backup job itself is
currently live-only, and shutting down the guest while a backup
operation is in effect will lose track of the backup job.  What that
really means is that if the guest shuts down, your current backup job is
hosed (you cannot ever get back the point-in-time data from your API
request - as your next API request will be a new point in time) - but
you have not permanently ruined the guest, and your recovery is to just
start a new backup.
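
A quick way to convince yourself of that persistence (a sketch, using
the checkpoint-list command shown a bit further below) is to shut the
guest down after backup-end, start it again, and list the checkpoints:

$ $virsh shutdown $dom
$ $virsh start $dom
$ $virsh checkpoint-list $dom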

Pulling the data out from the backup is unchanged from example 1; virsh
backup-dumpxml will show details about the job (yes, the job id is still
1 for now), and when ready, virsh backup-end will end the job and
gracefully take down the NBD server with no difference in QMP commands
from before.  Thus, the creation of a checkpoint didn't change any of
the fundamentals of capturing the current backup, but rather is in
preparation for the next step.
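
For instance, peeking at the running job before ending it (a sketch;
output omitted, and the job id is still 1):

$ $virsh backup-dumpxml $dom 1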

$ $virsh backup-end $dom 1
Backup id 1 completed
$ rm scratch1.img scratch2.img

[We have not yet designed how qemu bitmaps will interact with external
snapshots - but I see two likely scenarios:
  1. Down the road, I add a virDomainSnapshotCheckpointCreateXML() API,
which adds a checkpointXML parameter but otherwise behaves like the
existing virDomainSnapshotCreateXML - if that API is added in a
different release than my current API proposals, that's yet another
libvirt.so rebase to pick up the new API.
  2. My current proposal of virDomainBackupBegin(dom, "<domainbackup>",
"<domaincheckpoint>", flags) could instead be tweaked to a single XML
parameter, virDomainBackupBegin(dom, "
<domainbackup>
   <domaincheckpoint> ... </domaincheckpoint>
</domainbackup>", flags) prior to adding my APIs to libvirt 4.9, then
down the road, we also tweak <domainsnapshot> to take an optional
<domaincheckpoint> sub-element, and thus reuse the existing
virDomainSnapshotCreateXML() to now also create checkpoints without a
further API addition.
Speak up now if you have a preference between the two ideas]

Now that we have concluded the full backup and created a checkpoint, we
can do more things with the checkpoint (it is persistent, after all).
For example:

$ $virsh checkpoint-list $dom
  Name                 Creation Time
--------------------------------------------
  check1               2018-10-04 15:02:24 -0500

This called virDomainListCheckpoints(dom, &array, 0) under the hood to get a
list of virDomainCheckpointPtr objects, then called
virDomainCheckpointGetXMLDesc(array[0], 0) to scrape the XML describing
that checkpoint in order to display information.  Or another approach,
using virDomainCheckpointGetXMLDesc(virDomainCheckpointCurrent(dom, 0), 0):

$ $virsh checkpoint-current $dom | head
<domaincheckpoint>
   <name>check1</name>
   <description>testing</description>
   <creationTime>1538683344</creationTime>
   <disks>
     <disk name='vda' checkpoint='no'/>
     <disk name='sdc' checkpoint='bitmap' bitmap='check1'/>
     <disk name='sdd' checkpoint='bitmap' bitmap='check1'/>
   </disks>
   <domain type='kvm'>

which shows the current checkpoint (that is, the checkpoint owning the
bitmap that is still receiving live updates), and which bitmap names in
the qcow2 files are in use. For convenience, it also recorded the full
<domain> description at the time the checkpoint was captured (I used
head to limit the size of this email), so that if you later hot-plug
things, you still have a record of what state the machine had at the
time the checkpoint was created.

The XML output of a checkpoint description is normally static, but
sometimes it is useful to know an approximate size of the guest data
that has been dirtied since a checkpoint was created (a dynamic value
that grows as a guest dirties more clusters).  For that, it makes sense
to have a flag to request the dynamic data; it's also useful to have a
flag that suppresses the (lengthy) <domain> output:

$ $virsh checkpoint-current $dom --size --no-domain
<domaincheckpoint>
   <name>check1</name>
   <description>testing</description>
   <creationTime>1538683344</creationTime>
   <disks>
     <disk name='vda' checkpoint='no'/>
     <disk name='sdc' checkpoint='bitmap' bitmap='check1' size='1048576'/>
     <disk name='sdd' checkpoint='bitmap' bitmap='check1' size='65536'/>
   </disks>
</domaincheckpoint>

This maps to virDomainCheckpointGetXMLDesc(chk,
VIR_DOMAIN_CHECKPOINT_XML_NO_DOMAIN | VIR_DOMAIN_CHECKPOINT_XML_SIZE).
Under the hood, libvirt calls
{"execute":"query-block"}
and converts the bitmap size reported by qemu into an estimate of the
number of bytes that would be required if you were to start a backup
from that checkpoint right now.  Note that the result is just an
estimate of the storage taken by guest-visible data; you'll probably
want to use 'qemu-img measure' to convert that into a size of how much a
matching qcow2 image would require when metadata is added in; also
remember that the number is constantly growing as the guest writes and
causes more of the image to become dirty.  But having a feel for how
much has changed can be useful for determining if continuing a chain of
incremental backups still makes more sense, or if enough of the guest
data has changed that doing a full backup is smarter; it is also useful
for preallocating how much storage you will need for an incremental backup.
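
As a rough sketch of that 'qemu-img measure' conversion (treating the
dirty-byte estimate reported above for sdc as a virtual size, and
reading the 'fully allocated size' line of the output as an upper bound
for a qcow2 image holding that much data):

$ $qemu_img measure -O qcow2 --size 1048576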

Technically, libvirt's mapping of a checkpoint size request to a single
{"execute":"query-block"} works only when querying the size of the
current bitmap. The command also works when querying the cumulative size
since an older checkpoint, but under the hood, libvirt must juggle
things to create a temporary bitmap, call a few
x-block-dirty-bitmap-merge, query the size of that temporary bitmap,
then clean things back up again (after all, size(A) + size(B) >=
size(A|B), depending on how many clusters were touched during both A and
B's tracking of dirty clusters).  Again, a nice benefit of having
libvirt manage multiple qemu bitmaps under a single libvirt API.

Of course, the real reason we created a checkpoint with our full backup
is that we want to take an incremental backup next, rather than
repeatedly taking full backups. For this, we need a one-line
modification to our backup XML to add an <incremental> element; we also
want to update our checkpoint XML to start yet another checkpoint when
we run our first incremental backup.

$ cat > backup.xml <<EOF
<domainbackup mode='pull'>
   <server transport='tcp' name='localhost' port='10809'/>
   <incremental>check1</incremental>
   <disks>
     <disk name='$orig1' type='file'>
       <scratch file='$PWD/scratch1.img'/>
     </disk>
     <disk name='sdd' type='file'>
       <scratch file='$PWD/scratch2.img'/>
     </disk>
   </disks>
</domainbackup>
EOF
$ $virsh checkpoint-create-as --print-xml $dom check2 \
    --diskspec sdc --diskspec sdd | tee check2.xml
<domaincheckpoint>
   <name>check2</name>
   <disks>
     <disk name='sdc'/>
     <disk name='sdd'/>
   </disks>
</domaincheckpoint>
$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img

And again, it's time to kick off the backup job:

$ $virsh backup-begin $dom backup.xml check2.xml
Backup id 1 started
backup used description from 'backup.xml'
checkpoint used description from 'check2.xml'

This time, the incremental backup causes libvirt to do a bit more work
under the hood:

{"execute":"nbd-server-start",
  "arguments":{"addr":{"type":"inet",
   "data":{"host":"localhost", "port":"10809"}}}}
{"execute":"blockdev-add",
  "arguments":{"driver":"qcow2", "node-name":"backup-sdc",
   "file":{"driver":"file",
    "filename":"$PWD/scratch1.img"},
    "backing":"'$node1'"}}
{"execute":"blockdev-add",
  "arguments":{"driver":"qcow2", "node-name":"backup-sdd",
   "file":{"driver":"file",
    "filename":"$PWD/scratch2.img"},
    "backing":"'$node2'"}}
{"execute":"block-dirty-bitmap-add",
  "arguments":{"node":"$node1", "name":"backup-sdc"}}
{"execute":"x-block-dirty-bitmap-merge",
  "arguments":{"node":"$node1", "src_name":"check1",
  "dst_name":"backup-sdc"}}'
{"execute":"block-dirty-bitmap-add",
  "arguments":{"node":"$node2", "name":"backup-sdd"}}
{"execute":"x-block-dirty-bitmap-merge",
  "arguments":{"node":"$node2", "src_name":"check1",
  "dst_name":"backup-sdd"}}'
{"execute":"transaction",
  "arguments":{"actions":[
   {"type":"blockdev-backup", "data":{
    "device":"$node1", "target":"backup-sdc", "sync":"none",
    "job-id":"backup-sdc" }},
   {"type":"blockdev-backup", "data":{
    "device":"$node2", "target":"backup-sdd", "sync":"none",
    "job-id":"backup-sdd" }},
   {"type":"x-block-dirty-bitmap-disable", "data":{
    "node":"$node1", "name":"backup-sdc"}},
   {"type":"x-block-dirty-bitmap-disable", "data":{
    "node":"$node2", "name":"backup-sdd"}},
   {"type":"x-block-dirty-bitmap-disable", "data":{
    "node":"$node1", "name":"check1"}},
   {"type":"x-block-dirty-bitmap-disable", "data":{
    "node":"$node2", "name":"check1"}},
   {"type":"block-dirty-bitmap-add", "data":{
    "node":"$node1", "name":"check2", "persistent":true}},
   {"type":"block-dirty-bitmap-add", "data":{
    "node":"$node2", "name":"check2", "persistent":true}}
  ]}}
{"execute":"nbd-server-add",
  "arguments":{"device":"backup-sdc", "name":"sdc"}}
{"execute":"nbd-server-add",
  "arguments":{"device":"backup-sdd", "name":"sdd"}}
{"execute":"x-nbd-server-add-bitmap",
  "arguments":{"name":"sdc", "bitmap":"backup-sdc"}}
{"execute":"x-nbd-server-add-bitmap",
  "arguments":{"name":"sdd", "bitmap":"backup-sdd"}}

Two things stand out here, different from the earlier full backup. First
is that libvirt is now creating a temporary non-persistent bitmap,
merging all data from check1 into the temporary, then freezing writes
into the temporary bitmap during the transaction, and telling NBD to
expose the bitmap to clients. The second is that since we want this
backup to start a new checkpoint, we disable the old bitmap and create a
new one. The two additions are independent - it is possible to create an
incremental backup (<incremental> in the backup XML) without triggering a
new checkpoint (presence of non-null checkpoint XML).  In fact, taking
an incremental backup without creating a checkpoint is effectively doing
differential backups, where multiple backups started at different times
each contain all cumulative changes since the same original point in
time, such that later backups are larger than earlier backups, but you
no longer have to chain those backups to one another to reconstruct the
state in any one of the backups.
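
As a concrete sketch of that differential mode: a later run (started
only after the current job has been ended with backup-end) could reuse
a backup.xml that still names check1 in <incremental>, and simply omit
the checkpoint XML so that no new checkpoint is created:

$ $virsh backup-begin $dom backup.xml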

Now that the pull-model backup job is running, we want to scrape the
data off the NBD server.  Merely reading nbd://localhost:10809/sdc will
read the full contents of the disk - but that defeats the purpose of
using the checkpoint in the first place to reduce the amount of data to
be backed up. So, let's modify our image-scraping loop from the first
example, to now have one client utilizing the x-dirty-bitmap command
line extension to drive other clients.  Note: that extension is marked
experimental in part because it has screwy semantics: if you use it, you
can't reliably read any data from the NBD server, but instead can
interpret 'qemu-img map' output by treating any "data":false lines as
dirty, and "data":true entries as unchanged.

$ image_opts=driver=nbd,export=sdc,server.type=inet,
$ image_opts+=server.host=localhost,server.port=10809,
$ image_opts+=x-dirty-bitmap=qemu:dirty-bitmap:backup-sdc
$ $qemu_img create -f qcow2 inc12.img $size_of_orig1
$ $qemu_img rebase -u -f qcow2 -F raw -b nbd://localhost:10809/sdc \
   inc12.img
$ while read line; do
   [[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.false.* ]] ||
     continue
   start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]}
   qemu-io -C -c "r $start $len" -f qcow2 inc12.img
done < <($qemu_img map --output=json --image-opts "$image_opts")
$ $qemu_img rebase -u -f qcow2 -b '' inc12.img

As captured, inc12.img is an incomplete qcow2 file (it only includes
clusters touched by the guest since the last incremental or full
backup); but since we output into a qcow2 file, we can easily repair the
damage:

$ $qemu_img rebase -u -f qcow2 -F qcow2 -b full1.img inc12.img

creating the qcow2 chain 'full1.img <- inc12.img' that contains
identical guest-visible contents as would be present in a full backup
done at the same moment.
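
If you want to double-check the repaired chain (output omitted),
qemu-img can walk the backing files for you:

$ $qemu_img info --backing-chain inc12.img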

Of course, with the backups now captured, we clean up:

$ $virsh backup-end $dom 1
Backup id 1 completed
$ rm scratch1.img scratch2.img

and this time, virDomainBackupEnd() had to do one additional bit of work
to delete the temporary bitmaps:

{"execute":"nbd-server-remove",
  "arguments":{"name":"sdc"}}
{"execute":"nbd-server-remove",
  "arguments":{"name":"sdd"}}
{"execute":"nbd-server-stop"}
{"execute":"block-job-cancel",
  "arguments":{"device":"backup-sdc"}}
{"execute":"block-job-cancel",
  "arguments":{"device":"backup-sdd"}}
{"execute":"blockdev-del",
  "arguments":{"node-name":"backup-sdc"}}
{"execute":"blockdev-del",
  "arguments":{"node-name":"backup-sdd"}}
{"execute":"block-dirty-bitmap-remove",
  "arguments":{"node":"$node1", "name":"backup-sdc"}}
{"execute":"block-dirty-bitmap-remove",
  "arguments":{"node":"$node2", "name":"backup-sdd"}}

At this point, it should be fairly obvious that you can create more
incremental backups, by repeatedly updating the <incremental> line in
backup.xml, and adjusting the checkpoint XML to move on to a successive
name.  And while incremental backups are the most common (using the
current active checkpoint as the <incremental> when starting the next),
the scheme is also set up to permit differential backups from any
existing checkpoint to the current point in time (since libvirt is
already creating a temporary bitmap as its basis for the
x-nbd-server-add-bitmap, all it has to do is just add an appropriate
number of x-block-dirty-bitmap-merge calls to collect all bitmaps in the
chain from the requested checkpoint to the current checkpoint).
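
A sketch of the next incremental round, with a hypothetical check3 name
and everything else reusing the files and commands from above:

$ sed -i 's,<incremental>check1<,<incremental>check2<,' backup.xml
$ $virsh checkpoint-create-as --print-xml $dom check3 \
    --diskspec sdc --diskspec sdd > check3.xml
$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
$ $virsh backup-begin $dom backup.xml check3.xml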

More to come in part 3.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
