Re: [RFC v2] external (pull) backup API

John Snow <jsnow@xxxxxxxxxx> · Thu, 12 Apr 2018 00:14:34 -0400

On 04/11/2018 12:32 PM, Eric Blake wrote:
> On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
>> Hi, all.                                                                                         
>>                                                                                                  
>> This is another RFC on pull backup API. This API provides means to read domain                   
>> disks in a snapshotted state so that client can back them up as well as means                    
>> to write domain disks to revert them to backed up state. The previous version                    
>> of RFC is [1]. I'll also describe the API implementation details to shed light                   
>> on misc qemu dirty bitmap commands usage.                                                        
> 
> This is a first-pass review (making comments as I first encounter
> something, even if it gets explained later in the email)
> 
>>                                                                                                  
>> This API does not use existent disks snapshots. Instead it introduces snapshots                  
>> provided by qemu's blockdev-backup command. The reason is we need snapshotted                    
>> disk state only temporarily for duration of backup operation and newly                           
>> introduced snapshots can be easily discarded at the end of operation without                     
>> block commit operation. Technically difference is next. On usual snapshot we                     
>> create new image backed by original and all new data goes to the new image thus                  
>> original image stays in a snapshotted state. In temporary snapshots we create                    
>> new image backed by original and all new data still goes to the original image                   
>> but before new data is written old data to be overwritten is popped out to the new               
>> image thus we get snapshotted state thru new image.                                              
> 

Oh, I see -- you're using blockdev-backup sync=none to accomplish
fleecing snapshots. It's a little confusing here without the sync=none
information, as usually blockdev-backup provides... backups, not snapshots.

Your cover letter would be a little clearer with that information.

For everyone else:

Node fleecing is a very dumb name that means "Live copying of arbitrary
blocks from a live node." In QEMU, it works like this:

(1) Create a new active layer for the node to be fleeced.

[node] <-- [snapshot]

Note that this snapshot node is backed by the node named "node", not a
file named "node". The snapshot is supported by a qcow2 file on local
storage that starts empty.

(2) Start blockdev-backup sync=none FROM the node TO the snapshot:

> blockdev-backup device=node target=snapshot sync=none

This means that any time 'node' is written to, the information will get
written to 'snapshot' on-demand, preserving snapshot as a point-in-time
snapshot of node, while leaving 'node' in-use for any other
nodes/devices using it. It's effectively the opposite of an external
snapshot, facilitated by a live COW job.

(3) an NBD server may be started to export the "snapshot"

nbd-server-add device=snapshot

(4+) At this point, the NBD client can copy the snapshot data out, and
the NBD export can be closed upon completion. Then, the snapshot can be
removed/deleted.

Unlike traditional external snapshot and commit workflow, this snapshot
can be deleted at any time without jeopardizing the data it is a
snapshot of.

> So, rewriting this to make sure I understand, let's start with a disk
> with contents A, then take a snapshot, then write B:
> 
> In the existing libvirt snapshot APIs, the data gets distributed as:
> 
> base (contents A) <- new active (contents B)
> 
> where you want the new API:
> 
> base, remains active (contents B) ~~~ backup (contents A)
> 
>>                                                                                                  
>> Disks snapshots as well as disks itself are avaiable to read/write thru qemu                     
>> NBD server.                                                                                      
> 
> So the biggest reason for a new libvirt API is that we need management
> actions to control which NBD images from qemu are exposed and torn down
> at the appropriate sequences.
> 
>>                                                                                                  
>> Here is typical actions on domain backup:                                                        
>>                                                                                                  
>> - create temporary snapshot of domain disks of interest                                          
>> - export snaphots thru NBD                                                                       
>> - back them up                                                                                   
>> - remove disks from export                                                                       
>> - delete temporary snapshot                                                                      
>>                                                                                                  
>> and typical actions on domain restore:                                                           
>>                                                                                                  
>> - start domain in paused state                                                                   
>> - export domain disks of interest thru NBD for write                                             
>> - restore them                                                                                   
>> - remove disks from export                                                                       
>> - resume or destroy domain                                                                       
>>                                                                                                  
>> Now let's write down API in more details. There are minor changes in comparison                  
>> with previous version [1].                                                                       
>>                                                                                                  
>>                          
>> *Temporary snapshot API*
>>
>> In previous version it is called 'Fleece API' after qemu terms and I'll still
>> use BlockSnapshot prefix for commands as in previous RFC instead of
>> TmpSnapshots which I inclined more now.
>>
>> virDomainBlockSnapshotPtr
>> virDomainBlockSnapshotCreateXML(virDomainPtr domain,
>>                                 const char *xmlDesc,
>>                                 unsigned int flags);
> 
> Just to make sure, we have the existing API of:
> 
> virDomainSnapshotPtr virDomainSnapshotCreateXML(virDomainPtr domain,
>                                                 const char *xmlDesc,
>                                                 unsigned int flags);
> 
> So you are creating a new object (virDomainBlockSnapshotPtr) rather than
> reusing the existing VirDomainSnapshotPtr, and although the two commands
> are similar, we get to design a new XML schema from scratch rather than
> trying to overload yet even more functionality onto the existing API.
> 
> Should we also have:
> 
> const char *virDomainBlockSnapshotGetName(virDomainBlockSnapshotPtr
> snapshot);
> virDomainPtr virDomainBlockSnapshotGetDomain(virDomainBlockSnapshotPtr
> snapshot);
> virConnectPtr virDomainBlockSnapshotGetConnect(virDomainBlockSnapshotPtr
> snapshot);
> 
> for symmetry with existing snapshot API?
> 
>>
>> virDomainBlockSnapshotDelete(virDomainBlockSnapshotPtr snapshot,
>>                              unsigned int flags);
>>
>> virDomainBlockSnapshotList(virDomainPtr domain,
>>                            virDomainBlockSnapshotPtr **snaps,
>>                            unsigned int flags);
> 
> I'm guessing this is the counterpart to virDomainListAllSnapshots() (the
> modern listing interface), and that we probably don't want counterparts
> for virDomainSnapshotNum/virDomainSnapshotListNames (the older listing
> interface, which was inherently racy as the list could change in length
> between the two calls).
> 
>>
>> virDomainBlockSnapshotGetXMLDesc(virDomainBlockSnapshotPtr snapshot,
>>                                  unsigned int flags);
>>
>> virDomainBlockSnapshotPtr
>> virDomainBlockSnapshotLookupByName(virDomainPtr domain,
>>                                    const char *name,
>>                                    unsigned int flags);
> 
> Also, the virDomainSnapshotPtr had a number of API to track a tree-like
> hierarchy between snapshots (that is, you very much want to know if
> snapshot B is a child of snapshot A), while it looks like your new
> virDomainBlockSnapshotPtrs are completely independent (no relationships
> between the snapshots, each can be independently created or torn down,
> without having to rewrite a relationship tree between them, and there is
> no need for counterparts to things like virDomainSnapshotNumChildren).
> Okay, I think that makes sense, and is a good reason for introducing a
> new object type rather than shoe-horning this into the existing API.
> 
>>
>> Here is an example of snapshot xml description:
>>
>> <domainblocksnapshot>
>>     <name>d068765e-8b50-4d74-9b72-1e55c663cbf8</name>
>>     <disk name='sda' type="file">
>>         <fleece file="/tmp/snapshot-a.hdd"/>
>>     </disk>
>>     <disk name='sdb' type="file">
>>         <fleece file="/tmp/snapshot-b.hdd"/>
>>     </disk>
>> </domainblocksnapshot>
>>
>> Temporary snapshots are indepentent thus they are not organized in tree structure
>> as usual snapshots, so the 'list snapshots' and 'lookup' function will suffice.
> 
> So in the XML, the <fleece> element describes the destination file (back
> to my earlier diagram, it would be the file that is created and will
> hold content 'A' when the main active image is changed to hold content
> 'B' after the snapshot was created)?
> 
>>
>> Qemu can track what disk's blocks are changed from snapshotted state so on next
>> backup client can backup only changed blocks. virDomainBlockSnapshotCreateXML
>> accepts VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option
>> for snapshot which means to track changes from this particular snapshot. I used
>> checkpoint term and not [dirty] bitmap because many qemu dirty bitmaps are used
>> to provide changed blocks from the given checkpoint to current snapshot in
>> current implementation (see *Implemenation* section for more details). Also
>> bitmap keeps block changes and thus itself changes in time and checkpoint is
>> a more statical terms means you can query changes from that moment in time.
>>
>> Checkpoints are visible in active domain xml:
>>
>>     <disk type='file' device='disk'>
>>       ..
>>       <target dev='sda' bus='scsi'/>
>>       <alias name='scsi0-0-0-0'/>
>>       <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178">
>>       <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c">
>>       ..
>>     </disk>
>>

It makes sense to avoid the bitmap name in libvirt, but do these indeed
correlate 1:1 with bitmaps?

I assume each bitmap will have name=%%UUID%% ?

>> Every checkpoint requires qemu dirty bitmap which eats 16MiB of RAM with default
>> dirty block size of 64KiB for 1TiB disk and the same amount of disk space is used.
>> So client need to manage checkpoints and delete unused. Thus next API function:
>>
>>
>> int
>> virDomainBlockCheckpointRemove(virDomainPtr domain,
>>                                const char *name,
>>                                unsigned int flags);
>>
> 
> I'm trying to figure out how BlockCheckpoint and BlockSnapshots relate.
> Maybe it will be more clear when I read the implementation section
> below.  Is the idea that I can't create a BlockSnapshot without first
> having a checkpoint available?  If so, where does that fit in the
> <domainblocksnapshot> XML?
> 
>>
>> *Block export API*
>>
>> I guess it is natural to treat qemu NBD server as a domain device. So
>> we can use virDomainAttachDeviceFlags/virDomainDetachDeviceFlags API to start/stop NBD
>> server and virDomainUpdateDeviceFlags to add/delete disks to be exported.
> 
> This feels a bit awkward - up to now, attaching a device is something
> visible to the guest, but you are trying to reuse the interface to
> attach something tracked by the domain, but which has no impact to the
> guest.  That is, the guest has no clue whether a block export exists
> pointing to a particular checkpoint, nor does it care.
> 
>> While I'm have no doubts about start/stop operations using virDomainUpdateDeviceFlags 
>> looks a bit inconvinient so I decided to add a pair of API functions just
>> to add/delete disks to be exported:
>>
>> int
>> virDomainBlockExportStart(virDomainPtr domain,
>>                           const char *xmlDesc,
>>                           unsigned int flags);
>>
>> int
>> virDomainBlockExportStop(virDomainPtr domain,
>>                          const char *xmlDesc,
>>                          unsigned int flags);
>>
>> I guess more appropriate names are virDomainBlockExportAdd and
>> virDomainBlockExportRemove but as I already have a patch series implementing pull
>> backups with these names I would like to keep these names now.
> 
> What does the XML look like in these calls?
> 
>>
>> These names also reflect that in the implementation I decided to start/stop NBD
>> server in a lazy manner. While it is a bit innovative for libvirt API I guess
>> it is convinient because to refer NBD server to add/remove disks to we need to
>> identify it thru it's parameters like type, address etc until we introduce some
>> device id (which does not looks consistent with current libvirt design).
> 
> This just reinforces my thoughts above - is the reason it doesn't make
> sense to assign a device id to the export due to the fact that the
> export is NOT guest-visible?  Does it even belong under the
> "domain/devices/" xpath of the domain XML, or should it be a new sibling
> of <devices> with an xpath of "domain/blockexports/"?
> 
>> So it
>> looks like we have all parameters to start/stop server in the frame of these
>> calls so why have extra API calls just to start/stop server manually. If we
>> later need to have NBD server without disks we can perfectly support
>> virDomainAttachDeviceFlags/virDomainDetachDeviceFlags.
>>
>> Here is example of xml to add/remove disks (specifying checkpoint
>> attribute is not needed for removing disks of course):
>>
>> <domainblockexport type="nbd">
>>     <address type="ip" host="0.0.0.0" port="8000"/>
>>     <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
>>                      checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
>>     <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
>>                      checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
>> </domainblockexport>
> 
> So this is the XML you pass to virDomainBlockExportStart, with the goal
> of telling qemu to start or stop an NBD export on the backing chain
> associated with disk "sda", where the export is serving up data tied to
> checkpoint "d068765e-8b50-4d74-9b72-1e55c663cbf8", and which will be
> associated with the destination snapshot file described by the
> <domainblocksnapshot> named "0044757e-1a2d-4c2c-b92f-bb403309bb17"?
> 
> Why is it named <domainblockexport> here, but...
> 
>>
>> And this is how this NBD server will be exposed in domain xml:
>>
>> <devices>
>>     ...
>>     <blockexport type="nbd">
> 
> <blockexport> here?
> 
>>         <address type="ip" host="0.0.0.0" port="8000"/>
>>         <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
>>                          checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
>>                          exportname="sda-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
> 
> The exportname property is new here compared to the earlier listing - is
> that something that libvirt generates, or that the user chooses?
> 
>>         <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
>>                          checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8
>>                          exportname="sdb-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
>>     </blockexport>
>>     ...
>> </devices>
>>
>> *Implementation details from qemu-libvirt interactions POV*
>>
>> 1. Temporary snapshot
>>
>> - create snapshot
> 
> Which libvirt API triggers this action? virDomainBlockSnapshotCreateXML?
> 
>>     - add fleece blockdev backed by disk of interest
>>     - start fleece blockjob which will pop out data to be overwritten to fleece blockdev
>>
>>     {
>>         "execute": "blockdev-add"
>>         "arguments": {
>>             "backing": "drive-scsi0-0-0-0",
>>             "driver": "qcow2",
>>             "file": {
>>                 "driver": "file",
>>                 "filename": "/tmp/snapshot-a.hdd"
> 
> Is qemu creating this file, or is libvirt pre-creating it and qemu just
> opening it?  I guess this is a case where libvirt would want to
> pre-create an empty qcow2 file (either by qemu-img, or by the new
> x-blockdev-create in qemu 2.12)?  Okay, it looks like this file is what
> you listed in the XML for <domainblocksnapshot>, so libvirt is creating
> it.  Does the new file have a backing image, or does it read as
> completely zeroes?
> 

In fleecing workflow, the image can either be created by QEMU or
pre-created by libvirt, but in keeping with best practices libvirt
should probably create it.

It should be an empty qcow2 backed by the current node of interest.

>>             },
>>             "node-name": "snapshot-scsi0-0-0-0"
>>         },
>>     }
> 
> No trailing comma in JSON {}, but it's not too hard to figure out what
> you mean.
> 
>>     {
>>         "execute": "transaction"
>>         "arguments": {
>>             "actions": [
>>                 {
>>                     "type": "blockdev-backup"
>>                     "data": {
>>                         "device": "drive-scsi0-0-0-0",
>>                         "target": "snapshot-scsi0-0-0-0"
>>                         "sync": "none",
>>                     },
>>                 }
>>             ]
> 
> You showed a transaction with only one element; but presumably we are
> using a transaction because if we want to create a point in time for
> multiple disks at once, we need two separate blockdev-backup actions
> joined in the same transaction to cover the two disks.  So this command
> is telling qemu to start using a brand-new qcow2 file as its local
> storage for tracking that a snapshot is being taken, and that point in
> time is the checkpoint?
> 
> Am I correct that you would then tell qemu to export an NBD view of this
> qcow2 snapshot which a third-party client can connect to and use
> NBD_CMD_BLOCK_STATUS to learn which portions of the file contain data
> (that is, which clusters has qemu copied into the backup, because the
> active image has changed them since the checkpoint, but anything not
> dirty in this file is still identical to the last backup?
> 
> Would libvirt ever want to use something other than "sync":"none"?
> 

No; based on how fleecing is implemented in QEMU.

Effectively, "blockdev-backup sync=none" _IS_ the fleecing command for
QEMU, but it requires two steps:

(1) Create a new empty top layer
(2) Sync new writes from the active layer, which is the backing image(!)
for the snapshot.

The empty top layer can be created whenever, but the backup action needs
to happen in a transaction.

In my earlier email I mentioned it'd be nice to have a convenience job
that wrapped up a few steps into one:

A: Create a new top layer
B: Start the fleecing job itself
C: Tied a specified bitmap belonging to the node to the fleecing job
D: Exported the snapshot via NBD

and this would become a proper "fleecing job." It would have to be, of
course, transactionable.

>>         },
>>     }
>>
>> - delete snapshot
>>     - cancel fleece blockjob
>>     - delete fleece blockdev
>>
>>     {
>>         "execute": "block-job-cancel"
>>         "arguments": {
>>             "device": "drive-scsi0-0-0-0"
>>         },
>>     }
>>     {
>>         "execute": "blockdev-del"
>>         "arguments": {
>>             "node-name": "snapshot-scsi0-0-0-0"
>>         },
>>     }
>>
>> 2. Block export
>>
>> - add disks to export
>>     - start NBD server if it is not started
>>     - add disks
>>
>>     {
>>         "execute": "nbd-server-start"
>>         "arguments": {
>>             "addr": {
>>                 "type": "inet"
>>                 "data": {
>>                     "host": "0.0.0.0",
>>                     "port": "49300"
>>                 },
>>             }
>>         },
>>     }
>>     {
>>         "execute": "nbd-server-add"
>>         "arguments": {
>>             "device": "snapshot-scsi0-0-0-0",
>>             "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>             "writable": false
> 
> So this is telling qemu to export the temporary qcow2 image created in
> the point above.  An NBD client would see the export getting
> progressively more blocks with data as the guest continues to write more
> clusters (as qemu has to copy the data from the checkpoint to the
> temporary file before updating the main image with the new data).  If
> the NBD client reads a cluster that has not yet been copied by qemu
> (because the guest has not written to that cluster since the block job
> started), would it see zeroes, or the same data that the guest still sees?
> 

Because the temporary snapshot is backed by the live image, the guest
sees an unchanging set of blocks. In essence, we are utilizing a COW
mechanism to copy the point-in-time data into the snapshot layer while
the backing image remains the live/active layer for the devices
utilizing it.

>>         },
>>     }
>>
>> - remove disks from export
>>     - remove disks
>>     - stop NBD server if there are no disks left
>>
>>     {
>>         "arguments": {
>>             "mode": "hard",
>>             "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8"
>>         },
>>         "execute": "nbd-server-remove"
>>     }
>>     {
>>         "execute": "nbd-server-stop"
>>     }
>>
>> 3. Checkpoints (the most interesting part)
>>

So far so good from the QEMU end AFAICT...

>> First a few facts about qemu dirty bitmaps.
>>
>> Bitmap can be either in active or disable state. In disabled state it does not
>> get changed on guest writes. And oppositely in active state it tracks guest
>> writes. This implementation uses approach with only one active bitmap at
>> a time. This should reduce guest write penalties in the presence of
>> checkpoints. So on first snapshot we create bitmap B_1. Now it tracks changes
>> from the snapshot 1. On second snapshot we create bitmap B_2 and disable bitmap
>> B1 and so on. Now bitmap B1 keep changes from snaphost 1 to snapshot 2, B2
>> - changes from snaphot 2 to snapshot 3 and so on. Last bitmap is active and
>> gets most disk change after latest snapshot.

So you are trying to optimize away write penalties if you have, say, ten
bitmaps representing checkpoints so we don't have to record all new
writes to all ten.

This makes sense, and I would have liked to formalize the concept in
QEMU, but response to that idea was very poor at the time.

Also my design was bad :)

>>
>> Getting changed blocks bitmap from some checkpoint in past till current snapshot
>> is quite simple in this scheme. For example if the last snapshot is 7 then
>> to get changes from snapshot 3 to latest snapshot we need to merge bitmaps B3,
>> B4, B4 and B6. Merge is just logical OR on bitmap bits.
>>
>> Deleting a checkpoint somewhere in the middle of checkpoint sequence requires
>> merging correspondent bitmap to the previous bitmap in this scheme.
>>

Previous, or next?

Say we've got bitmaps (in chronological order from oldest to newest)

A B C D E F G H

and we want to delete bitmap (or "checkpoint") 'C':

A B   D E F G H

the bitmap representing checkpoint 'D' should now contain the bits that
used to be in 'C', right? That way all the checkpoints still represent
their appropriate points in time.

The only problem comes when you delete a checkpoint on the end and the
bits have nowhere to go:

A B C

A B _

In this case you really do lose a checkpoint -- but depending on how we
annotate this, it may or may not be possible to delete the most recent
checkpoint. Let's assume that the currently active bitmap that doesn't
represent *any* point in time yet (because it's still active and
recording new writes) is noted as 'X':

A B C X

If we delete C now, then, that bitmap can get re-merged into the *active
bitmap* X:

A B _ X

>> We use persitent bitmaps in the implementation. This means upon qemu process
>> termination bitmaps are saved in disks images metadata and restored back on
>> qemu process start. This makes checkpoint a persistent property that is we
>> keep them across domain start/stops. Qemu does not try hard to keep bitmaps.
>> If upon save something goes wrong bitmap is dropped. The same is applied to the
>> migration process too. For backup process it is not critical. If we don't
>> discover a checkpoint we always can make a full backup. Also qemu provides no
>> special means to track order of bitmaps. These facts are critical for
>> implementation with one active bitmap at a time. We need right order of bitmaps upon
>> merge - for snapshot N and block changes from snanpshot K, K < N to N we need
>> to merge bitmaps B_{K}, ..., B_{N-1}. Also if one of the bitmaps to be merged
>> is missing we can't calculate desired block changes too.
>>

Right. A missing bitmap anywhere in the sequence invalidates the entire
sequence.

>> So the implementation encode bitmap order in their names. For snapshot A1, bitmap
>> name will be A1, for snapshot A2 bitmap name will be A2^A1 and so on. Using this naming
>> encoding upon domain start we can find out bitmap order and check for missing
>> ones. This complicates a bit bitmap removing though. For example removing
>> a bitmap somewhere in the middle looks like this:
>>
>> - removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1}
>>     - create new bitmap named NAME_{K+1}^NAME_{K-1}      ---. 
>>     - disable new bitmap                                    | This is effectively renaming
>>     - merge bitmap NAME_{K+1}^NAME_{K} to the new bitmap    | of bitmap K+1 to comply the naming scheme
>>     - remove bitmap NAME_{K+1}^NAME_{K}                  ___/
>>     - merge bitmap NAME_{K}^NAME_{K-1} to NAME_{K-1}^NAME_{K-2}
>>     - remove bitmap NAME_{K}^NAME_{K-1}
>>
>> As you can see we need to change name for bitmap K+1 to keep our bitmap
>> naming scheme. This is done creating new K+1 bitmap with appropriate name
>> and copying old K+1 bitmap into new.
>>

That seems... unfortunate. A record could be kept in libvirt instead,
couldn't it?

A : Bitmap A, Time 12:34:56, Child of (None), Parent of B
B : Bitmap B, Time 23:15:46, Child of A, Parent of (None)

I suppose in this case you can't *reconstruct* this information from the
bitmap stored in the qcow2, which necessitates your naming scheme...

...Still, if you forego this requirement, deleting bitmaps in the middle
becomes fairly easy.

>> So while it is possible to have only one active bitmap at a time it costs
>> some exersices at managment layer. To me it looks like qemu itself is a better
>> place to track bitmaps chain order and consistency.
> 

If this is a hard requirement, it's certainly *easier* to track the
relationship in QEMU ...

> Libvirt is already tracking a tree relationship between internal
> snapshots (the virDomainSnapshotCreateXML), because qemu does NOT track
> that (true, internal snapshots don't get as much attention as external
> snapshots) - but the fact remains that qemu is probably not the best
> place to track relationship between multiple persistent bitmaps, any
> more than it tracks relationships between internal snapshots.  So having
> libvirt track relations between persistent bitmaps is just fine.  Do we
> really have to rename bitmaps in the qcow2 file, or can libvirt track it
> all on its own?
> 

This is a way, really, of storing extra metadata by using the bitmap
name as arbitrary data storage.

I'd say either we promote QEMU to understanding checkpoints, or enhance
libvirt to track what it needs independent of QEMU -- but having to
rename bitmaps smells fishy to me.

> Earlier, you said that the new virDomainBlockSnapshotPtr are
> independent, with no relations between them.  But here, you are wanting
> to keep incremental backups related to one another.
> 

I think the *snapshots*, as temporary objects, are independent and don't
carry a relation to each other.

The *checkpoints* here, however, are persistent and interrelated.

>>
>> Now how exporting bitmaps looks like.
>>
>> - add to export disk snapshot N with changes from checkpoint K
>>     - add fleece blockdev to NBD exports
>>     - create new bitmap T
>>     - disable bitmap T
>>     - merge bitmaps K, K+1, .. N-1 into T

I see; so we compute a new slice based on previous bitmaps and backup
arbitrary from that arbitrary slice.

So "T" is a temporary bitmap meant to be discarded at the conclusion of
the operation, making it much more like a consumable object.

>>     - add bitmap to T to nbd export
>>
>> - remove disk snapshot from export
>>     - remove fleece blockdev from NBD exports
>>     - remove bitmap T
>>

Aha.

>> Here is qemu commands examples for operation with checkpoints, I'll make
>> several snapshots with checkpoints for purpuse of better illustration.
>>
>> - create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with checkpoint
>>     - same as without checkpoint but additionally add bitmap on fleece blockjob start
>>
>>     ...
>>     {
>>         "execute": "transaction"
>>         "arguments": {
>>             "actions": [
>>                 {
>>                     "type": "blockdev-backup"
>>                     "data": {
>>                         "device": "drive-scsi0-0-0-0",
>>                         "sync": "none",
>>                         "target": "snapshot-scsi0-0-0-0"
>>                     },
>>                 },
>>                 {
>>                     "type": "block-dirty-bitmap-add"
>>                     "data": {
>>                         "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>                         "node": "drive-scsi0-0-0-0",
>>                         "persistent": true
>>                     },
>>                 }
> 

So a checkpoint creates a reference point, but NOT a backup. You are
manually creating checkpoint instances.

In this case, though, you haven't disabled the previous checkpoint's
bitmap (if any?) atomically with the creation of this one...

> Here, the transaction makes sense; you have to create the persistent
> dirty bitmap to track from the same point in time.  The dirty bitmap is
> tied to the active image, not the backup, so that when you create the
> NEXT incremental backup, you have an accurate record of which sectors
> were touched in snapshot-scsi0-0-0-0 between this transaction and the next.
> 
>>             ]
>>         },
>>     }
>>
>> - delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8
>>     - same as without checkpoints
>>
>> - create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with checkpoint
>>     - same actions as for the first snapshot, but additionally disable the first bitmap
> 
> Again, you're showing the QMP commands that libvirt is issuing; which
> libvirt API calls are driving these actions?
> 
>>
>>     ...
>>     {
>>         "execute": "transaction"
>>         "arguments": {
>>             "actions": [
>>                 {
>>                     "type": "blockdev-backup"
>>                     "data": {
>>                         "device": "drive-scsi0-0-0-0",
>>                         "sync": "none",
>>                         "target": "snapshot-scsi0-0-0-0"
>>                     },
>>                 },
>>                 {
>>                     "type": "x-vz-block-dirty-bitmap-disable"
>>                     "data": {
> 
> Do you have measurements on whether having multiple active bitmaps hurts
> performance?  I'm not yet sure that managing a chain of disabled bitmaps
> (and merging them as needed for restores) is more or less efficient than
> managing multiple bitmaps all the time.  On the other hand, you do have
> a point that restore is a less frequent operation than backup, so making
> backup as lean as possible and putting more work on restore is a
> reasonable tradeoff, even if it adds complexity to the management for
> doing restores.
> 

Depending on the number of checkpoints intended to be kept... we
certainly make no real promises on the efficiency of marking so many.
It's at *least* a linear increase with each checkpoint...

>>                         "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>                         "node": "drive-scsi0-0-0-0"
>>                     },
>>                 },
>>                 {
>>                     "type": "block-dirty-bitmap-add"
>>                     "data": {
>>                         "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>                         "node": "drive-scsi0-0-0-0",
>>                         "persistent": true
>>                     },
>>                 }
>>             ]
>>         },
>>     }
>>

Oh, I see, you handle the "disable old" case here.

>> - delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17
>> - create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint
>>
>> - add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export and bitmap with
>>   changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8
>>     - same as add export without checkpoint, but aditionally
>>         - form result bitmap
>>         - add bitmap to NBD export
>>
>>     ...
>>     {
>>         "execute": "transaction"
>>         "arguments": {
>>             "actions": [
>>                 {
>>                     "type": "block-dirty-bitmap-add"
>>                     "data": {
>>                         "node": "drive-scsi0-0-0-0",
>>                         "name": "libvirt-__export_temporary__",
>>                         "persistent": false
>>                     },
>>                 },
>>                 {
>>                     "type": "x-vz-block-dirty-bitmap-disable"
>>                     "data": {
>>                         "node": "drive-scsi0-0-0-0"
>>                         "name": "libvirt-__export_temporary__",
>>                     },
>>                 },
>>                 {
>>                     "type": "x-vz-block-dirty-bitmap-merge"
>>                     "data": {
>>                         "node": "drive-scsi0-0-0-0",
>>                         "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8"
>>                         "dst_name": "libvirt-__export_temporary__",
>>                     },
>>                 },
>>                 {
>>                     "type": "x-vz-block-dirty-bitmap-merge"
>>                     "data": {
>>                         "node": "drive-scsi0-0-0-0",
>>                         "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf#
>>                         "dst_name": "libvirt-__export_temporary__",
>>                     },
>>                 }
>>             ]
>>         },
>>     }

OK, so in this transaction you add a new temporary bitmap for export,
and merge the contents of two bitmaps into it.

However, it doesn't look like you created a new checkpoint and managed
that handoff here, did you?

>>     {
>>         "execute": "x-vz-nbd-server-add-bitmap"
>>         "arguments": {
>>             "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b"
>>             "bitmap": "libvirt-__export_temporary__",
>>             "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>         },

And then here, once the bitmap and the data is already frozen, it's
actually alright if we add the export at a later point in time.

> 
> Adding a bitmap to a server is would would advertise to the NBD client
> that it can query the
> "qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace
> during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
> 

Don't know much about this, I stopped paying attention to the BLOCK
STATUS patches. Is the NBD spec the best way to find out the current
state right now?

(Is there a less technical, briefer overview somewhere, perhaps from a
commit message or a cover letter?)

>>     }
>>
>> -  remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export
>>     - same as without checkpoint but additionally remove temporary bitmap
>>     
>>     ...
>>     {
>>         "arguments": {
>>             "name": "libvirt-__export_temporary__",
>>             "node": "drive-scsi0-0-0-0"
>>         },
>>         "execute": "block-dirty-bitmap-remove"
>>     }
>>

OK, this just deletes the checkpoint. I guess we delete the node and
stop the NBD server too, right?

>> - delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17
>>     (similar operation is described in the section about naming scheme for bitmaps,
>>      with difference that K+1 is N here and thus new bitmap should not be disabled)
> 
> A suggestion on the examples - while UUIDs are nice and handy for
> management tools, they are a pain to type and for humans to quickly
> read.  Is there any way we can document a sample transaction stream with
> all the actors involved (someone issues a libvirt API call XYZ, libvirt
> in turn issues QMP command ABC), and using shorter names that are easier
> to read as humans?
> 

Yeah, A-B-C-D terminology would be nice for the examples. It's fine if
the actual implementation uses UUIDs.

>>
>>     {
>>         "arguments": {
>>             "actions": [
>>                 {
>>                     "type": "block-dirty-bitmap-add"
>>                     "data": {
>>                         "node": "drive-scsi0-0-0-0",
>>                         "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>                         "persistent": true
>>                     },
>>                 },
>>                 {
>>                     "type": "x-vz-block-dirty-bitmap-merge"
>>                     "data": {
>>                         "node": "drive-scsi0-0-0-0",
>>                         "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf#
>>                         "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>                     },
>>                 },
>>                 {
>>                     "type": "x-vz-block-dirty-bitmap-merge"
>>                     "data": {
>>                         "node": "drive-scsi0-0-0-0",
>>                         "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb1#
>>                         "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf#
>>                     },
>>                 },
>>             ]
>>         },
>>         "execute": "transaction"
>>     }
>>     {
>>         "execute": "x-vz-block-dirty-bitmap-remove"
>>         "arguments": {
>>             "node": "drive-scsi0-0-0-0"
>>             "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17",
>>         },
>>     },
>>     {
>>         "execute": "x-vz-block-dirty-bitmap-remove"
>>         "arguments": {
>>             "node": "drive-scsi0-0-0-0"
>>             "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>         },
>>     }
>>
>> Here is a list of bitmap commands used in implementation but not yet in upstream (AFAIK).
>>
>> x-vz-block-dirty-bitmap-remove

We already have this, right? It doesn't even need to be transactionable.

>> x-vz-block-dirty-bitmap-merge

You need this...

>> x-vz-block-dirty-bitmap-disable

And this we had originally but since removed, but can be re-added trivially.

>> x-vz-block-dirty-bitmap-enable (not in the examples; used when removing most recent checkpoint)
>> x-vz-nbd-server-add-bitmap
>>

Do my comments make sense? Am I understanding you right so far? I'll try
to offer a competing writeup to make sure we're on the same page with
your proposed design before I waste any time trying to critique it -- in
case I'm misunderstanding you.

Thank you for leading the charge and proposing new APIs for this
feature. It will be very nice to expose the incremental backup
functionality we've been working on in QEMU to users of libvirt.

--js

>> *Restore operation nuances*
>>
>> As it was written above to restore a domain one needs to start it in paused
>> state, export domain's disks and write them from backup. However qemu currently does
>> not let export disks for write even for a domain that never starts guests CPU.
>> We have an experimental qemu command option -x-vz-nbd-restore (passed together
>> with -incoming option) to fix it.
> 
> Why can't restore be done while the guest is offline?  (Oh right, we
> still haven't added decent qemu-img support for bitmap manipulation, so
> we need a qemu process around for any bitmap changes).
> 

I'm working on this right now, actually!

I'm working on JSON format output for bitmap querying, and simple
clear/delete commands. I hope to send this out very soon.

> As I understand it, the point of bitmaps and snapshots is to create an
> NBD server that a third-party can use to read just the dirty portions of
> a disk in relation to a known checkpoint, to save off data in whatever
> form it wants; so you are right that the third party then needs a way to
> rewrite data from whatever internal form it stored it in back to the
> view that qemu can consume when rolling back to a given backup, prior to
> starting the guest on the restored data.  Do you need additional libvirt
> APIs exposed for this, or do the proposed APIs for adding snapshots
> cover everything already with just an additional flag parameter that
> says whether the <domainblocksnapshot> is readonly (the third-party is
> using it for collecting the incremental backup data) or writable (the
> third-party is actively writing its backup into the file, and when it is
> done, then perform a block-commit to merge that data back onto the main
> qcow2 file)?
> 

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list