Hi, all.

This is another RFC on the pull backup API. This API provides means to read
domain disks in a snapshotted state so that the client can back them up, as
well as means to write domain disks to revert them to a backed-up state. The
previous version of the RFC is [1]. I'll also describe the API implementation
details to shed light on the usage of misc qemu dirty bitmap commands.

This API does not use existing disk snapshots. Instead it introduces snapshots
provided by qemu's blockdev-backup command. The reason is that we need the
snapshotted disk state only temporarily, for the duration of the backup
operation, and the newly introduced snapshots can be easily discarded at the
end of the operation without a block commit operation. Technically the
difference is as follows. With a usual snapshot we create a new image backed
by the original, and all new data goes to the new image; thus the original
image stays in a snapshotted state. With temporary snapshots we also create a
new image backed by the original, but all new data still goes to the original
image; before new data is written, the old data to be overwritten is popped
out to the new image, so we get the snapshotted state through the new image.

Disk snapshots, as well as the disks themselves, are available for read/write
through the qemu NBD server.

Here are the typical actions on domain backup:

- create a temporary snapshot of the domain disks of interest
- export the snapshots through NBD
- back them up
- remove the disks from export
- delete the temporary snapshot

and the typical actions on domain restore:

- start the domain in paused state
- export the domain disks of interest through NBD for write
- restore them
- remove the disks from export
- resume or destroy the domain

Now let's write down the API in more detail. There are minor changes in
comparison with the previous version [1].

*Temporary snapshot API*

In the previous version it was called the 'Fleece API' after the qemu term,
and I'll still use the BlockSnapshot prefix for commands as in the previous
RFC instead of TmpSnapshots, which I'm more inclined to now.
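The copy-before-write behaviour described above can be illustrated with a toy
model (plain Python; the class and names are mine for illustration, not
qemu's):

```python
# Toy model of qemu's "fleece" (copy-before-write) snapshot semantics.
# The guest keeps writing to the original image; before each overwrite,
# the old block is copied out to the fleece image, so reading the fleece
# view (fleece block if present, else original) yields the snapshotted state.

class FleeceSnapshot:
    def __init__(self, original):
        self.original = original   # live image, guest writes land here
        self.fleece = {}           # sparse overlay: block index -> old data

    def guest_write(self, idx, data):
        if idx not in self.fleece:   # pop out old data only once per block
            self.fleece[idx] = self.original[idx]
        self.original[idx] = data

    def snapshot_read(self, idx):
        # snapshotted state: copied-out block if any, else the original
        return self.fleece.get(idx, self.original[idx])

disk = ["a", "b", "c"]
snap = FleeceSnapshot(disk)
snap.guest_write(1, "B")
print(disk)                                        # ['a', 'B', 'c']
print([snap.snapshot_read(i) for i in range(3)])   # ['a', 'b', 'c']
```

Note that, unlike a usual external snapshot, discarding this state is just
dropping the overlay; no block commit is needed.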
virDomainBlockSnapshotPtr
virDomainBlockSnapshotCreateXML(virDomainPtr domain,
                                const char *xmlDesc,
                                unsigned int flags);

int
virDomainBlockSnapshotDelete(virDomainBlockSnapshotPtr snapshot,
                             unsigned int flags);

int
virDomainBlockSnapshotList(virDomainPtr domain,
                           virDomainBlockSnapshotPtr **snaps,
                           unsigned int flags);

char *
virDomainBlockSnapshotGetXMLDesc(virDomainBlockSnapshotPtr snapshot,
                                 unsigned int flags);

virDomainBlockSnapshotPtr
virDomainBlockSnapshotLookupByName(virDomainPtr domain,
                                   const char *name,
                                   unsigned int flags);

Here is an example of a snapshot xml description:

<domainblocksnapshot>
    <name>d068765e-8b50-4d74-9b72-1e55c663cbf8</name>
    <disk name='sda' type="file">
        <fleece file="/tmp/snapshot-a.hdd"/>
    </disk>
    <disk name='sdb' type="file">
        <fleece file="/tmp/snapshot-b.hdd"/>
    </disk>
</domainblocksnapshot>

Temporary snapshots are independent, thus they are not organized in a tree
structure as usual snapshots are, so the 'list snapshots' and 'lookup'
functions will suffice.

Qemu can track which disk blocks have changed from the snapshotted state, so
on the next backup the client can back up only the changed blocks.
virDomainBlockSnapshotCreateXML accepts the
VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option on for a
snapshot, which means tracking changes from this particular snapshot. I used
the term checkpoint and not [dirty] bitmap because in the current
implementation many qemu dirty bitmaps are used to provide the changed blocks
from a given checkpoint to the current snapshot (see the *Implementation*
section for more details). Also, a bitmap keeps block changes and thus itself
changes in time, while a checkpoint is a more static term meaning you can
query changes from that moment in time.

Checkpoints are visible in the active domain xml:

<disk type='file' device='disk'>
    ..
    <target dev='sda' bus='scsi'/>
    <alias name='scsi0-0-0-0'/>
    <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178"/>
    <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c"/>
    ..
</disk>

Every checkpoint requires a qemu dirty bitmap, which eats 16MiB of RAM with
the default dirty block size of 64KiB for a 1TiB disk, and the same amount of
disk space is used. So the client needs to manage checkpoints and delete
unused ones. Thus the next API function:

int
virDomainBlockCheckpointRemove(virDomainPtr domain,
                               const char *name,
                               unsigned int flags);

*Block export API*

I guess it is natural to treat the qemu NBD server as a domain device. So we
can use virDomainAttachDeviceFlags/virDomainDetachDeviceFlags to start/stop
the NBD server and virDomainUpdateDeviceFlags to add/delete disks to be
exported. While I have no doubts about the start/stop operations, using
virDomainUpdateDeviceFlags looks a bit inconvenient, so I decided to add a
pair of API functions just to add/delete disks to be exported:

int
virDomainBlockExportStart(virDomainPtr domain,
                          const char *xmlDesc,
                          unsigned int flags);

int
virDomainBlockExportStop(virDomainPtr domain,
                         const char *xmlDesc,
                         unsigned int flags);

I guess more appropriate names are virDomainBlockExportAdd and
virDomainBlockExportRemove, but as I already have a patch series implementing
pull backups with these names I would like to keep them for now. These names
also reflect that in the implementation I decided to start/stop the NBD
server in a lazy manner. While it is a bit innovative for the libvirt API, I
guess it is convenient: to refer to the NBD server when adding/removing
disks, we would need to identify it through its parameters like type, address
etc., until we introduce some device id (which does not look consistent with
the current libvirt design). So we have all the parameters needed to
start/stop the server within these calls anyway; why have extra API calls
just to start/stop the server manually? If we later need to have an NBD
server without disks, we can perfectly well support
virDomainAttachDeviceFlags/virDomainDetachDeviceFlags.
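A client would typically generate the xml passed to these calls. Here is a
sketch using Python's stdlib ElementTree; the element and attribute names
follow this RFC's proposal, and the helper name is mine:

```python
# Sketch: build the <domainblockexport> xml for virDomainBlockExportStart.
# Element/attribute names are the ones proposed in this RFC; the function
# itself is illustrative, not part of any existing API.
import xml.etree.ElementTree as ET

def build_block_export_xml(host, port, disks, snapshot, checkpoint=None):
    root = ET.Element("domainblockexport", type="nbd")
    ET.SubElement(root, "address", type="ip", host=host, port=str(port))
    for name in disks:
        attrs = {"name": name, "snapshot": snapshot}
        if checkpoint:   # the checkpoint attribute is omitted when removing disks
            attrs["checkpoint"] = checkpoint
        ET.SubElement(root, "disk", attrs)
    return ET.tostring(root, encoding="unicode")

xml = build_block_export_xml("0.0.0.0", 8000, ["sda", "sdb"],
                             "0044757e-1a2d-4c2c-b92f-bb403309bb17",
                             "d068765e-8b50-4d74-9b72-1e55c663cbf8")
print(xml)
```

The same helper without the checkpoint argument would produce the xml for
virDomainBlockExportStop.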
Here is an example of the xml to add/remove disks (specifying the checkpoint
attribute is not needed when removing disks, of course):

<domainblockexport type="nbd">
    <address type="ip" host="0.0.0.0" port="8000"/>
    <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
          checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
    <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
          checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
</domainblockexport>

And this is how this NBD server will be exposed in the domain xml:

<devices>
    ...
    <blockexport type="nbd">
        <address type="ip" host="0.0.0.0" port="8000"/>
        <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
              checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
              exportname="sda-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
        <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
              checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
              exportname="sdb-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
    </blockexport>
    ...
</devices>

*Implementation details from qemu-libvirt interactions POV*

1. Temporary snapshot

- create snapshot
    - add a fleece blockdev backed by the disk of interest
    - start a fleece blockjob which will pop out data to be overwritten to
      the fleece blockdev

{
    "execute": "blockdev-add",
    "arguments": {
        "node-name": "snapshot-scsi0-0-0-0",
        "driver": "qcow2",
        "file": {
            "driver": "file",
            "filename": "/tmp/snapshot-a.hdd"
        },
        "backing": "drive-scsi0-0-0-0"
    }
}

{
    "execute": "transaction",
    "arguments": {
        "actions": [
            {
                "type": "blockdev-backup",
                "data": {
                    "device": "drive-scsi0-0-0-0",
                    "target": "snapshot-scsi0-0-0-0",
                    "sync": "none"
                }
            }
        ]
    }
}

- delete snapshot
    - cancel fleece blockjob
    - delete fleece blockdev

{
    "execute": "block-job-cancel",
    "arguments": {
        "device": "drive-scsi0-0-0-0"
    }
}

{
    "execute": "blockdev-del",
    "arguments": {
        "node-name": "snapshot-scsi0-0-0-0"
    }
}

2. Block export

- add disks to export
    - start the NBD server if it is not started
    - add disks

{
    "execute": "nbd-server-start",
    "arguments": {
        "addr": {
            "type": "inet",
            "data": {
                "host": "0.0.0.0",
                "port": "49300"
            }
        }
    }
}

{
    "execute": "nbd-server-add",
    "arguments": {
        "device": "snapshot-scsi0-0-0-0",
        "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8",
        "writable": false
    }
}

- remove disks from export
    - remove disks
    - stop the NBD server if there are no disks left

{
    "execute": "nbd-server-remove",
    "arguments": {
        "mode": "hard",
        "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8"
    }
}

{
    "execute": "nbd-server-stop"
}

3. Checkpoints (the most interesting part)

First, a few facts about qemu dirty bitmaps.

A bitmap can be in either an active or a disabled state. In the disabled
state it does not get changed on guest writes; conversely, in the active
state it tracks guest writes. This implementation uses an approach with only
one active bitmap at a time. This should reduce the guest write penalty in
the presence of checkpoints.

So on the first snapshot we create bitmap B1. Now it tracks changes from
snapshot 1. On the second snapshot we create bitmap B2 and disable bitmap B1,
and so on. Now bitmap B1 keeps the changes from snapshot 1 to snapshot 2, B2
the changes from snapshot 2 to snapshot 3, and so on. The last bitmap is
active and accumulates the disk changes after the latest snapshot.

Getting the changed-blocks bitmap from some checkpoint in the past up to the
current snapshot is quite simple in this scheme. For example, if the last
snapshot is 7, then to get the changes from snapshot 3 to the latest snapshot
we need to merge bitmaps B3, B4, B5 and B6. Merge is just a logical OR on the
bitmap bits.

Deleting a checkpoint somewhere in the middle of the checkpoint sequence
requires merging the correspondent bitmap into the previous bitmap in this
scheme.

We use persistent bitmaps in the implementation. This means that upon qemu
process termination the bitmaps are saved in the disk images' metadata and
restored back on qemu process start.
This makes a checkpoint a persistent property, that is, we keep checkpoints
across domain starts/stops. Qemu does not try hard to keep bitmaps, though:
if something goes wrong upon save, the bitmap is dropped. The same applies to
the migration process. For the backup process this is not critical: if we
don't discover a checkpoint, we can always make a full backup. Also, qemu
provides no special means to track the order of bitmaps. These facts are
critical for an implementation with one active bitmap at a time. We need the
right order of bitmaps upon merge - for snapshot N and block changes from
snapshot K, K < N, to N we need to merge bitmaps B_{K}, ..., B_{N-1}. Also,
if one of the bitmaps to be merged is missing, we can't calculate the desired
block changes either.

So the implementation encodes the bitmap order in their names. For snapshot
A1 the bitmap name will be A1, for snapshot A2 the bitmap name will be A2^A1,
and so on. Using this name encoding, upon domain start we can find out the
bitmap order and check for missing ones. This complicates bitmap removal a
bit, though. For example, removing a bitmap somewhere in the middle looks
like this:

- removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1}):
    - create a new bitmap named NAME_{K+1}^NAME_{K-1}       \
    - disable the new bitmap                                 | this effectively renames
    - merge bitmap NAME_{K+1}^NAME_{K} into the new bitmap   | bitmap K+1 to comply with
    - remove bitmap NAME_{K+1}^NAME_{K}                     /  the naming scheme
    - merge bitmap NAME_{K}^NAME_{K-1} into NAME_{K-1}^NAME_{K-2}
    - remove bitmap NAME_{K}^NAME_{K-1}

As you can see, we need to change the name of bitmap K+1 to keep our bitmap
naming scheme. This is done by creating a new K+1 bitmap with the appropriate
name and copying the old K+1 bitmap into the new one. So while it is possible
to have only one active bitmap at a time, it costs some exercises at the
management layer. To me it looks like qemu itself is a better place to track
the bitmap chain order and consistency.

Now, here is how exporting bitmaps looks.
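Before that, the naming-scheme bookkeeping described above can be sketched in
code; this is a toy model of the ordering rules only (function names are
mine, not libvirt's):

```python
# Toy model of the '^'-encoded bitmap naming scheme described above.
# First checkpoint -> "A1", next -> "A2^A1", etc.  From an unordered set of
# names we recover the chain order and pick the bitmaps whose logical OR
# gives "changes since checkpoint K".

def chain_from_names(names):
    """Recover checkpoint order from '^'-encoded bitmap names."""
    succ = {}    # predecessor checkpoint -> bitmap name of its successor
    root = None
    for name in names:
        _, sep, prev = name.partition("^")
        if sep:
            succ[prev] = name
        else:
            root = name   # the only name without '^' starts the chain
    if root is None:
        raise ValueError("chain start missing")
    chain = [root]
    while chain[-1].partition("^")[0] in succ:
        chain.append(succ[chain[-1].partition("^")[0]])
    return chain

def bitmaps_to_merge(names, since):
    """Bitmaps B_{K}..B_{N-1} (plus the active one) for changes since `since`."""
    chain = chain_from_names(names)
    heads = [n.partition("^")[0] for n in chain]
    if since not in heads:
        raise ValueError("checkpoint missing, fall back to full backup")
    return chain[heads.index(since):]

names = {"A1", "A2^A1", "A3^A2", "A4^A3"}
print(chain_from_names(names))        # ['A1', 'A2^A1', 'A3^A2', 'A4^A3']
print(bitmaps_to_merge(names, "A2"))  # ['A2^A1', 'A3^A2', 'A4^A3']
```

A missing name breaks the chain walk, which matches the point above: if any
bitmap in the range is lost, the client must fall back to a full backup.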
- add disk snapshot N to export with changes from checkpoint K
    - add fleece blockdev to NBD exports
    - create new bitmap T
    - disable bitmap T
    - merge bitmaps K, K+1, .. N-1 into T
    - add bitmap T to the NBD export
- remove disk snapshot from export
    - remove fleece blockdev from NBD exports
    - remove bitmap T

Here are qemu command examples for operations with checkpoints. I'll make
several snapshots with checkpoints for the purpose of better illustration.

- create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with checkpoint
    - same as without checkpoint, but additionally add a bitmap on fleece
      blockjob start

...
{
    "execute": "transaction",
    "arguments": {
        "actions": [
            {
                "type": "blockdev-backup",
                "data": {
                    "device": "drive-scsi0-0-0-0",
                    "sync": "none",
                    "target": "snapshot-scsi0-0-0-0"
                }
            },
            {
                "type": "block-dirty-bitmap-add",
                "data": {
                    "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "node": "drive-scsi0-0-0-0",
                    "persistent": true
                }
            }
        ]
    }
}

- delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as without checkpoints

- create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with checkpoint
    - same actions as for the first snapshot, but additionally disable the
      first bitmap

...
{
    "execute": "transaction",
    "arguments": {
        "actions": [
            {
                "type": "blockdev-backup",
                "data": {
                    "device": "drive-scsi0-0-0-0",
                    "sync": "none",
                    "target": "snapshot-scsi0-0-0-0"
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-disable",
                "data": {
                    "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "node": "drive-scsi0-0-0-0"
                }
            },
            {
                "type": "block-dirty-bitmap-add",
                "data": {
                    "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "node": "drive-scsi0-0-0-0",
                    "persistent": true
                }
            }
        ]
    }
}

- delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17
- create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint

- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export, with a
  bitmap of changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as adding an export without checkpoint, but additionally
        - form the result bitmap
        - add the bitmap to the NBD export

...
{
    "execute": "transaction",
    "arguments": {
        "actions": [
            {
                "type": "block-dirty-bitmap-add",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "name": "libvirt-__export_temporary__",
                    "persistent": false
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-disable",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "name": "libvirt-__export_temporary__"
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-merge",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "dst_name": "libvirt-__export_temporary__"
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-merge",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "dst_name": "libvirt-__export_temporary__"
                }
            }
        ]
    }
}

{
    "execute": "x-vz-nbd-server-add-bitmap",
    "arguments": {
        "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b",
        "bitmap": "libvirt-__export_temporary__",
        "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8"
    }
}

- remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export
    - same as without checkpoint, but additionally remove the temporary
      bitmap
...
{
    "execute": "block-dirty-bitmap-remove",
    "arguments": {
        "node": "drive-scsi0-0-0-0",
        "name": "libvirt-__export_temporary__"
    }
}

- delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17
  (a similar operation is described in the section about the naming scheme
  for bitmaps, with the difference that K+1 is N here and thus the new bitmap
  should not be disabled)

{
    "execute": "transaction",
    "arguments": {
        "actions": [
            {
                "type": "block-dirty-bitmap-add",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "persistent": true
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-merge",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8"
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-merge",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17",
                    "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8"
                }
            }
        ]
    }
}

{
    "execute": "x-vz-block-dirty-bitmap-remove",
    "arguments": {
        "node": "drive-scsi0-0-0-0",
        "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17"
    }
}

{
    "execute": "x-vz-block-dirty-bitmap-remove",
    "arguments": {
        "node": "drive-scsi0-0-0-0",
        "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
    }
}

Here is a list of the bitmap commands used in the implementation but not yet
in upstream (AFAIK):

x-vz-block-dirty-bitmap-remove
x-vz-block-dirty-bitmap-merge
x-vz-block-dirty-bitmap-disable
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing the
                                most recent checkpoint)
x-vz-nbd-server-add-bitmap

*Restore operation nuances*

As written above, to restore a domain one needs to start it in paused state,
export the domain's disks and write them from the backup.
However, qemu currently does not allow exporting disks for write even for a
domain that never starts guest CPUs. We have an experimental qemu command
line option -x-vz-nbd-restore (passed together with the -incoming option) to
fix this.

*Links*

[1] Previous version of RFC
https://www.redhat.com/archives/libvir-list/2017-November/msg00514.html

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list