25.09.2019 16:52, John Snow wrote:
>
>
> On 8/20/19 6:25 PM, John Snow wrote:
>> Hi, downstream here at Red Hat I've been fielding some questions about
>> the usability and feature readiness of Bitmaps (and related features) in
>> QEMU.
>>
>> Here are some questions I answered internally that I am copying to the
>> list for two reasons:
>>
>> (1) To make sure my answers are actually correct, and
>> (2) To share this pseudo-reference with the community at large.
>>
>> This is long, and mostly for reference. There's a summary at the bottom
>> with some todo items and observations about the usability of the feature
>> as it exists in QEMU.
>>
>> Before too long, I intend to send a more summarized "roadmap" mail which
>> details all of the current and remaining work to be done in and around
>> the bitmaps feature in QEMU.
>>
>>
>> Questions:
>>
>>> "What format(s) is/are required for this functionality?"
>>
>> From the QEMU API, any format can be used to create and author
>> incremental backups. The only known format limitations are:
>>
>> 1. Persistent bitmaps cannot be created on any format except qcow2,
>> although there are hooks to add support to other formats at a later date
>> if desired.
>>
>> DANGER CAVEAT #1: Adding bitmaps to QEMU by default creates transient
>> bitmaps instead of persistent ones.
>>
>> Possible TODO: Allow users to 'upgrade' transient bitmaps to persistent
>> ones in case they made a mistake.
>>
>>
>> 2. When using push backups (blockdev-backup, drive-backup), you may use
>> any format as a target format.
>>
>> DANGER CAVEAT #2: without backing file and/or filesystem-less sparse
>> support, these images will be unusable.
>>
>> EXAMPLE: Backing up to a raw file loses allocation information, so we
>> can no longer distinguish between zeroes and unallocated regions. The
>> cluster size is also lost.
>> This file will not be usable without
>> additional metadata recorded elsewhere.*
>>
>> (* This is complicated, but it is in theory possible to do a push backup
>> to e.g. an NBD target with custom server code that saves allocation
>> information to a metadata file, which would allow you to reconstruct
>> backups. For instance, recording in a .json file which extents were
>> written out would allow you to -- with a custom binary -- write this
>> information on top of a base file to reconstruct a backup.)
>>
>>
>> 3. Any format can be used for either shared storage or live storage
>> migrations. There are TWO distinct mechanisms for migrating bitmaps:
>>
>> A) The bitmap is flushed to storage and re-opened on the destination.
>> This is only supported for qcow2 and shared-storage migrations.
>>
>> B) The bitmap is live-migrated to the destination. This is supported for
>> any format and can be used for either shared storage or live storage
>> migrations.
>>
>> DANGER CAVEAT #3: The second bitmap migration technique there is an
>> optional migration capability that must be enabled explicitly.
>> Otherwise, some migration combinations may drop bitmaps.
>>
>> Matrix:
>>
>>> migrate = migrate_capability or (persistent and shared_storage)
>>
>> Enumerated:
>>
>> live-storage   + raw   : transient  + no-capability: Dropped
>> live-storage   + raw   : transient  + bm-capability: Migrated
>> live-storage   + qcow2 : transient  + no-capability: Dropped
>> live-storage   + qcow2 : transient  + bm-capability: Migrated
>> live-storage   + qcow2 : persistent + no-capability: Dropped (!)
>> live-storage   + qcow2 : persistent + bm-capability: Migrated
>>
>> shared-storage + raw   : transient  + no-capability: Dropped
>> shared-storage + raw   : transient  + bm-capability: Migrated
>> shared-storage + qcow2 : transient  + no-capability: Dropped
>> shared-storage + qcow2 : transient  + bm-capability: Migrated
>> shared-storage + qcow2 : persistent + no-capability: Migrated
>> shared-storage + qcow2 : persistent + bm-capability: Migrated
>>
>> Enabling the bitmap migration capability will ALWAYS migrate the bitmap.
>> If it's disabled, we will only migrate the bitmaps for shared storage
>> migrations where the bitmap is persistent, which is a qcow2-only case.
>>
>> There is no warning or error if you attempt to migrate in a manner that
>> loses your bitmaps.
>>
>> (I might be persuaded to add a case for when you are doing a live
>> storage migration of qcow2 with persistent bitmaps, which is somewhat a
>> conflicting case: you've asked for the bitmap to be persistent, but it
>> seems likely that if this ever happens in practice, it's because you
>> have neglected to ask for it to be migrated to the new host.)
>>
>> See iotest 169 for more details on this.
>>
>> At present, these are the only format limitations I am consciously aware
>> of. From a management API/GUI perspective, it makes sense to restrict
>> the feature set to "qcow2 only" to minimize edge cases.
>>
>>
>>> "Is libvirt aware of these 'gotcha' cases?"
>>
>> From talks I've had with Eric Blake and Peter Krempa, they certainly are
>> now.
>>
>>
>>> "Is it possible to make persistent the default?"
>>
>> Not quickly.
>>
>> In QEMU, not without a deprecation period or some other incompatibility.
>> Default values are not (yet?) introspectable via the schema. We need
>> (possibly) up to two QAPI extensions:
>>
>> I) The ability to return deprecation warnings when issuing a command
>> that will cease to work in the future.
>>
>> This has been discussed somewhat on-list recently.
>> It seems like
>> there is not a big appetite for tackling something perceived as
>> low-value because it is likely to be ignored.
>>
>> II) The ability to document default values in the QAPI schema for the
>> purposes of introspection.
>>
>> With one or both of these extensions, we could remove the default value
>> for persistence and promote it to a required argument with a
>> transitional period where it will work with a warning. Then, in the
>> future, users will be forced to specify if they want persistent=true or
>> persistent=false.
>>
>> This is not on my personal roadmap to implement.
>>
>>
>>> "Is it possible to make bitmap migration the default?"
>>
>> I don't know at present. Migration capabilities are either "on" or "off"
>> and the existing negotiation mechanisms for capabilities are extremely
>> rudimentary.
>>
>> Changing this might require fiddling with machine compat properties,
>> adding features to the migration protocol, or more. I don't know what I
>> don't know, so I will estimate this change as likely invasive.
>>
>> I've discussed this with David Gilbert and it seems like a complicated
>> project for the benefit of this sub-project alone, so this isn't on my
>> personal roadmap to resolve.
>>
>> The general consensus appears to be that protecting the user is
>> libvirt's job.
>>
>>
>>> "Where do we stand with external snapshot support?"
>>
>> Still broken. In the aftermath of 4.1, it's the most obvious outstanding
>> broken feature. Vladimir has patches to fix it, but they need some
>> attention.
>>
>
> It looks as if the fix is a little risky, but the correct fix is
> going to be much harder. Our reopen support simply does not accommodate
> images needing to write dirty bits on open in a hierarchical graph.

I tried the hard way; you can look through the previous versions of the
series. Kevin disliked it.

>
>>
>>> "What needs to happen to bitmaps when doing stream or commit?"
>>
>> Uncertain in QEMU; creating an external snapshot implicitly ends the
>> timeslice represented by the old bitmap, but an explicit checkpoint is
>> better.
>>
>> I think some little ascii charts will help people understand what we're
>> talking about here, so let's cover some examples.
>>
>>
>> SCENARIO 1)
>>
>> Here's a timeline for a single node (one image, no backing files), with
>> some points in time highlighted:
>>
>> Time T = 0.........................n
>> +rec:    [--X------Y------Z--------]
>> -rec:    [---------x------y--------]
>> region:  [aabbbbbbbcccccccddddddddd]
>>
>>
>> X, Y, and Z are points in time where bitmaps 'x', 'y', and 'z' were
>> created and began recording. x and y are points in time where bitmaps
>> 'x' and 'y' stopped recording.
>>
>> This creates a few distinct regions / timeslices.
>>
>> a: Data written before we began tracking writes.
>> b: Data written to bitmap 'x'
>> c: Data written to bitmap 'y'
>> d: Data written to bitmap 'z'
>>
>> region 'a' is of an unknown size and indeterminate length, because there
>> is no reference point (checkpoint) prior to it.
>>
>> regions 'b' and 'c' are of finite size and determinate length, because
>> they have fixed reference points on either side of their timeslice.
>>
>> region 'd' is also of an unknown size and indeterminate length, because
>> it is actively recording and has no checkpoint to its right. It may be
>> fixed at any time by disabling bitmap 'z'.
>>
>> In QEMU, generally what we want to do is to do several things at one
>> atomic moment to keep these regions adjacent, contiguous, and disjoint.
>> So from a high-level (using a fictional simplified syntax), we do:
>>
>> Transaction(
>>     create('y'),
>>     disable('x'),
>>     backup('x')
>> )
>>
>> which together performs a backup+checkpoint.
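
For readers who want to map the fictional syntax above onto QMP, the
backup+checkpoint step might look roughly like the sketch below. It is an
illustration only and not part of the original mail: the node names
("drive0", "backup-target"), the job id, and the helper name are
assumptions, and the action layout is based on the QMP `transaction`,
`block-dirty-bitmap-add`, `block-dirty-bitmap-disable`, and
`blockdev-backup` commands as they exist around QEMU 4.1.

```python
import json


def backup_and_checkpoint(node, target, old_bitmap, new_bitmap):
    """Build a QMP 'transaction' that backs up old_bitmap and starts new_bitmap.

    All names passed in are placeholders chosen by the caller.
    """
    actions = [
        # create('y'): start recording the next timeslice.
        # persistent=True only works on qcow2 (see caveat #1).
        {"type": "block-dirty-bitmap-add",
         "data": {"node": node, "name": new_bitmap, "persistent": True}},
        # disable('x'): freeze the timeslice that is about to be backed up.
        {"type": "block-dirty-bitmap-disable",
         "data": {"node": node, "name": old_bitmap}},
        # backup('x'): copy only the clusters recorded in the old bitmap.
        {"type": "blockdev-backup",
         "data": {"job-id": "backup-" + old_bitmap,
                  "device": node,
                  "target": target,
                  "sync": "incremental",
                  "bitmap": old_bitmap}},
    ]
    return {"execute": "transaction", "arguments": {"actions": actions}}


# Print the JSON that a management tool would send over the QMP socket.
command = backup_and_checkpoint("drive0", "backup-target", "x", "y")
print(json.dumps(command, indent=2))
```

The sketch only builds the command; whether the old bitmap should be kept
or consumed once the job finishes is a policy question it does not try to
answer.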
>>
>> We can do a backup without a checkpoint:
>>
>> 4.1:
>> Transaction(
>>     create('tmp'),
>>     merge('tmp', 'x'),
>>     backup('tmp')
>> )
>>
>> 4.2:
>>> backup('x', bitmap_sync=never)
>>
>> Or a checkpoint without a backup:
>>
>> Transaction(
>>     create('y'),
>>     disable('x')
>> )
>>
>
> Concerning the following scenario:
>
>>
>> SCENARIO 2)
>>
>> Now, what happens when we make an external snapshot and do nothing at
>> all to our bitmaps?
>>
>> Time T = 0.......................................n
>> +rec:    [--X------Y------Z--------] <-- [-------]
>> -rec:    [---------x------y--------] <-- [-------]
>> region:  [aabbbbbbbcccccccddddddddd] <-- [eeeeeee]
>>          {          base           } <-- {  top  }
>>
>> We've created a new implicit timeslice, "e" without creating a new
>> bitmap. Because the bitmap 'z' was still active at the time of the
>> snapshot, it now has a temporarily-determinate endpoint to its region.
>>
>> This is kind of like an "implied checkpoint", but it's a very poor one
>> because it's not really addressable.
>>
>> DANGER CAVEAT #4: We have no way to create incremental backups anymore,
>> because the current moment in time has no addressable point.
>>
>> That's not great; but it is likely a fixable scenario when commit is
>> fixed: committing the top layer back down into the base layer will add
>> all new writes to the end of the old region; restoring our backup chain:
>>
>> Time T = 0.........................C.......n
>> +rec:    [--X------Y------Z-------- -------]
>> -rec:    [---------x------y-------- -------]
>> region:  [aabbbbbbbcccccccddddddddd ddddddd]
>>
>> Here, region 'e' just gets appended to region d, and we can make
>> incremental backups from any checkpoint X, Y, Z to the current moment again.
>>
>
> It's been brought to my attention that oVirt wants to be able to create
> snapshots offline.
>
> It's not clear if they are willing to make these snapshots using
> libvirt's offline support, or if they want to do it using qemu-img directly.
>
> If using libvirt, libvirt will be able to manage bitmaps as it sees fit,
> even offline, using qemu and QMP to manage the images (offline).
>
> If it's the second, this snapshot scenario is the one they will
> encounter, where we have a top layer that has no inherent checkpoint or
> bitmap information.
>
> Ramifications of this were discussed below in the original email:
> [scroll ...]
>
>>
>> SCENARIO 3)
>>
>> What happens if we make a firm checkpoint at the same time we make the
>> snapshot?
>>
>> Transaction(
>>     disable('z'),
>>     snapshot('top'),
>>     create('w')
>> )
>>
>> Time T = 0.........................         ......n
>> +rec:    [--X------Y------Z-------- ] <-- [W------]
>> -rec:    [---------x------y--------z] <-- [-------]
>> region:  [aabbbbbbbcccccccddddddddd ] <-- [eeeeeee]
>>          {           base           } <-- {  top  }
>>
>> Now instead of the new region 'e' being implied, it's explicit. We can
>> make backups between any point and the current moment *across* the gap.
>>
>> It was my thought that this was the most preferable method that libvirt
>> should use, but there is some doubt from Peter Krempa. We'll see how it
>> shakes out.
>>
>>
>>
>> There are questions about what QEMU should do by default, without
>> libvirt's help. At the moment, it's "nothing" but there have been
>> questions about "something".
>>
>> Keeping in mind that we likely can't change our existing behavior
>> without some kind of flag, there are still some API/usability questions:
>>
>>
>>> If we create an external snapshot on top of an image with actively
>>> recording bitmaps, should we disable them?
>>
>> We can leave them enabled, but they'll never see any writes. Or we can
>> explicitly disable them. Explicitly disabling them may make more sense
>> to prevent modifying bitmaps accidentally on commit.
>>
>> My guess: No. We should leave them alone; allow checkpoint creation
>> mechanisms to do the disable+create dance for bitmaps as needed.
>>
>> Potential problems: The backing image is read-only, and if we change our
>> mind later, we will need to find a way to re-open the backing image as
>> read-write for the purposes of toggling the recording bit prior to any
>> legitimate guest usage of that node. Then, re-open as RO again.
>>
>>
>>
>>> Should we fork bitmaps (on snapshot)?
>>
>> If a bitmap named 'z' is recording when we create an external snapshot,
>> should that bitmap be *copied* into the top layer?
>>
>> My guess: No.
>>
>> This would allow us to create external snapshots *without* creating a
>> checkpoint, but conceptually that's a nightmare: It would allow for
>> mutually independent creation of snapshots OR checkpoints. This would be
>> hard to corral when undoing a snapshot, for instance.
>>
>> In my opinion, snapshots MUST be checkpoints, and therefore allowing a
>> snapshot without creating a checkpoint is a no-go.
>>
>>
>>> (Should we fork bitmaps) if we're not using checkpoints?
>>
>> If we are using a checkpoint-less paradigm (i.e. the rolling incremental
>> backup using only one bitmap) we might want to copy the bitmap up to
>> make the next incremental backup as if nothing ever happened.
>>
>> However, rolling incremental backups don't need any kind of auto-copy
>> feature. This is possible today:
>>
>>> create('base', 'A')
>>> transact(snapshot('top'), create('top', 'B'))
>>> merge('B', [('base', 'A'), ('top', 'B')])
>>
>> i.e., we create a new bitmap on the top layer, then merge in the old
>> data from the backing file, which remains addressable.
>>
>> Whether the user wants to copy up or not, there are commands that will
>> do that already.
>>
>>
>
> ... this following section covers some ways of avoiding the problems of
> the scenario I replied to above, but mostly in the context of what QEMU
> can do to prevent the scenario -- to which the conclusion was "nothing,"
> especially if snapshots are created without QEMU's facilitation (via
> qemu-img.)
>
>>> Should we create new bitmaps by default when we can?
>>
>> If a backing image has bitmaps, should QEMU automatically create a new
>> bitmap for the top layer? Should it be named something new, something
>> user-provided, or based on existing active bitmaps?
>>
>> If a user creates a new external snapshot with no consideration paid to
>> bitmaps (like "SCENARIO 2" above), they temporarily lose the ability to
>> do incremental backups. They might be able to commit the image back to
>> "try again."
>>
>> That's not great. Here are some options for resolving this:
>>
>> - Automatic names: Might cause collisions out-of-band with management
>> tooling by accident, tooling has to query to discover what bitmaps were
>> automatically created.
>>
>> - Same names: Can create namespace confusion when committing snapshots
>> later; although each layer of a backing chain can have bitmaps named the
>> same thing, it causes future problems when committing together that can
>> be hard to resolve.
>>
>> - User-provided name: This is workable, and requires an amendment to the
>> snapshot command to automatically create a new bitmap on the snapshot.
>>
>>
>> My guess: No, we can't automatically create a new bitmap for the user.
>> We can amend the snapshot commands to accept bitmap names, but at that
>> point we've just duplicated transactions:
>>
>> Transact(
>>     snapshot('top'),
>>     create('top', 'new-bitmap')
>> )
>>
>
> There's one last relevant mitigation discussed further down: [scroll ...]
>
>>
>> All that said (mostly a lot of "No, let's not do anything"), maybe there's
>> room for an "assistive" mode for users, a bitmap-aware snapshot creation
>> command. It could do the following well-defined magic:
>>
>> bitmap-snapshot(base, top, bitmap_name):
>>     1. disable any active bitmaps in the base node.
>>     2. create a bitmap named "bitmap_name" in the top node, failing if
>>        a bitmap by that name already exists on either node.
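
The two steps of the proposed bitmap-snapshot command could be prototyped
outside QEMU as a small helper that builds one transaction. This is a rough
sketch under stated assumptions, not an existing QEMU or libvirt API: the
helper name and its arguments are hypothetical, the bitmap lists are assumed
to have been queried from QEMU beforehand, and the overlay node is assumed
to be attached already so that `blockdev-snapshot` can install it.

```python
import json


def bitmap_snapshot(base, top, bitmap_name, base_bitmaps, top_bitmaps=()):
    """Build a QMP 'transaction' for a bitmap-aware snapshot (hypothetical).

    base_bitmaps maps each bitmap name on the base node to True if it is
    currently recording. top_bitmaps lists names already on the overlay.
    """
    # Fail if the requested name already exists on either node.
    if bitmap_name in base_bitmaps or bitmap_name in top_bitmaps:
        raise ValueError("bitmap %r already exists on base or top" % bitmap_name)

    # Step 1: disable every bitmap that is still recording on the base node.
    actions = [
        {"type": "block-dirty-bitmap-disable",
         "data": {"node": base, "name": name}}
        for name, recording in sorted(base_bitmaps.items())
        if recording
    ]
    # Install the overlay on top of the base node.
    actions.append({"type": "blockdev-snapshot",
                    "data": {"node": base, "overlay": top}})
    # Step 2: create the new, enabled bitmap on the top node.
    actions.append({"type": "block-dirty-bitmap-add",
                    "data": {"node": top, "name": bitmap_name}})
    return {"execute": "transaction", "arguments": {"actions": actions}}


# Example: 'a' and 'c' are recording, 'b' is already disabled.
command = bitmap_snapshot("base", "top", "d", {"a": True, "b": False, "c": True})
print(json.dumps(command, indent=2))
```

Doing the name check in the caller keeps the whole operation a single
transaction, so either every bitmap is switched over or nothing changes.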
>>
>> What this accomplishes:
>> - Disables any bitmaps in the base layer ahead of time, in preparation
>> for an eventual commit operation.
>> - Always creates a new, enabled bitmap on the snapshot node which is
>> guaranteed not to conflict with a name on the base node. This bitmap can
>> be used to create additional copies post-hoc, if desired.
>> - Formalizes our "best practice" suggestion for mixing bitmaps and
>> snapshots into a single, documented command.
>>
>> Is this strictly needed? No, if you have the foresight, you can do this
>> instead:
>>
>> Transact(
>>     disable('a'),
>>     disable('b'),
>>     disable('c'),
>>     # plus however many more ...
>>     snapshot('top', ...),
>>     create('top', 'd')
>> )
>>
>> but a convenience command might still have a role to play in helping
>> take the guesswork out for non-libvirt users.
>>
>>
>>
>> That's the bulk of what was discussed.
>>
>> Summary:
>>
>>
>> GOTCHAs:
>> #1: Bitmaps are created non-persistent by default, and can't be changed.
>>
>> #2: Push backup destination formats will happily back up to a format
>> that isn't semantically useful.
>>
>> #3: Migrating non-shared block storage can drop even persistent bitmaps
>> if you don't pass the bitmap migration capability flag to both QEMU
>> instances.
>>
>> #4: Creating a snapshot without doing some bitmap manipulation
>> beforehand can temporarily render your bitmaps unusable. Failing to
>> disable bitmaps before creating a snapshot might make commits very
>> tricky later on.
>>
>> Gotchas 1 and 4 can be at least partially alleviated. Gotcha 2 remains a
>> pain point we cannot intercept at the QEMU layer. Gotcha 3 has potential
>> remedies, but they are complicated.
>>
>>
>> QEMU todo items:
>> - Fix bitmap data corruption on commit (Ongoing, by Vladimir@Virtuozzo)
>>
>> - Add a set_persistence method for bitmaps that allows us to change the
>> storage class of a bitmap after creation. (Helps alleviate gotcha #1.)
>>
>> - Add a command that allows us to merge allocation data into a bitmap.
>> This helps alleviate gotcha #4: If we create a new image but neglected
>> to do the proper transaction dance, we can simply copy the allocation
>> data into a new bitmap. (Note, we'd still need set_persistence to help
>> us disable the old bitmap before any commit happens.)
>>
>
> ... This was perceived at the time to be an unnecessary convenience
> feature, because the belief was that libvirt should simply prevent this
> from happening in the first place.
>
> However, if we acknowledge that snapshots may be made without libvirt's
> help, this is a quick and easy way to "fix" checkpoint consistency post-hoc.

Still, even without libvirt, a management tool should prevent this from
happening. Or are we talking about an end user running qemu-img by hand,
without any management layer?

And I'm still sure that qemu-img is the wrong tool, and that it is better
to use qemu in a stopped state for offline manipulations. But I'm not
opposed to the idea; it should work, of course.

>
> --js
>
>> - Add convenience command for easy + safe combination of bitmaps +
>> snapshots. Helps prevent #4.
>>
>>
>> Research items:
>> - How hard is it to reopen a backing image as RW while it's in-use,
>> disable a bitmap, and then reopen as RO? This is to partially address
>> gotcha #4; if we forget to disable bitmaps before creating the snapshot.
>>
>> - How hard is the reverse operation? Can we reopen a backing image RW,
>> enable a bitmap, and then reopen as RO? This gives us better control
>> over what happens on commit.
>>
>> - After we fix the commit bug, what does/should commit actually do with
>> bitmaps? What about bitmaps that collide? The current behavior is that
>> bitmaps don't transfer from top to base. Any bitmaps active in the
>> base record all the new writes from the top.
>>
>>
>> That's all!
>> --js
>>

--
Best regards,
Vladimir