[BCC'ing those who have responded to earlier RFCs]
I've posted previous RFCs for improving snapshot support:
ideas on managing a subset of disks:
https://www.redhat.com/archives/libvir-list/2011-May/msg00042.html
ideas on managing snapshots of storage volumes not tied to a domain:
https://www.redhat.com/archives/libvir-list/2011-June/msg00761.html
After re-reading the feedback received on those threads, I think I've
settled on a pretty robust design for my first round of adding
improvements to the management of snapshots tied to a domain, while
leaving the door open for future extensions.
Sorry this email is so long (I've had it open in my editor for more than
48 hours now as I keep improving it), but hopefully it is worth the
effort to read. See the bottom if you want the shorter summary on the
proposed changes.
First, some definitions:
========================
disk snapshot: the state of a virtual disk captured at a given point in
time; once a snapshot exists, it is possible to track a delta of changes
that have happened since that time.
internal disk snapshot: a disk snapshot where both the saved state and
delta reside in the same file (possible with qcow2 and qed). If a disk
image is not in use by qemu, this is possible via 'qemu-img snapshot -c'.
external disk snapshot: a disk snapshot where the saved state is one
file, and the delta is tracked in another file. For a disk image not in
use by qemu, this can be done with qemu-img to create a new qcow2 file
wrapping any type of existing file. Recent qemu has also learned the
'snapshot_blkdev' monitor command for creating external snapshots while
qemu is using a disk, and the goal of this RFC is to expose that
functionality from within existing libvirt APIs.
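For an image not in use by qemu, the two flavors can be sketched with
qemu-img (file names are placeholders; exact options vary by qemu-img
version):

```sh
# Internal disk snapshot: saved state and delta share one qcow2 file.
qemu-img snapshot -c mysnap disk.qcow2

# External disk snapshot: the original image becomes a read-only backing
# file of a new qcow2 wrapper, which then accumulates the delta.
qemu-img create -f qcow2 -b /path/to/disk.img /path/to/disk.snap.qcow2
```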
saved state: all non-disk information used to resume a guest at the same
state, assuming the disks did not change. With qemu, this is possible
via migration to a file.
checkpoint: a combination of saved state and a disk snapshot. With
qemu, the 'savevm' monitor command creates a checkpoint using internal
snapshots. It may also be possible to combine saved state and disk
snapshots created while the guest is offline for a form of
checkpointing, although this RFC focuses on disk snapshots created while
the guest is running.
snapshot: can be either 'disk snapshot' or 'checkpoint'; the rest of
this email will attempt to use 'snapshot' where either form works, and a
qualified term where only one form is meant.
Existing libvirt functionality
==============================
The virDomainSnapshotCreateXML API currently manages a hierarchy of
"snapshots", although it is so far only used for "checkpoints"; every
snapshot has a name and a possibly empty parent. The idea is that
once a domain has a snapshot, there is always a current snapshot, and
all new snapshots are created with a parent of a previously existing
snapshot (although there are still some bugs to be fixed in managing the
current snapshot over a libvirtd restart). It is possible to have
disjoint hierarchies, if you delete a root snapshot that had more than
one child (making both children become independent roots). The snapshot
hierarchy is maintained by libvirt (in a typical installation, the files
in /var/lib/libvirt/qemu/snapshot/<dom>/<name> track each named
snapshot, using <domainsnapshot> XML), using additional metadata not
present in the qcow2 internal snapshot format (that is, while qcow2 can
maintain multiple snapshots, it does not maintain relations between
them). Remember, the "current" snapshot is not the current machine
state, but the snapshot that would become the parent if you create a new
snapshot; perhaps we could have named it the "loaded" snapshot, but the
API names are set in stone now.
Libvirt also has APIs for listing all snapshots, querying the current
snapshot, reverting back to the state of another snapshot, and deleting
a snapshot. Deletion comes with a choice of deleting just that named
version (removing one node in the hierarchy and re-parenting all
children) or that tree of the hierarchy (that named version and all
children).
Since qemu checkpoints can currently only be created via internal disk
snapshots, libvirt has not had to track any file name relationships - a
single "snapshot" corresponds to a qcow2 snapshot name within all qcow2
disks associated to a domain; furthermore, snapshot creation was limited
to domains where all modifiable disks were already in qcow2 format.
However, these "checkpoints" could be created on either running domains
(qemu savevm) or inactive domains (qemu-img snapshot -c), with the
latter technically being a case of just internal disk snapshots.
Libvirt currently has a bug in that it only saves <domain>/<uuid> rather
than the full domain xml along with a checkpoint - if any devices are
hot-plugged (or in the case of offline snapshots, if the domain
configuration is changed) after a snapshot but before the revert, then
things will most likely blow up due to the differences in devices in use
by qemu vs. the devices expected by the snapshot.
Reverting to a snapshot can also be considered as a form of data loss -
you are discarding the disk changes and ram state that have happened
since the last snapshot. To some degree, this is by design - the very
nature of reverting to a snapshot implies throwing away changes;
however, it may be nice to add a safety valve so that by default,
reverting to a live checkpoint from an offline state works, but
reverting from a running domain should require some confirmation that it
is okay to throw away accumulated running state.
Libvirt also currently has a limitation where snapshots are local to one
host - the moment you migrate a snapshot to another host, you have lost
access to all snapshot metadata.
Proposed enhancements
=====================
Note that these proposals merely add xml attribute and subelement
extensions, as well as API flags, rather than creating any new API,
which makes it a nice candidate for backporting the patch series based
on this RFC into older releases as appropriate.
Creation
++++++++
I propose reusing the virDomainSnapshotCreateXML API and
<domainsnapshot> xml for both "checkpoints" and "disk snapshots", all
maintained within a single hierarchy. That is, the parent of a disk
snapshot can be a checkpoint or another disk snapshot, and the parent of
a checkpoint can be another checkpoint or a disk snapshot. And, since I
defined "snapshot" to mean either "checkpoint" or "disk snapshot", this
single hierarchy of "snapshots" will still be valid once it is expanded
to include more than just "checkpoints". Since libvirt already has to
maintain additional metadata to track parent-child relationships between
snapshots, it should not be hard to augment that XML to store additional
information needed to track external disk snapshots.
The default is that virDomainSnapshotCreateXML(,0) creates a checkpoint,
while leaving qemu running; I propose two new flags to fine-tune things:
virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_HALT) will
create the checkpoint then halt the qemu process, and
virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY) will
create a disk snapshot rather than a checkpoint (on qemu, by using a
sequence including the new 'snapshot_blkdev' monitor command).
Specifying both flags at once is a form of data loss (you are losing the
ram state), and I suspect it will be rarely used, but since it may be
worthwhile for testing whether a disk snapshot is truly
crash-consistent, I won't refuse the combination.
Other flags may be added in the future; I know of at least two features
in qemu that may warrant some flags once they are stable:

1. A guest agent fsfreeze/fsthaw command will allow the guest to get the
file system into a stable state prior to the snapshot, meaning that
reverting to that snapshot can skip any fsck or journal replay actions.
Of course, this is a best-effort attempt, since guest agent interaction
is untrustworthy (comparable to memory ballooning - the guest may not
support the agent or may intentionally send falsified responses over the
agent), so the agent should only be used when explicitly requested -
this would be done with a new flag
VIR_DOMAIN_SNAPSHOT_CREATE_GUEST_FREEZE.

2. There is talk of adding a qemu monitor command to freeze just I/O to
a particular subset of disks, rather than the current approach of having
to pause all vcpus before doing a snapshot of multiple disks. Once that
is added, libvirt should use the new monitor command by default, but for
compatibility testing, it may be worth adding
VIR_DOMAIN_SNAPSHOT_CREATE_VCPU_PAUSE to require a full vcpu pause
instead of the faster iopause mechanism.
My first xml change is that <domainsnapshot> will now always track the
full <domain> xml (prior to any file modifications), normally as an
output-only part of the snapshot (that is, a <domain> subelement of
<domainsnapshot> will always be present in virDomainSnapshotGetXMLDesc,
but is generally ignored in virDomainSnapshotCreateXML - more on this
below). This gives us the capability to use XML ABI compatibility
checks (similar to those used in virDomainMigrate2,
virDomainRestoreFlags, and virDomainSaveImageDefineXML). And, given
that the full <domain> xml is now present in the snapshot metadata, this
means that we need to add virDomainSnapshotGetXMLDesc(snap,
VIR_DOMAIN_XML_SECURE), so that any security-sensitive data doesn't leak
out to read-only connections.
Right now, domain ABI compatibility is only checked for
VIR_DOMAIN_XML_INACTIVE contents of xml; I'm thinking that the snapshot
<domain> will always be the inactive version (sufficient for starting a
new qemu), although I may end up changing my mind and storing the active
version (when attempting to revert from a live qemu to another live
checkpoint, all while using a single qemu process, the ABI compatibility
checking may need enhancements to discover differences that are not
visible in the inactive xml but are fatally different in the active xml
when using 'loadvm', yet do not matter to virsh save/restore where a new
qemu process is created every time).
Next, we need a way to control which subset of disks is involved in a
snapshot command. Previous mail has documented that for ESX, the
decision can only be made at boot time - a disk can be persistent
(involved in snapshots, and saves changes across domain boots);
independent-persistent (is not involved in snapshots, but saves changes
across domain boots); or independent-nonpersistent (is not involved in
snapshots, and all changes during a domain run are discarded when the
domain quits). In <domain> xml, I will represent this by two new
optional attributes:
<disk snapshot='no|external|internal' persistent='yes|no'>...</disk>
For now, qemu will reject snapshot=internal (the snapshot_blkdev monitor
command does not yet support it, although it was documented as a
possible extension); I'm not sure whether ESX supports external,
internal, or both. Likewise, both ESX and qemu will reject
persistent=no unless snapshot=no is also specified or implied (it makes
no sense to create a snapshot if you know the disk will be thrown away
on next boot), but keeping the options orthogonal may prove useful for
some future extension. If either option is omitted, the default for
snapshot is 'no' if the disk is <shared> or <readonly> or persistent=no,
and 'external' otherwise; and the default for persistent is 'yes' for
all disks (domain_conf.h will have to represent nonpersistent=0 for
easier coding with sane 0-initialized defaults, but no need to expose
that ugly name in the xml). I'm not sure whether to reject an explicit
persistent=no coupled with <readonly>, or just ignore it (if the disk is
readonly, it can't change, so there is nothing to throw away after the
domain quits). Creation of an external snapshot requires rewriting the
active domain XML to reflect the new filename.
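To make the defaulting rules above concrete, here is a small Python
sketch (my own illustration, not actual libvirt code) of how the
per-disk defaults could be computed:

```python
def default_snapshot_attrs(shared=False, readonly=False,
                           snapshot=None, persistent=None):
    """Compute the effective (snapshot, persistent) pair for a <disk>.

    Mirrors the proposed rules: persistent defaults to 'yes'; snapshot
    defaults to 'no' for shared, readonly, or persistent=no disks, and
    to 'external' otherwise.  Explicit attributes are left untouched.
    """
    if persistent is None:
        persistent = 'yes'
    if snapshot is None:
        if shared or readonly or persistent == 'no':
            snapshot = 'no'
        else:
            snapshot = 'external'
    return snapshot, persistent
```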
While ESX can only select the subset of disks to snapshot at boot time,
qemu can alter the selection at runtime. Therefore, I propose also
modifying the <domainsnapshot> xml to take a new subelement <disks> to
fine-tune which disks are involved in a snapshot. For now, a checkpoint
must omit <disks> on virDomainSnapshotCreateXML input (that is, <disks>
must only be present if the VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY flag is
used, since checkpoints always cover full system state, and on qemu a
checkpoint uses internal snapshots). Meanwhile, for disk snapshots, if
the <disks> element is omitted, then one is automatically created using
the attributes in the <domain> xml. For ESX, if the <disks> element is
present, it must select the same disks as the <domain> xml. Offline
checkpoints will continue to use <state>shutoff</state> in the xml
output, while new disk snapshots will use <state>disk-snapshot</state>
to indicate that the disk state was obtained from a running VM and might
be only crash-consistent rather than stable.
The <disks> element may contain zero or more <disk> subelements; at
most one per <disk> in the <devices> section of <domain>. Each <disk>
element has a mandatory attribute name='name', which must match the
<target dev='name'/> of the <domain> xml, as a way of getting 1:1
correspondence between domainsnapshot/disks/disk and domain/devices/disk
while using names that should already be unique. Each <disk> also has
an optional snapshot='no|internal|external' attribute, similar to the
proposal for <domain>/<devices>/<disk>; if not provided, the attribute
defaults to the one from the <domain>. If snapshot=external, then there
may be an optional subelement <source file='path'/>, which gives the
desired new file name. If external is requested, but the <source>
subelement is not present, then libvirt will generate a suitable
filename, probably by concatenating the existing name with the snapshot
name, and remembering that the snapshot name is generated as a timestamp
if not specified. Also, for external snapshots, the <disk> element may
have an optional sub-element specifying the driver (useful for selecting
qcow2 vs. qed in the qemu 'snapshot_blkdev' monitor command); again,
this can normally be generated by default.
Future extensions may include teaching qemu to allow coupling
checkpoints with external snapshots by allowing a <disks> element even
for checkpoints. (That is, the initial implementation will always
output <disks> for <state>disk-snapshot</state> and never output <disks>
for <state>shutoff</state>, but this may not always hold in the future.)
Likewise, we may discover when implementing lvm or btrfs snapshots
that additional subelements to each <disk> would be useful for
specifying additional aspects for creating snapshots using that
technology, where the omission of those subelements has a sane default
state.
libvirt can be taught to honor persistent=no for qemu by creating a
qcow2 wrapper file prior to starting qemu, then tearing down that
wrapper after the fact, although I'll probably leave that for later in
my patch series.
As an example, a valid input <domainsnapshot> for creation of a qemu
disk snapshot would be:
<domainsnapshot>
  <name>snapshot</name>
  <disks>
    <disk name='vda'/>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>
which requests that the <disk> matching the target dev=vda defer to the
<domain> default for whether to snapshot (and if the domain default
requires creating an external snapshot, then libvirt will create the new
file name; this could also be specified by omitting the <disk
name='vda'/> subelement altogether); the <disk> matching vdb is not
snapshotted, and the <disk> matching vdc is involved in an external
snapshot where the user specifies the new filename of /path/to/new. On
dumpxml output, the output will be fully populated with the items
generated by libvirt, and be displayed as:
<domainsnapshot>
  <name>snapshot</name>
  <state>disk-snapshot</state>
  <parent>
    <name>prior</name>
  </parent>
  <creationTime>1312945292</creationTime>
  <domain>
    <!-- previously just uuid, but now the full domain XML,
         including... -->
    ...
    <devices>
      <disk type='file' device='disk' snapshot='external'>
        <driver name='qemu' type='raw'/>
        <source file='/path/to/original'/>
        <target dev='vda' bus='virtio'/>
      </disk>
      ...
    </devices>
  </domain>
  <disks>
    <disk name='vda' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
    </disk>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>
And, if the user were to do 'virsh dumpxml' of the domain, they would
now see the updated <disk> contents:
<domain>
  ...
  <devices>
    <disk type='file' device='disk' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    ...
  </devices>
</domain>
Reverting
+++++++++
When it comes to reverting to a snapshot, the only time it is possible
to revert to a live image is if the snapshot is a "checkpoint" of a
running or paused domain, because qemu must be able to restore the ram
state. Reverting to any other snapshot (both the existing "checkpoint"
of an offline image, which uses internal disk snapshots, and my new
"disk snapshot" which uses external disk snapshots even though it was
created against a running image), will revert the disks back to the
named state, but default to leaving the guest in an offline state. Two
new mutually exclusive flags will make it possible to revert to the
snapshot disk state while also controlling the resulting qemu state:
virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_START) to run
from the snapshot, and virDomainRevertToSnapshot(snap,
VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE) to create a new qemu process but leave
it paused. If neither of these two flags is specified, then the default
will be determined by the snapshot itself. These flags also allow
overriding the running/paused aspect recorded in live checkpoints. Note
that I am not proposing a flag for reverting to just the disk state of a
live checkpoint; this is considered an uncommon operation, and can be
accomplished in two steps by reverting to paused state to restore disk
state followed by destroying the domain (but I can add a third
mutually-exclusive flag VIR_DOMAIN_SNAPSHOT_REVERT_STOP if we decide
that we really want this uncommon operation via a single API).
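For example, the two-step alternative could look like this in virsh (a
sketch only - the exact virsh option spellings for the new flags are
hypothetical at this point):

```sh
# Step 1: revert disk state, creating a new qemu process but leaving it
# paused (REVERT_PAUSE); FORCE acknowledges discarding the current
# running state.
virsh snapshot-revert dom snap --paused --force

# Step 2: throw away the paused run state, keeping only the reverted disks.
virsh destroy dom
```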
Reverting from a stopped state is always allowed, even if the XML is
incompatible, by basically rewriting the domain's xml definition.
Meanwhile, reverting from an online VM to a live checkpoint has two
flavors - if the XML is compatible, then the 'loadvm' monitor command
can be used, and the qemu process remains alive. But if the XML has
changed incompatibly since the checkpoint was created, then libvirt will
refuse to do the revert unless it has permission to start a new qemu
process, via another new flag: virDomainRevertToSnapshot(snap,
VIR_DOMAIN_SNAPSHOT_REVERT_FORCE). The new REVERT_FORCE flag also
provides a safety valve - reverting to a stopped state (whether an
existing offline checkpoint, or a new disk snapshot) from a running VM
will be rejected unless REVERT_FORCE is specified. For now, this
includes the case of using the REVERT_START flag to revert to a disk
snapshot and then start qemu - this is because qemu does not yet expose
a way to safely revert to a disk snapshot from within the same qemu
process. If, in the future, qemu gains support for undoing the effects
of 'snapshot_blkdev' via monitor commands, then it may be possible to
use REVERT_START without REVERT_FORCE and end up reusing the same qemu
process while still reverting to the disk snapshot state, by using some
of the same tricks as virDomainReboot to force the existing qemu process
to boot from the new disk state.
Of course, the new safety valve is a slight change in behavior - scripts
that used to use 'virsh snapshot-revert' may now have to use 'virsh
snapshot-revert --force' to do the same actions; for backwards
compatibility, the virsh implementation should first try without the
flag, and a new VIR_ERR_* code should be introduced in order to let
virsh distinguish between a new implementation that rejected the revert
because _REVERT_FORCE was missing, and an old one that does not support
_REVERT_FORCE in the first place. But this is not the first time that
added safety valves have caused existing scripts to have to adapt -
consider the case of 'virsh undefine' which could previously pass in a
scenario where it now requires 'virsh undefine --managed-save'.
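The suggested virsh fallback could follow a retry pattern like this
(a sketch with hypothetical names; ERR_REVERT_NEEDS_FORCE stands in for
whatever new VIR_ERR_* code is ultimately chosen):

```python
# Hypothetical error codes standing in for the real VIR_ERR_* values.
ERR_REVERT_NEEDS_FORCE = 'needs-force'  # new code: revert rejected, retry OK
ERR_NO_SUPPORT = 'no-support'           # old server: flag semantics unknown

def snapshot_revert(revert, snap, force_flag):
    """Try the revert without _REVERT_FORCE first; retry with the flag
    only if the server rejected the call with the new 'force required'
    error code.  revert(snap, flags) returns None on success, else an
    error code."""
    err = revert(snap, 0)
    if err is None:
        return True
    if err == ERR_REVERT_NEEDS_FORCE:
        return revert(snap, force_flag) is None
    # Any other error (including ERR_NO_SUPPORT from an old server) is
    # reported to the user unchanged.
    return False
```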
For transient domains, it is not possible to make an offline checkpoint
(since transient domains don't exist if they are not running or paused);
transient domains must use REVERT_START or REVERT_PAUSE to revert to a
disk snapshot. And given the above limitations about qemu, reverting to
a disk snapshot will currently require REVERT_FORCE, since a new qemu
process will necessarily be created.
Just as creating an external disk snapshot rewrote the domain xml to
match, reverting to an older snapshot will update the domain xml (it
should be a bit more obvious now why the
<domainsnapshot>/<domain>/<devices>/<disk> lists the old name, while
<domainsnapshot>/<disks>/<disk> lists the new name).
The other thing to be aware of is that with internal snapshots, qcow2
maintains a distinction between current state and a snapshot - that is,
qcow2 is _always_ tracking a delta, and never modifies a named snapshot,
even when you use 'qemu-img snapshot -a' to revert to different snapshot
names. But with external snapshot files, the original file now becomes
a read-only backing file to a new active file; if we revert to the
original file and make any modifications to it, the active file that was
using it as backing will be corrupted. Therefore, the safest thing is
to reject any attempt to revert to any snapshot (whether checkpoint or
disk snapshot) that has an existing child snapshot consisting of an
external disk snapshot. The metadata for each of these children can be
deleted manually, but that requires quite a few API calls (learn how
many children exist, get the list of children, and for each child, get
its xml to see if that child has the target snapshot as a parent, and if
so delete the snapshot). So as shorthand, virDomainRevertToSnapshot
will be taught a new flag, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN,
which first deletes any children of the snapshot being reverted to,
prior to performing the revert.
And as long as reversion is learning how to do some snapshot deletion,
it becomes possible to decide what to do with the qcow2 file that was
created at the time of the disk snapshot. The default behavior for qemu
will be to use qemu-img to recreate the qcow2 wrapper file as a 0-delta
change against the original file, and keeping the domain xml tied to the
wrapper name, but a new flag VIR_DOMAIN_SNAPSHOT_REVERT_DISCARD can be
used to instead completely delete the qcow2 wrapper file, and update the
domain xml back to the original filename.
Deleting
++++++++
Deleting snapshots also needs some improvements. With checkpoints, the
disk snapshot contents were internal snapshots, so no files had to be
deleted. But with external disk snapshots, there are some choices to be
made - when deleting a snapshot, should the two files be consolidated
back into one or left separate, and if consolidation occurs, what should
be the name of the new file.
Right now, qemu supports consolidation only in one direction - the
backing file can be consolidated into the new file by using the new
blockpull API. In fact, the combination of disk snapshot and block pull
can be used to implement local storage migration - create a disk
snapshot with a local file as the new file around the remote file used
as the snapshot, then use block pull to break the ties to the remote
snapshot.

But there is currently no way to make qemu save the contents of a new
file back into its backing file and then swap back to the backing file
as the live disk; also, while you can use block pull to break the
relation between the snapshot and the live file, and then rename the
live file back over the backing file name, there is no way to make qemu
revert back to that file name short of doing the snapshot/blockpull
algorithm twice; and the end result will be qcow2 even if the original
file was raw.

Also, if qemu ever adds support for merging back into a backing file, as
well as a means to determine how dirty a qcow2 file is in relation to
its backing file, there are some possible efficiency gains - if most
blocks of a snapshot differ from the backing file, it is faster to use
blockpull to pull in the remaining blocks from the backing file to the
active file; whereas if most blocks of a snapshot are inherited from the
backing file, it is more efficient to pull just the dirty blocks from
the active file back into the backing file. Knowing whether the
original file was qcow2 or some other format may also impact how to
merge deltas from the new qcow2 file back into the original file.
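If qemu ever reports how dirty a wrapper file is relative to its backing
file, the direction choice sketched above reduces to something like this
(purely illustrative; the 50% threshold is my own placeholder):

```python
def consolidation_direction(dirty_fraction):
    """Pick the cheaper merge direction for a wrapper + backing pair.

    dirty_fraction: share of the snapshot's blocks that differ from the
    backing file.  If most blocks already differ, pulling the few
    remaining clean blocks forward (blockpull) touches less data; if
    most blocks are still inherited, committing the few dirty blocks
    back into the backing file would be cheaper (future qemu feature).
    """
    if dirty_fraction > 0.5:
        return 'pull-into-active'    # blockpull remaining clean blocks
    return 'commit-into-backing'     # merge dirty blocks back
```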
Additionally, having fine-tuned control over which of the two names to
keep when consolidating a snapshot would require passing that
information through xml, but the existing virDomainSnapshotDelete does
not take an XML argument. For now, I propose that deleting an external
disk snapshot will be required to leave both the snapshot and live disk
image files intact (except for the special case of REVERT_DISCARD
mentioned above that combines revert and delete into a single API); but
I could see the feasibility of a future extension which adds a new XML
<on_delete> subelement to <domainsnapshot>/<disks>/<disk> that
specifies which of the two files to consolidate into, as well as a flag
VIR_DOMAIN_SNAPSHOT_DELETE_CONSOLIDATE which triggers libvirt to do the
consolidation for any <on_delete> subelements in the snapshot being
deleted (if the flag is omitted, the <on_delete> subelement is ignored
and both files remain).
The notion of deleting all children of a snapshot while keeping the
snapshot itself (mentioned above under the revert use case) seems common
enough that I will add a flag VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY;
this flag implies VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN, but leaves the
target snapshot intact.
Undefining
++++++++++
In one regard, undefining a domain that has snapshots is just as bad as
undefining a domain with managed save state - since libvirt is
maintaining metadata about snapshot hierarchies, leaving this metadata
behind _will_ interfere with creation of a new domain by the same name.
However, since both checkpoints and snapshots are stored in
user-accessible disk images, and only the metadata is stored by libvirt,
it should eventually be possible for the user to decide whether to
discard the metadata but keep the snapshot contents intact in the disk
images, or to discard both the metadata and the disk image snapshots.
Meanwhile, I propose changing the default behavior of
virDomainUndefine[Flags] to reject attempts to undefine a domain with
any defined snapshots, and to add a new flag for virDomainUndefineFlags,
virDomainUndefineFlags(,VIR_DOMAIN_UNDEFINE_SNAPSHOTS), to act as
shorthand for calling virDomainSnapshotDelete for all snapshots tied to
the domain. Note that this deletes the metadata, but not the underlying
storage volumes.
Migration
+++++++++
The simplest solution to the fact that snapshot metadata is host-local
is to make migration attempts fail if a domain has any associated
snapshots. For a first cut patch, that is probably what I'll go with -
it reduces libvirt functionality, but instantly plugs all the bugs that
you can currently trigger by migrating a domain with snapshots.
But we can do better. Right now, there is no way to inject the metadata
associated with an already-existing snapshot, whether that snapshot is
internal or external, and deleting internal snapshots always deletes the
data as well as the metadata. But I already documented that external
snapshots will, in most cases, keep both the new file and its read-only
original, which means the data is preserved even when the snapshot is
deleted. With a couple of new flags, we can have
virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY)
which removes libvirt's metadata, but still leaves all the data of the
snapshot present (visible to qemu-img snapshot -l or via multiple file
names); as well as virDomainSnapshotCreateXML(dom, xml,
VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE), which says to add libvirt snapshot
metadata corresponding to existing snapshots without doing anything to
the current guest (no 'savevm' or 'snapshot_blkdev', although it may
still make sense to do some sanity checks to see that the metadata being
defined actually corresponds to an existing snapshot in 'qemu-img
snapshot -l' or that an external snapshot file exists and has the
correct backing file to the original name).
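Until a smarter migration protocol exists, the manual metadata hand-off
could be scripted along these lines (a sketch against stub connection
objects, not the real libvirt bindings; flag values are placeholders for
the proposed VIR_DOMAIN_SNAPSHOT_* constants):

```python
CREATE_REDEFINE = 1 << 0       # stands in for _CREATE_REDEFINE
DELETE_METADATA_ONLY = 1 << 1  # stands in for _DELETE_METADATA_ONLY

def migrate_snapshot_metadata(src, dst, domain):
    """Copy snapshot metadata from src to dst, then drop it on src,
    leaving the actual snapshot data in the disk images untouched."""
    for name in src.list_snapshots(domain):
        xml = src.snapshot_xml(domain, name)
        # Redefine on the destination: metadata only, no savevm or
        # snapshot_blkdev is performed against the guest.
        dst.define_snapshot(domain, xml, CREATE_REDEFINE)
        # Remove only libvirt's metadata on the source; the snapshot
        # contents stay in the qcow2 images.
        src.delete_snapshot(domain, name, DELETE_METADATA_ONLY)
```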
Additionally, with these two tools in place, you can now make
ABI-compatible tweaks to the <domain> xml stored in a snapshot metadata
(similar to how 'virsh save-image-edit' can tweak a save image, such as
changing the host name of a <disk>'s image to match what was done
externally with qemu-img or other external tool). You can also make an
extended protocol that first dumps all snapshot xml on the source,
redefines those snapshots on the destination, then deletes the metadata
on the source, all before migrating the domain itself. (Unfortunately,
I don't think this can be wired into the cookies of migration protocol
v3, as each <domainsnapshot> xml for each snapshot will be larger than
the <domain> itself, and an arbitrary number of snapshots with lots of
xml won't fit into a finite-sized cookie over rpc; ultimately, this may
mean a migration protocol v4 that has an arbitrary number of handshakes
between Begin on the source and Prepare on the dest in order to properly
handle all the interchange - having a feature negotiation between client
and host should be part of that interchange.)
Future proposals
================
I still want to add APIs to manage storage volume snapshots for storage
volumes not associated with a current domain, as well as enhancing disk
snapshots to operate on more than just qcow2 file formats (for example,
lvm snapshots or btrfs copy-on-write clones). But I've already signed
up for quite a bit of code changes in just this email, so that will have
to come later. I hope that what I have designed here does not preclude
extensibility to future additions - for example, <storagevolsnapshot>
would be able to use a single <disk> subelement similar to the above
<domainsnapshot>/<disks>/<disk> subelement for describing the relation
between a disk and its backing file snapshot.
Quick Summary
=============
These are the changes I plan on making soon; I mentioned other possible
future changes above that would depend on these being complete first, or
which involve creation of new API.
The following API patterns currently "succeed", but risk data loss or
other bugs that can get libvirt into an inconsistent state; they will
now fail by default:
virDomainRevertToSnapshot to go from a running VM to a stopped
checkpoint will now fail by default. Justification: stopping a running
domain is a form of data loss. Mitigation: use
VIR_DOMAIN_SNAPSHOT_REVERT_FORCE for old behavior.
virDomainRevertToSnapshot to go from a running VM to a live checkpoint
with an ABI-incompatible <domain> will now fail by default.
Justification: qemu does not handle ABI incompatibilities, and even if
the 'loadvm' succeeded, this generally resulted in full-scale guest
corruption. Mitigation: use VIR_DOMAIN_SNAPSHOT_REVERT_FORCE to
start a new qemu process that properly conforms to the snapshot's ABI.
virDomainUndefine will now fail to undefine a domain with any snapshots.
Justification: leaving behind libvirt metadata can corrupt future
defines, comparable to recent managed save changes, plus it is a form of
data loss. Mitigation: use virDomainUndefineFlags.
virDomainUndefineFlags will now default to failing an undefine of a
domain with any snapshots. Justification: leaving behind libvirt
metadata can corrupt future defines, comparable to recent managed save
changes, plus it is a form of data loss. Mitigation: separately delete
all snapshots (or at least all snapshot metadata) first, or use
VIR_DOMAIN_UNDEFINE_SNAPSHOTS.
virDomainMigrate/virDomainMigrate2 will now default to fail if the
source has any snapshots. Justification: metadata must be transferred
along with the domain for the migration to be complete. Mitigation:
until an improved migration protocol can automatically do the
handshaking necessary to migrate all the snapshot metadata, a user can
manually loop over each snapshot prior to migration, using
virDomainSnapshotCreateXML with VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE on
the destination, then virDomainSnapshotDelete with
VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY on the source.
Add the following XML:
in <domain>/<devices>/<disk>:
  add optional attribute snapshot='no|external|internal'
  add optional attribute persistent='yes|no'
in <domainsnapshot>:
  expand <domainsnapshot>/<domain> to be full domain, not just uuid
  add <state>disk-snapshot</state>
  add optional <disks>/<disk>, where each <disk> maps back to
  <domain>/<devices>/<disk> and controls how to do external disk snapshots
Add the following flags to existing API:
virDomainSnapshotCreateXML:
  VIR_DOMAIN_SNAPSHOT_CREATE_HALT
  VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY
  VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE
virDomainSnapshotGetXMLDesc:
  VIR_DOMAIN_XML_SECURE
virDomainRevertToSnapshot:
  VIR_DOMAIN_SNAPSHOT_REVERT_START
  VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE
  VIR_DOMAIN_SNAPSHOT_REVERT_FORCE
  VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN
  VIR_DOMAIN_SNAPSHOT_REVERT_DISCARD
virDomainSnapshotDelete:
  VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY
  VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY
virDomainUndefineFlags:
  VIR_DOMAIN_UNDEFINE_SNAPSHOTS
--
Eric Blake eblake@xxxxxxxxxx +1-801-349-2682
Libvirt virtualization library http://libvirt.org