Here's what I'm planning on implementing for libvirt 0.9.11 to support
oVirt's desire to do live block migration, and built on top of qemu
1.1's new 'transaction' QMP monitor command. Comments are welcome
before I actually post patches.

Background
==========

Here is oVirt's description of mirrored live storage migration:

http://www.ovirt.org/wiki/Features/Design/StorageLiveMigration

The idea is that at all points in time, at least one storage domain has
a consistent view of all data in use by the guest. That way, if
something fails and has to be restarted, oVirt can tell libvirt to
create a new transient domain that points to the storage domain with
consistent data, and restart the migration process, rather than the
post-copy approach that would spread data across two storage domains at
once.

For more background, here is the qemu feature page for the
'transaction' monitor command; that wiki page includes a section which
summarizes the impacts to libvirt as proposed in this email:

http://wiki.qemu.org/Features/SnapshotsMultipleDevices

One of the goals of this proposal is to add mirrored live block
migration without adding any new API, so that the feature can be
backported to any distro that ships with the API in libvirt 0.9.10.

My proposals for libvirt 0.9.11
===============================

Libvirt will probe qemu to see if it knows the 'transaction' monitor
command, and set a bit in qemuCaps accordingly.

virDomainSnapshotCreateXML will learn a new flag:
VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC. If this flag is present, then
libvirt guarantees that the snapshot operation will either succeed, or
that failure will be reported without changing domain XML or qemu
runtime state. If present, the creation API will fail if qemu lacks the
'transaction' command and more than one disk snapshot was requested in
the <domainsnapshot> XML.
If this flag is not present, then libvirt will use 'transaction' if
available, but fall back to 'blockdev-snapshot-sync', so that it works
with older qemu; in that case the caller has to check
virDomainGetXMLDesc on failure to see if a partial snapshot occurred.
This flag will be implied by any other part of the API that requires
the use of 'transaction'.

The VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT flag was added to
virDomainSnapshotCreateXML in 0.9.10, with semantics that it would stop
libvirt from complaining if a regular file already existed as the
snapshot destination, but without interacting with qemu, which would
blindly overwrite the contents of that file. Since this flag is
relatively new, and has not had much use, I propose to slightly alter
its documented semantics to now interact with the qemu 1.1 feature
being added as part of 'transaction'. If qemu supports 'transaction',
then presence of this flag implies that libvirt will explicitly request
'mode':'existing' for each snapshot, which tells qemu to open the
existing file without writing any new metadata, and that the caller is
responsible for ensuring that the file has identical guest contents
(generally by creating a qcow2 file with the current file as backing
image and no additional contents). Additionally, libvirt will now
require the file to already exist (in 0.9.10, libvirt silently ignored
the flag if it was requested but the file did not exist). Presence of
the flag without qemu support for 'transaction' will now fail (that is,
VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT will now imply
VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC). Absence of the flag means that
libvirt will rely on qemu's default of 'mode':'absolute-paths', and
will require that the file does not exist as a regular file; this maps
to qemu 1.0 always writing a new qcow2 header with absolute backing
file name.
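
To make the interaction between the two creation flags easier to
review, here is a rough sketch of the decision logic described above.
It is not libvirt code: the SKETCH_* constants, struct sketch_plan and
sketch_plan_snapshot() are hypothetical names invented for
illustration, and the flag bit values are placeholders rather than the
real VIR_DOMAIN_SNAPSHOT_CREATE_* values. The file-existence checks are
left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder flag bits for this sketch only; real code would use the
 * VIR_DOMAIN_SNAPSHOT_CREATE_* constants from libvirt's headers. */
enum {
    SKETCH_CREATE_REUSE_EXT = 1 << 0,
    SKETCH_CREATE_ATOMIC    = 1 << 1,
};

/* Hypothetical result of planning one snapshot request. */
struct sketch_plan {
    bool ok;              /* false: fail before touching qemu */
    bool use_transaction; /* true: 'transaction';
                             false: 'blockdev-snapshot-sync' per disk */
    const char *mode;     /* effective 'mode' for each snapshot */
};

struct sketch_plan
sketch_plan_snapshot(unsigned int flags, bool qemu_has_transaction,
                     size_t ndisks)
{
    struct sketch_plan plan = { false, false, NULL };

    /* REUSE_EXT implies ATOMIC under the proposed semantics. */
    if (flags & SKETCH_CREATE_REUSE_EXT)
        flags |= SKETCH_CREATE_ATOMIC;

    if (!qemu_has_transaction) {
        /* 'mode':'existing' only arrives together with 'transaction'. */
        if (flags & SKETCH_CREATE_REUSE_EXT)
            return plan;
        /* Atomicity across several disks needs 'transaction'. */
        if ((flags & SKETCH_CREATE_ATOMIC) && ndisks > 1)
            return plan;
        /* Fallback for older qemu: new header, absolute backing name. */
        plan.ok = true;
        plan.mode = "absolute-paths";
        return plan;
    }

    plan.ok = true;
    plan.use_transaction = true;
    /* 'absolute-paths' is qemu's default, so it need not be sent. */
    plan.mode = (flags & SKETCH_CREATE_REUSE_EXT) ? "existing"
                                                  : "absolute-paths";
    return plan;
}
```

For example, under these assumptions a REUSE_EXT request against a qemu
without 'transaction' comes back with ok == false, while a request with
no flags for three disks against the same qemu is allowed and falls
back to one 'blockdev-snapshot-sync' per disk.
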
If we want to later expose additional modes, like 'no-backing-file', it
would be done via per-<disk> annotations in the <domainsnapshot> XML
rather than via new flags, but for this proposal, I think oVirt is okay
using the flag to set a single policy for all disks mentioned in a
given snapshot request.

virDomainSnapshotCreateXML's xml argument, <domainsnapshot>, will learn
an optional <mirror> sub-element to each <disk>. While the
'transaction' command supports multiple mirrors in one transaction, for
now, libvirt will enforce at most one mirror, which should be
sufficient for oVirt's needs. (Adding more support for the rest of the
power of 'transaction' is probably best left for new libvirt API, but
that's outside the scope of this proposal.) As an example,

  <domainsnapshot>
    <disks>
      <disk name='/src/base.img' snapshot='external'>
        <source file='/src/snap.img'/>
        <mirror file='/dest/snap.img'/>
      </disk>
    </disks>
  </domainsnapshot>

would create a new libvirt snapshot object with /src/snap.img as the
read-write new image, and /dest/snap.img as the new write-only mirror.
On success, this rewrites the domain's live XML to point to
/src/snap.img as its current file.

Finally, virDomainSnapshotDelete will learn a new flag,
VIR_DOMAIN_SNAPSHOT_DELETE_REOPEN_MIRROR, which says that the libvirt
snapshot object will be deleted, but only after first calling the qemu
'drive-reopen' monitor command for all disks that had a <mirror> in the
associated snapshot object. That is, for the above example, this would
reopen the disk from its current read-write of /src/snap.img over to
the second storage domain's /dest/snap.img with its accompanying
mirrored backing chain. On success, this rewrites the domain's live XML
to point to the just-opened mirror location. This flag will fail if the
libvirt snapshot being deleted is not the current image, or if the
snapshot being deleted does not have any mirrored disks.
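
The <mirror> rules above can be sketched as a few small checks. This is
only an illustration under stated assumptions: struct sketch_disk and
the sketch_* functions are hypothetical and do not exist in libvirt,
and I am reading "at most one mirror" as a limit per snapshot request.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory view of one <disk> of a <domainsnapshot>. */
struct sketch_disk {
    const char *name;   /* e.g. "/src/base.img" */
    const char *source; /* new read-write image, e.g. "/src/snap.img" */
    const char *mirror; /* write-only mirror, or NULL without <mirror> */
};

/* Count the disks that carry a <mirror> sub-element. */
size_t sketch_count_mirrors(const struct sketch_disk *disks, size_t ndisks)
{
    size_t n = 0;

    for (size_t i = 0; i < ndisks; i++)
        if (disks[i].mirror)
            n++;
    return n;
}

/* Creation: reject requests with more than one mirror (assumed to be
 * a per-request limit). */
bool sketch_create_allowed(const struct sketch_disk *disks, size_t ndisks)
{
    return sketch_count_mirrors(disks, ndisks) <= 1;
}

/* Deletion with REOPEN_MIRROR: the snapshot must be the current one
 * and must have at least one mirrored disk. */
bool sketch_reopen_allowed(const struct sketch_disk *disks, size_t ndisks,
                           bool is_current)
{
    return is_current && sketch_count_mirrors(disks, ndisks) > 0;
}

/* Path the live XML should name once 'drive-reopen' has succeeded;
 * disks without a mirror keep their read-write image. */
const char *sketch_path_after_reopen(const struct sketch_disk *disk)
{
    return disk->mirror ? disk->mirror : disk->source;
}
```

With the example XML above, the single disk would pass the creation
check, and after a successful reopen its live XML path would become
/dest/snap.img.
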

Conclusion
==========

Back to the oVirt diagram, the transition from step 1 to 2 is done by
oVirt; the transition from step 2 to 3 is done by oVirt pre-creating
Snapshot 2 on storage domain 2 with a backing file of a relative
pathname to Snapshot 1, then creating a new libvirt snapshot with:

  snap = virDomainSnapshotCreateXML(dom,
             "<!-- XML with a <mirror> element for the migrated disk(s) -->...",
             VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY |
             VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT);

(VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC will be implied, since <mirror>
requires it, but can be provided for clarity; oVirt may also wish to
use VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE, although that is not strictly
necessary and would only work if a guest agent is present).

Then, the transition from step 3 to 4 is done by oVirt copying Snapshot
1 in the background, and the transition from step 4 to 5 is done by
oVirt calling:

  virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_REOPEN_MIRROR);

at which point the running qemu will be using the full image chain
located completely on storage 2, with libvirt having updated the domain
XML to reflect the new path name, and with the libvirt snapshot object
no longer present since the migration is complete.

If oVirt then desires to cut Snapshot 1 out of the backing chain, and
have Snapshot 2 backed directly by the Base volume, then oVirt would
call:

  virDomainBlockRebase(dom, "...disk", "Base", 0, 0)

to trigger a 'block_stream' monitor command that resets Snapshot 2 to
directly use Base as its backing file (effectively merging Snapshot 1
into Snapshot 2).

-- 
Eric Blake   eblake@xxxxxxxxxx    +1-919-301-3266
Libvirt virtualization library http://libvirt.org