This is the counter-proposal to my earlier RFC for storage migration via snapshot mirrors[1], resulting from a NACK on the code review for that earlier proposal[2].  In particular, this proposal fleshes out some of Paolo's design overview on the qemu wiki[3].

[1] https://www.redhat.com/archives/libvir-list/2012-March/msg00578.html
[2] https://www.redhat.com/archives/libvir-list/2012-March/msg01033.html
[3] http://wiki.qemu.org/Features/SnapshotsMultipleDevices

My plan is to have everything in this RFC coded up in the next couple of days (hopefully no later than Thursday); this has missed the feature freeze for 0.9.11, so it should not be applied upstream until after the weekend release, as one of the first patches for 0.9.12.  Backport-wise, the new flags can be backported as far back as the 0.9.10 .so API, but the new virDomainBlockCopy() API cannot be exported when doing a backport without breaking .so versions (although its implementation can be used internally).

Additions
=========

The following new error code will be added:

  VIR_ERR_BLOCK_COPY_IN_PROGRESS

The following new API will be added:

  int virDomainBlockCopy(virDomainPtr dom, const char *disk,
                         const char *base, const char *dest,
                         const char *format, unsigned long bandwidth,
                         unsigned int flags);

The following new named values will be added:

  enum virDomainBlockJobType (used in virDomainBlockJobInfo):
    VIR_DOMAIN_BLOCK_JOB_TYPE_COPY = 2

The following new flags will be added:

  for virDomainBlockRebase:
    VIR_DOMAIN_BLOCK_REBASE_SHALLOW = 1 << 0
    VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT = 1 << 1
    VIR_DOMAIN_BLOCK_REBASE_COPY = 1 << 2

  for virDomainBlockCopy:
    VIR_DOMAIN_BLOCK_COPY_SHALLOW = 1 << 0
    VIR_DOMAIN_BLOCK_COPY_REUSE_EXT = 1 << 1

  for virDomainBlockJobAbort:
    VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT

Add some XML:

  Under //domain/devices/disk, next to <source file='...'/>, add <mirror file='...'/>

Semantics
=========

virDomainBlockCopy sets up a BLOCK_JOB_TYPE_COPY job.
'disk' names the disk to be copied (can be 'vda' or '/path/to/source', as with other block commands) and must not be NULL.

'base' names the path to the backing file in the chain of the source that will be the new backing file of the destination; if this parameter is NULL, then the destination file defaults to a complete block pull, but the COPY_SHALLOW flag instead requests a pull of just the top file in the source backing chain.

'dest' names the copy being created and must not be NULL; normally, this file is created by the hypervisor/libvirt, but the COPY_REUSE_EXT flag lets an application pass in a pre-created file (allowing metadata to include a relative instead of absolute backing file name).

'format' gives the format of the copy, or NULL to either probe the format of a COPY_REUSE_EXT dest or to reuse the same format as the source.

'flags' cannot contain COPY_SHALLOW unless 'base' is NULL.

Once a block copy job is started, calls to virDomainGetBlockJobInfo() for the same 'disk' will report VIR_DOMAIN_BLOCK_JOB_TYPE_COPY as the job type.  This job never completes on its own, but must be stopped by the user (this enables mirroring to continue until the user informs libvirt that any backing files, perhaps located at different locations as specified by relative path names using REUSE_EXT, have been externally copied into place).

There are two phases to a TYPE_COPY job.  In the first phase, cur < end when querying progress; calls to virDomainBlockJobAbort(dom, disk, 0) will cancel the operation and revert to the source, and calls to virDomainBlockJobAbort(dom, disk, VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT) will fail with VIR_ERR_BLOCK_COPY_IN_PROGRESS.  In the second phase, cur == end when querying progress; calls to virDomainBlockJobAbort(dom, disk, 0) will break the mirroring and revert to the source, while calls to virDomainBlockJobAbort(dom, disk, VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT) will break the mirroring and pivot to the destination.
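
To make the two phases concrete, here is a minimal sketch of the decision table for virDomainBlockJobAbort, including the rule in this RFC that the PIVOT flag is rejected on non-copy jobs.  It does not touch libvirt at all: every name starting with sketch_ or SKETCH_ is a stand-in invented for illustration, and the value given to the pivot flag is an arbitrary placeholder, since this RFC does not assign one.

```c
#include <assert.h>

/* Stand-ins for names in this RFC; nothing here comes from libvirt headers. */
enum sketch_job_type { SKETCH_JOB_PULL = 1, SKETCH_JOB_COPY = 2 };

/* Placeholder value only; the RFC leaves the real value unspecified. */
enum { SKETCH_ABORT_PIVOT = 1 << 1 };

enum sketch_outcome {
    SKETCH_JOB_CANCELLED,        /* ordinary abort of a non-copy job */
    SKETCH_REVERT_TO_SOURCE,     /* copy dropped, guest stays on the source */
    SKETCH_PIVOT_TO_DEST,        /* mirroring broken, guest moves to the copy */
    SKETCH_ERR_INVALID_ARG,      /* models VIR_ERR_INVALID_ARG */
    SKETCH_ERR_COPY_IN_PROGRESS  /* models VIR_ERR_BLOCK_COPY_IN_PROGRESS */
};

/* Decide what an abort request does, given job type, progress, and flags. */
enum sketch_outcome
sketch_abort_outcome(enum sketch_job_type type,
                     unsigned long long cur, unsigned long long end,
                     unsigned int flags)
{
    int pivot = (flags & SKETCH_ABORT_PIVOT) != 0;

    assert(cur <= end);

    /* PIVOT only makes sense for a copy job. */
    if (type != SKETCH_JOB_COPY)
        return pivot ? SKETCH_ERR_INVALID_ARG : SKETCH_JOB_CANCELLED;

    /* Phase one: the destination is still incomplete, so pivoting is refused. */
    if (cur < end)
        return pivot ? SKETCH_ERR_COPY_IN_PROGRESS : SKETCH_REVERT_TO_SOURCE;

    /* Phase two: source and destination are in sync; either side may win. */
    return pivot ? SKETCH_PIVOT_TO_DEST : SKETCH_REVERT_TO_SOURCE;
}
```

For instance, sketch_abort_outcome(SKETCH_JOB_COPY, 10, 100, SKETCH_ABORT_PIVOT) yields SKETCH_ERR_COPY_IN_PROGRESS, while the same call with cur == end yields SKETCH_PIVOT_TO_DEST.
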

Use of VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT on a non-copy block job will fail with VIR_ERR_INVALID_ARG.

virDomainBlockRebase(dom, disk, dest, bandwidth, VIR_DOMAIN_BLOCK_REBASE_COPY | (flags & 3)) is shorthand for virDomainBlockCopy(dom, disk, NULL, dest, NULL, bandwidth, (flags & 3)).  That is, use of the REBASE_COPY flag treats the BlockRebase 'base' argument as the BlockCopy 'dest' argument, creates the destination with the same file format as the source (or probes the format of the pre-existing destination if REBASE_REUSE_EXT is used), and passes the COPY_SHALLOW and COPY_REUSE_EXT flags through (note that the similarly named flags were conveniently chosen to be the same values).  Attempts to use REBASE_SHALLOW or REBASE_REUSE_EXT without also using REBASE_COPY will fail with VIR_ERR_INVALID_ARG.

While a copy operation is in place, virDomainGetXMLDesc (dumpxml) will show the <mirror> element for that <disk>.

Initial Implementation
======================

When virDomainBlockCopy is called (perhaps via the virDomainBlockRebase alias), it first sets up a mirror using the 'drive-mirror' monitor command and the destination file name.  The mirror is opened with the 'existing' mode if _REUSE_EXT is present; otherwise it is opened with the 'absolute-paths' mode if _SHALLOW is present or 'no-backing-file' mode if no flags are present.  Next, the function calls the 'block_stream' monitor command to start the streaming.  The streaming command uses 'base' as its starting point, except that when _SHALLOW was specified, libvirt will use the backing file of the source disk, rather than NULL (this can be obtained by the 'query-block' monitor command; although someday libvirt should start tracking this information in <domain> XML rather than relying on qemu).  At this point, control returns to the user, and the stream proceeds in the background; virDomainBlockJobSetSpeed can tune the speed of the block streaming.
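
The flag handling described above can be sketched as three small helpers: one translating BlockRebase flags into BlockCopy flags, one picking the 'drive-mirror' mode, and one picking the starting point handed to 'block_stream'.  This is only an illustration under stated assumptions: the sketch_ functions and the unprefixed flag names are hypothetical stand-ins (the flag values are the ones listed in this RFC), and the mode choice follows the text literally, so a call with a non-NULL 'base' and no flags still lands on 'no-backing-file'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-ins using the flag values proposed in this RFC. */
enum {
    REBASE_SHALLOW   = 1 << 0,
    REBASE_REUSE_EXT = 1 << 1,
    REBASE_COPY      = 1 << 2,
    COPY_SHALLOW     = 1 << 0,
    COPY_REUSE_EXT   = 1 << 1
};

/* Returns 1 for a copy request (copy_flags filled in), 0 for a plain
 * rebase, and -1 where the RFC calls for VIR_ERR_INVALID_ARG. */
int sketch_rebase_to_copy_flags(unsigned int rebase_flags,
                                unsigned int *copy_flags)
{
    if (!(rebase_flags & REBASE_COPY)) {
        if (rebase_flags & (REBASE_SHALLOW | REBASE_REUSE_EXT))
            return -1;
        return 0;
    }
    /* The low two bits match by design, so they pass straight through. */
    *copy_flags = rebase_flags & 3;
    return 1;
}

/* Mode for 'drive-mirror': REUSE_EXT wins, then SHALLOW, else no backing. */
const char *sketch_mirror_mode(unsigned int copy_flags)
{
    if (copy_flags & COPY_REUSE_EXT)
        return "existing";
    if (copy_flags & COPY_SHALLOW)
        return "absolute-paths";
    return "no-backing-file";
}

/* Starting point for 'block_stream'.  source_backing is the backing file
 * of the active image, e.g. as learned from 'query-block'. */
const char *sketch_stream_base(const char *base, const char *source_backing,
                               unsigned int copy_flags)
{
    /* The RFC forbids COPY_SHALLOW together with a non-NULL base. */
    assert(!(copy_flags & COPY_SHALLOW) || base == NULL);

    return (copy_flags & COPY_SHALLOW) ? source_backing : base;
}
```

With the chain base <- snap1 <- snap2 <- snap3, a shallow copy would call sketch_stream_base(NULL, "/path/to/snap2", COPY_SHALLOW) and stream only the contents of snap3.
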

At least in the initial implementation, as long as the block job is active, libvirt will prevent 'virDomainMigrate', 'virDomainSave', 'virDomainSnapshotCreateXML' in general, and 'virDomainDetachDevice' of the disk being mirrored, all with the new VIR_ERR_BLOCK_COPY_IN_PROGRESS error.  This is because I don't have an easy way to resume mirroring when restarting a new qemu process; preventing these actions until the user first cancels the ongoing mirroring will result in fewer corner cases that libvirt has to worry about.  This also implies that the initial implementation will fail for persistent domains, and only be useful for transient domains.  Attempts to define a domain with a <mirror> element are rejected, leaving <mirror> as output-only XML useful in restoring state when restarting libvirtd.  It may be possible to add persistent support in the future, once we determine how to make qemu resume a mirrored block device; at that point, it would be possible to specify <mirror> in domain xml during domain creation or during device hotplug.

When the block stream finishes, qemu will send an event to libvirt (libvirt will also have to manually check for completion on a libvirtd restart, based on whether cur == end in the block job info).  I'm not yet sure whether to expose this event to the user so that they do not have to poll the block job info, or whether to consume it internally.  At any rate, before this event occurs, the BLOCK_JOB_ABORT_PIVOT flag is rejected, and virDomainBlockJobAbort without flags uses the 'block_job_cancel' monitor command to stop the streaming early, then the 'drive-reopen' monitor command to break the mirroring back to the source; it is feasible that there is a race where a 'block_job_cancel' can be called after the pull is complete but before the completion event has been processed, so the code must proceed on to the 'drive-reopen' even if the job cancel fails.
After the event occurs, virDomainBlockJobAbort only needs to use the 'drive-reopen' monitor command, with either the source or the destination file depending on the BLOCK_JOB_ABORT_PIVOT flag.

Until 'drive-reopen' is made atomic in qemu (by adding code to support it inside 'transaction'), the user risks a block job abort rendering the disk unusable, because the source was closed before the destination was opened; hopefully this situation is rare, in part because libvirt will do stat() checks and SELinux labeling on destination files before starting qemu monitor commands, as a sanity check that qemu will be able to use the specified files.  If qemu ever adds atomic 'drive-reopen' support, we can add a new flag BLOCK_JOB_ABORT_ATOMIC that fails on older qemu, and ensures the use of 'transaction' on the newer qemu that supports an atomic reopen.

If a block mirror is aborted (whether by the user calling virDomainBlockJobAbort with no flag, or by the qemu process ending due to things like a guest-initiated shutdown), then the mirror can be safely discarded; a restarted domain will come up unmirrored, at which point virDomainBlockRebase can be called again from scratch.
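
As a rough sketch of the abort path at the monitor level, the code below first plans which commands an abort needs and then runs them through a caller-supplied monitor callback, deliberately ignoring a failed 'block_job_cancel' so that the race described above still ends in 'drive-reopen'.  Everything named sketch_ or PLAN_ is hypothetical, the pivot flag value is a placeholder, and the fake monitor exists only to exercise the race; none of this is the real qemu driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { PLAN_PIVOT = 1 << 1 };   /* placeholder value for the PIVOT flag */

struct sketch_abort_plan {
    int send_cancel;            /* issue 'block_job_cancel' before reopening? */
    const char *reopen_target;  /* file handed to 'drive-reopen' */
};

/* Monitor callback: returns 0 on success, -1 on failure. */
typedef int (*sketch_monitor_fn)(const char *command, const char *arg);

/* Fill in the plan; returns -1 when PIVOT arrives before the stream is done
 * (the VIR_ERR_BLOCK_COPY_IN_PROGRESS case). */
int sketch_plan_abort(int stream_done, unsigned int flags,
                      const char *source, const char *dest,
                      struct sketch_abort_plan *plan)
{
    int pivot = (flags & PLAN_PIVOT) != 0;

    if (!stream_done && pivot)
        return -1;
    plan->send_cancel = !stream_done;
    plan->reopen_target = pivot ? dest : source;
    return 0;
}

/* Run the plan.  A failed cancel is ignored on purpose: the job may have
 * finished between our last check and the cancel request. */
int sketch_run_abort(const struct sketch_abort_plan *plan,
                     sketch_monitor_fn send)
{
    assert(plan != NULL && plan->reopen_target != NULL && send != NULL);

    if (plan->send_cancel)
        (void)send("block_job_cancel", NULL);
    return send("drive-reopen", plan->reopen_target);
}

/* Fake monitor for illustration: cancel always fails, as in the race. */
static int sketch_fake_calls;
static int sketch_fake_monitor(const char *command, const char *arg)
{
    (void)arg;
    sketch_fake_calls++;
    return strcmp(command, "block_job_cancel") == 0 ? -1 : 0;
}
```

Keeping the planning separate from the execution means the flag checks can be done before any monitor command is sent, which matches the intent of doing sanity checks up front while 'drive-reopen' is not yet atomic.
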

Examples
========

For some examples, starting with base <- snap1 <- snap2 <- snap3 as the backing chain for disk 'vda':

  virDomainBlockCopy(dom, "vda", NULL, "/path/to/copy", NULL, 0, 0)

would set up a job that results in /path/to/copy being the same file format as snap3, but containing the entire chain.

  virDomainBlockCopy(dom, "vda", "/path/to/snap1", "/path/to/copy", "qed", 0, 0)

would set up a job that results in base <- snap1 <- copy as the mirrored backing chain, ensuring that copy is formatted as qed regardless of the format of snap3.

  virDomainBlockCopy(dom, "vda", NULL, "/path/to/copy", NULL, 0, VIR_DOMAIN_BLOCK_COPY_SHALLOW)

is shorthand for

  virDomainBlockCopy(dom, "vda", "/path/to/snap2", "/path/to/copy", NULL, 0, 0)

and results in base <- snap1 <- snap2 <- copy, creating copy with the same format as snap3.

  virDomainBlockCopy(dom, "vda", "/path/to/snap2", "/path/to/copy", NULL, 0, VIR_DOMAIN_BLOCK_COPY_REUSE_EXT)

requires /path/to/copy to already exist, probes it for its existing format (which might be different from snap3), and proceeds to mirror everything so that snap2 is the base of copy (and the user is at fault if the pre-existing file doesn't call out a backing file that happens to be identical in content to snap2).

oVirt will probably use the sequence:

- use qemu-img to create an empty qcow2 file with relative backing name to the destination storage
- call virDomainBlockRebase(dom, disk, "/path/to/copy", 0, VIR_DOMAIN_BLOCK_REBASE_COPY | VIR_DOMAIN_BLOCK_REBASE_SHALLOW | VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT)
- copy the base files from source to destination storage (this can be done in parallel, either before or after the virDomainBlockRebase call)
- wait for the block pull to finish (either by waiting for an event if I propagate the event, or by polling virDomainGetBlockJobInfo, or even by polling virDomainBlockJobAbort(dom, disk, VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT) and checking for VIR_ERR_BLOCK_COPY_IN_PROGRESS)
- once both the base files are in place and the block pull half of the copy job is complete (and without regard to whether the block stream or the external base file copying completed first), call virDomainBlockJobAbort(dom, disk, VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT) to reopen to the new storage domain chain

Comparison to first RFC
=======================

This proposal exposes only one disk at a time, while the earlier virDomainSnapshotCreateXML <mirror> approach could atomically set up mirroring on multiple disks.  However, the nature of block jobs being a background process means that parallel jobs can be run on independent disks, so the user can do the overall block migration with the time cost of the slowest disk, rather than having to do things serially with the time cost of all disks added together.

This proposal avoids having to create an intermediate snapshot, so the pull is more efficient and the source chain does not get longer, no matter how many times the process is aborted and restarted.

This proposal can expose the no-backing-file mode, while the snapshot approach did not.

To survive across libvirtd restarts, the snapshot approach was using <domainsnapshot> to store the mirroring status in a user-visible location; this approach has to modify the internal live xml (alongside other internal data, such as the qemu pid).  Or perhaps I can add a <mirror> subelement to <disk> of a domain and make it user-visible after all, and treat that as an output-only parameter for now.

Both approaches face the dilemma of how to start a new qemu process with mirroring intact, and my solution in both patch series will be to prevent any action that would force libvirt to save domain state until after the user has first canceled all current mirroring jobs.  This limitation is not permanent: if future qemu provides better ways to restart mirroring, and as libvirt is taught to store the full backing chain in <domain> xml instead of probing it on the fly, we can relax this restriction.

-- 
Eric Blake   eblake@xxxxxxxxxx    +1-919-301-3266
Libvirt virtualization library http://libvirt.org