tl;dr: I am working on a series of patches to expose backing chain
information in <domain> XML. Comments are welcome, to make sure my XML
design is on the right track.

Purpose
=======

Among other things, this will help us support Peter's proposal of
enhancing the block-pull and block-commit actions to specify a
destination by relative depth in the backing chain (where "vda[0]"
represents the active image, "vda[1]" represents the backing file of
the active image, and so on).

It will also help debug situations where libvirt and qemu disagree on
what constitutes a backing chain, which in turn causes sVirt labeling
discrepancies or prevents block-pull/block-commit actions. For
example, given the chain "base <- mid <- top", if top lacks the
backing_fmt attribute and /etc/libvirt/qemu.conf has
allow_disk_format_probing=0 (the default, for security reasons),
libvirt treats 'mid' as a raw file and refuses to acknowledge that
'base' is part of the chain, while qemu would happily treat mid as
qcow2 and therefore use 'base' if permissions allow it to. I have
helped debug this scenario several times on IRC and in bugzilla
reports.

This feature is being driven in part by
https://bugzilla.redhat.com/show_bug.cgi?id=1069407

Existing design
===============

Note that libvirt can already expose backing file details (but only
one layer; it is not recursive) when using virStorageVolGetXMLDesc();
for example:

# virsh vol-dumpxml --pool gluster img3
<volume type='network'>
  <name>img3</name>
  <key>vol1/img3</key>
  ...
  <target>
    <path>gluster://localhost/vol1/img3</path>
    <format type='qcow2'/>
    ...
  </target>
  <backingStore>
    <path>gluster://localhost/vol1/img2</path>
    <format type='qcow2'/>
    <permissions>
      <mode>00</mode>
      <owner>0</owner>
      <group>0</group>
    </permissions>
  </backingStore>
</volume>

In the current volume representation, if a <backingStore> element is
present, it gives the <path> to the backing file. But this
representation is a bit limited: it is hard-coded to the assumption
that there is only one backing file, and it does not do a good job
when the backing image is not in the same storage pool as the volume
it is describing. Some of the enhancements I'm proposing for <domain>
should also be applied to the information output in <volume> XML,
which means I have to be careful that the design I'm proposing will
mesh well with the storage XML, to maximize code reuse.

The volume approach is painful for users trying to track the backing
chain of a disk tied to a <domain>, because it requires creating a
storage pool and making multiple calls to follow the chain. So we
need to expose the backing chain directly in the <disk> element of a
domain, and recursively show the entire chain. Furthermore, some
formats require multiple resources: for example, both qemu 2.0's new
quorum driver and HyperV VHDX images can have multiple backing files,
and these files can in turn have more backing images. Thus, any
proper representation of disk resources needs to show a full tree of
relationships. Thankfully, circular references in backing files would
form an invalid image (all known virtual disk image formats require a
DAG of relationships).

With the existing API, we still have not fully implemented 'virsh
snapshot-delete' of external snapshots. So our current advice is for
people to manually use qemu-img to alter backing chains, then update
libvirt to match.
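For instance, a minimal sketch of that manual repair for the
missing-backing_fmt scenario above (assuming base.qcow2, mid.qcow2,
and top.qcow2 sit in the current directory, and a qemu-img new enough
to support --backing-chain):

# Show the chain as qemu itself interprets it:
qemu-img info --backing-chain top.qcow2

# Unsafe rebase: rewrite only the header of top.qcow2 so that it
# records an explicit qcow2 backing format for mid.qcow2, without
# copying any data:
qemu-img rebase -u -b mid.qcow2 -F qcow2 top.qcow2

After such a manual edit, libvirt's view of the chain still has to be
refreshed to match, which is exactly the gap that the tracking
proposed below is meant to close.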
Once libvirt starts tracking backing chains, it becomes all the more
important to provide two new actions in libvirt: a validation mode
(check that what is recorded on disk matches what is recorded in the
XML, and flag an error if they differ) and a correction mode (ignore
what is recorded in the XML and regenerate it to match what is
actually on disk).

Proposal
========

For each <disk> of a domain, I will be adding a new <backingStore>
element. The element is optional on input, which allows libvirt to
continue to understand input from older versions, but it will always
be present on output, to show what libvirt is tracking as the backing
chain.

For a file with no backing store (including raw file format), the
usage is simple:

<disk type='file' device='disk'>
  <driver name='qemu' type='raw'/>
  <source file='/path/to/somewhere'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
</disk>

The new explicit <backingStore/> makes it clear that there is no
backing chain.

A backing chain of 3 files (base <- mid <- top) in the local file
system:

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
    <backingStore type='file'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/base.qcow2'/>
      <backingStore/>
    </backingStore>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>

Note that this is intentionally nested, so that for file formats that
support more than one backing resource, parallel <backingStore>
elements can be listed as siblings to describe those related
resources. This leaves the door open to expose a qemu quorum as a
<disk type='quorum'> with no direct <source>, but instead with three
<backingStore> sibling elements for each member of the quorum, where
each member of the quorum can further have its own backing chain (see
the sketch at the end of this section).

Design-wise, the <backingStore> element is either completely empty
(end-of-chain), or has a mandatory type='...' attribute that mirrors
the same type attribute of a <disk>. Then, within the backingStore
element, there is a <source> or other appropriate sub-elements,
similar to what <disk> already uses for describing a single host
resource. For example, here is the output for a 2-element chain on
gluster:

<disk type='network' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source protocol='gluster' name='vol1/img2'>
    <host name='red'/>
  </source>
  <backingStore type='network'>
    <driver name='qemu' type='qcow2'/>
    <source protocol='gluster' name='vol1/img1'>
      <host name='red'/>
    </source>
    <backingStore/>
  </backingStore>
  <target dev='vdb' bus='virtio'/>
</disk>

Or again, but this time using volume references to a storage pool
(assuming 'glusterVol1' is the storage pool wrapping
gluster://red/vol1):

<disk type='volume' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source pool='glusterVol1' volume='img2'/>
  <backingStore type='volume'>
    <driver name='qemu' type='qcow2'/>
    <source pool='glusterVol1' volume='img1'/>
    <backingStore/>
  </backingStore>
  <target dev='vdb' bus='virtio'/>
</disk>

As can be seen, this design heavily reuses the existing
<disk type='...'> handling, which should make it easier to reuse
blocks of code, both in libvirt to handle the backing chains, and in
clients when processing backing chains to hand to libvirt up front or
when inspecting dumpxml results.
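As a purely hypothetical illustration of the quorum idea mentioned
above (nothing here is implemented yet; type='quorum' and the exact
element layout are assumptions that merely follow the nesting rules
just described), such a disk might look like:

<disk type='quorum' device='disk'>
  <!-- no direct <source>; each quorum member is a sibling
       <backingStore>, and may carry its own nested chain -->
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/member1.qcow2'/>
    <backingStore/>
  </backingStore>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/member2.qcow2'/>
    <backingStore/>
  </backingStore>
  <backingStore type='network'>
    <driver name='qemu' type='qcow2'/>
    <source protocol='gluster' name='vol1/member3'>
      <host name='red'/>
    </source>
    <backingStore/>
  </backingStore>
  <target dev='vdc' bus='virtio'/>
</disk>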
Management apps like vdsm that use transient domains should start
supplying <backingStore> elements to fully describe chains.

Implementation
==============

The following APIs will be affected:

defining domain XML (whether via define for persistent domains, or
create for transient domains): parse the new element. If the element
is already present, default to trusting the backing chain in that
element instead of reading from the disk files. If the element is
absent, read the disk files and populate the element. It is probably
also worth adding a flag to trigger validation mode: read the disk
files to ensure they match the XML, and refuse the operation if there
is a mismatch. (As for updating the XML to match reality, the
simplest approach is to edit the XML, delete the <backingStore>
element, and try the define again, so I don't see the need for a flag
for that action.) I may also need to figure out whether it is worth
tainting a domain any time libvirt detects that the backing chain in
the XML has diverged from the backing chain read from the disk files.
Note that defining domain XML includes loading from saved state or
from incoming migration.

dumping domain XML: always output the new element, by default without
consulting disk files. By tracking the chain in memory ever since the
guest is defined, it should already be available for output. I'm
debating whether we need a flag (similar to virsh dumpxml
--update-cpu) that can force libvirt to re-read the disk files at the
time of the dump and regenerate the chain to match the reality of any
changes made behind libvirt's back.

creating external snapshots: the <domainsnapshot> XML will continue
to be the picture of the domain prior to the creation of the snapshot
(but this picture will now include any <backingStore> elements
already present in the chain); after the snapshot is taken, the
<domain> XML will also be modified to record the updated chain (the
old disk source is now the <backingStore> of the new disk source;
see the before/after sketch at the end of this list).

deleting external snapshots is not yet implemented, but the
implementation will have to shrink the backingStore chain to match
reality.

block-pull (block-rebase in pull mode), block-commit: at the
completion of the pull, the <backingStore> needs to be updated to
reflect the new, shorter state of the chain.

block-copy (block-rebase in copy mode): the operation starts out by
creating a mirror, but during the first phase, the mirror is not
usable as an accurate copy of what the guest sees. Right now we fudge
by saying that block copy can only be done on transient domains; but
even with that, we still track a <mirror> element in the <disk> XML
to record that a block copy is underway (so that the operation
survives a libvirtd restart). The <mirror> element will now need to
be taught a <backingStore>, particularly if the user passes in a
pre-existing file to be reused as the copy destination (sketched
below).
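To make the external-snapshot item above concrete, here is a
hypothetical before/after sketch (file names are illustrative), where
a new external file snap.qcow2 is created on top of base.qcow2:

Before the snapshot:

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/base.qcow2'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
</disk>

After the snapshot:

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/snap.qcow2'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/base.qcow2'/>
    <backingStore/>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>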
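And for the block-copy item, a speculative sketch of a <disk> in the
middle of a copy, where <mirror> has been taught a chain describing a
pre-existing destination file (the exact shape of <mirror> here is an
assumption, not settled design):

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/base.qcow2'/>
    <backingStore/>
  </backingStore>
  <mirror file='/var/lib/libvirt/images/copy.qcow2' format='qcow2'>
    <!-- chain of the pre-existing file reused as the copy
         destination -->
    <backingStore type='file'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/copy-base.qcow2'/>
      <backingStore/>
    </backingStore>
  </mirror>
  <target dev='vda' bus='virtio'/>
</disk>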
Then, when the second phase is complete and the mirroring is ended,
the <disk> will need another update to select which side of the
backing chain is now in force.

virsh domblklist: should be taught a new flag to show the backing
chain in a tree format, since the command already exists to extract
<disk> information from a domain into a nicer human format.

sVirt security labeling: right now, we read the disk files to both
apply and remove labels on a backing chain; obviously, once the chain
is tracked natively as part of the <disk>, we should be able to label
without having to read the disk files.

storage volumes: investigate how much of the backing chain code can
be reused in enhancing storage volume XML output.

Anything else you can think of in the code base that will be
impacted?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org