----- Original Message -----
> From: "Eric Blake" <eblake@xxxxxxxxxx>
> To: "Francesco Romani" <fromani@xxxxxxxxxx>
> Cc: libvir-list@xxxxxxxxxx, "Nir Soffer" <nsoffer@xxxxxxxxxx>,
>     "Peter Krempa" <pkrempa@xxxxxxxxxx>, qemu-devel@xxxxxxxxxx
> Sent: Friday, May 22, 2015 6:33:01 AM
> Subject: Re: RFC: exposing qemu's block-set-write-threshold
>
> [adding qemu]
>
> > I read the thread and I'm pretty sure this will be a silly question,
> > but I want to make sure I am on the same page and that I'm not
> > somehow confused by the terminology.
> >
> > Let's consider the simplest of the situations we face in oVirt:
> >
> > (thin provisioned qcow2 disk on LV)
> >
> > vda=[format=qcow2] -> lv=[path=/dev/mapper/$UUID]
> >
> > Isn't the LV here the 'backing file' (actually, backing block
> > device) of the disk?
>
> Restating what you wrote into libvirt terminology, I think this means
> that you have a <disk> where:
>   <driver> is qcow2
>   <source> is a local file name
>   <device> names vda
>   <backingStore index='1'> describes the backing LV:
>     <driver> is also qcow2 (as polling allocation growth in order to
>     resize on demand only makes sense for qcow2 format)
>     <source> is /dev/mapper/$UUID

Yes, exactly my point. I just want to be 100% sure that the (slightly)
different parlances of the three groups (oVirt/libvirt/QEMU) are
aligned on the same meaning, and that nothing gets lost in translation.

For the final confirmation, here's the actual XML we produce:

<disk device="disk" snapshot="no" type="block">
  <address bus="0x00" domain="0x0000" function="0x0" slot="0x05" type="pci"/>
  <source dev="/rhev/data-center/00000002-0002-0002-0002-00000000014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5"/>
  <target bus="virtio" dev="vda"/>
  <serial>ee1295ee-7ddc-4030-be5e-4557538bc4d2</serial>
  <boot order="1"/>
  <driver cache="none" error_policy="stop" io="native" name="qemu" type="qcow2"/>
</disk>

For the sake of completeness:

$ ls -lh /rhev/data-center/00000002-0002-0002-0002-00000000014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5
lrwxrwxrwx. 1 vdsm kvm 78 May 22 08:49 /rhev/data-center/00000002-0002-0002-0002-00000000014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5 -> /dev/12f68692-2a5a-4e48-af5e-4679bca7fd44/05a88a94-5bd6-4698-be69-39e78c84e1a5

$ ls -lh /dev/12f68692-2a5a-4e48-af5e-4679bca7fd44/
total 0
lrwxrwxrwx. 1 root root 8 May 22 08:49 05a88a94-5bd6-4698-be69-39e78c84e1a5 -> ../dm-11
lrwxrwxrwx. 1 root root 8 May 22 08:49 54673e6d-207d-4a66-8f0d-3f5b3cda78e5 -> ../dm-12
lrwxrwxrwx. 1 root root 9 May 22 08:49 ids -> ../dm-606
lrwxrwxrwx. 1 root root 9 May 22 08:49 inbox -> ../dm-607
lrwxrwxrwx. 1 root root 9 May 22 08:49 leases -> ../dm-605
lrwxrwxrwx. 1 root root 9 May 22 08:49 master -> ../dm-608
lrwxrwxrwx. 1 root root 9 May 22 08:49 metadata -> ../dm-603
lrwxrwxrwx. 1 root root 9 May 22 08:49 outbox -> ../dm-604

$ lvs | grep 05a88a94
  05a88a94-5bd6-4698-be69-39e78c84e1a5 12f68692-2a5a-4e48-af5e-4679bca7fd44 -wi-ao---- 14.12g
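To tie this back to the command under discussion: what we would
ultimately like to issue for such a disk is something along these lines
(an untested sketch; "node-vda" and the 10G threshold are made-up
values):

  -> { "execute": "block-set-write-threshold",
       "arguments": { "node-name": "node-vda",
                      "write-threshold": 10737418240 } }
  <- { "return": {} }

and then, once the guest writes past that offset, QEMU emits a one-shot
BLOCK_WRITE_THRESHOLD event, roughly:

  <- { "event": "BLOCK_WRITE_THRESHOLD",
       "data": { "node-name": "node-vda",
                 "amount-exceeded": 65536,
                 "write-threshold": 10737418240 } }

at which point oVirt would extend the LV and re-arm the threshold.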
> then indeed, "vda" is the local qcow2 file, and "vda[1]" is the
> backing file on the LV storage.
>
> Normally, you only care about the write threshold at the active layer
> (the local file, with name "vda"), because that is the only image that
> will normally be allocating sectors. But in the case of active commit,
> where you are taking the thin-provisioned local file and writing its
> clusters back into the backing LV, the action of commit can allocate
> sectors in the backing file.

Right.

> Thus, libvirt wants to let you set a
> write-threshold on both parts of the backing chain (the active
> wrapper, and the LV backing file), where the event could fire on
> either node first. The existing libvirt virConnectDomainGetAllStats()
> can already be used to poll allocation growth (the block.N.allocation
> statistic in libvirt, or 'virtual-size' in QMP's 'ImageInfo'), but the
> event would let you drop polling.

Yes, exactly the intent.

> However, while starting to code the libvirt side of things, I've hit a
> couple of snags with interacting with the qemu design. First, the
> 'block-set-write-threshold' command is allowed to set a threshold by
> 'node-name' (any BDS, whether active or backing),

Yes, this emerged during the review of my patch. I first took the
simplest approach (probably simplistic, in retrospect), but IIRC it was
pointed out that setting by node-name is the most flexible approach,
hence it was required. See:

http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02503.html
http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02580.html
http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02831.html

> but libvirt is not yet
> setting 'node-name' for backing files (so even though libvirt knows
> how to resolve "vda[1]" to the backing chain,

I had vague memories of this, hence my clumsy and poorly worded
question earlier about how to resolve 'vda[1]' :\

> it does not yet have a way to
> tell qemu to set the threshold on that BDS until libvirt starts naming
> all nodes). Second, querying for the current threshold value is only
> possible in struct 'BlockDeviceInfo', which is reported as the
> top-level of each disk in 'query-block', and also for
> 'query-named-block-nodes'. However, when it comes to collecting
> block.N.allocation, libvirt is instead getting information from the
> sub-struct 'ImageInfo', which is reported recursively for
> BlockDeviceInfo in 'query-block' but not reported for
> 'query-named-block-nodes'.

IIRC 'query-named-block-nodes' was the preferred way to extract this
information (see also
http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02944.html )

> So it is that much harder to
> call 'query-named-block-nodes' and then correlate that information
> back into the tree of information for anything but the active image.
> So it may be a while before thresholds on "vda[1]" actually work for
> block commit; my initial implementation will just focus on the active
> image "vda".
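To make the mismatch concrete, here is roughly how the two queries lay
out their data (a heavily abbreviated, hand-written sketch with made-up
node names, not captured output). 'query-named-block-nodes' returns a
flat list of BlockDeviceInfo, one entry per node, each carrying its own
"write_threshold":

  -> { "execute": "query-named-block-nodes" }
  <- { "return": [
         { "node-name": "node-vda",
           "write_threshold": 10737418240, ... },
         { "node-name": "node-vda-backing",
           "write_threshold": 0, ... } ] }

while 'query-block' nests everything under the drive, with the
allocation data living in the recursive 'ImageInfo' sub-struct:

  -> { "execute": "query-block" }
  <- { "return": [
         { "device": "drive-virtio-disk0",
           "inserted": {
             "write_threshold": 10737418240,
             "image": { "filename": "...",
                        "backing-image": { ... } } } } ] }

so the caller has to correlate the flat list against the per-drive tree
by itself.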
> I'm wondering if qemu can make it easier by duplicating threshold
> information into 'ImageInfo' rather than just 'BlockDeviceInfo', so
> that a single call to 'query-block' rather than a second call to
> 'query-named-block-nodes' can scrape the threshold information for
> every BDS in the chain.

I think I simply didn't explore this option back then; I had a vague
feeling that it was better not to duplicate information, but I can't
recall a solid reason on my side.

Best,

--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani