----- Original Message -----
> From: "Eric Blake" <eblake@xxxxxxxxxx>
> To: "Francesco Romani" <fromani@xxxxxxxxxx>
> Cc: libvir-list@xxxxxxxxxx, "Nir Soffer" <nsoffer@xxxxxxxxxx>,
>     "Peter Krempa" <pkrempa@xxxxxxxxxx>, qemu-devel@xxxxxxxxxx
> Sent: Friday, May 22, 2015 6:33:01 AM
> Subject: Re: RFC: exposing qemu's block-set-write-threshold
>
> [adding qemu]
>
> > I read the thread and I'm pretty sure this will be a silly question,
> > but I want to make sure I am on the same page and that I'm not
> > somehow confused by the terminology.
> >
> > Let's consider the simplest of the situations we face in oVirt:
> >
> > (thin provisioned qcow2 disk on LV)
> >
> > vda=[format=qcow2] -> lv=[path=/dev/mapper/$UUID]
> >
> > Isn't the LV here the 'backing file' (actually, backing block
> > device) of the disk?
>
> Restating what you wrote into libvirt terminology, I think this means
> that you have a <disk> where:
>   <driver> is qcow2
>   <source> is a local file name
>   <device> names vda
>   <backingStore index='1'> describes the backing LV:
>     <driver> is also qcow2 (as polling allocation growth in order to
>     resize on demand only makes sense for qcow2 format)
>     <source> is /dev/mapper/$UUID

Yes, exactly my point. I just want to be 100% sure that the (slightly)
different parlances of the three groups (oVirt/libvirt/QEMU) are
aligned on the same meaning, and that nothing gets lost in translation.

For the final confirmation, here's the actual XML we produce:

<disk device="disk" snapshot="no" type="block">
  <address bus="0x00" domain="0x0000" function="0x0" slot="0x05" type="pci"/>
  <source dev="/rhev/data-center/00000002-0002-0002-0002-00000000014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5"/>
  <target bus="virtio" dev="vda"/>
  <serial>ee1295ee-7ddc-4030-be5e-4557538bc4d2</serial>
  <boot order="1"/>
  <driver cache="none" error_policy="stop" io="native" name="qemu" type="qcow2"/>
</disk>

For the sake of completeness:

$ ls -lh /rhev/data-center/00000002-0002-0002-0002-00000000014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5
lrwxrwxrwx. 1 vdsm kvm 78 May 22 08:49 /rhev/data-center/00000002-0002-0002-0002-00000000014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5 -> /dev/12f68692-2a5a-4e48-af5e-4679bca7fd44/05a88a94-5bd6-4698-be69-39e78c84e1a5

$ ls -lh /dev/12f68692-2a5a-4e48-af5e-4679bca7fd44/
total 0
lrwxrwxrwx. 1 root root 8 May 22 08:49 05a88a94-5bd6-4698-be69-39e78c84e1a5 -> ../dm-11
lrwxrwxrwx. 1 root root 8 May 22 08:49 54673e6d-207d-4a66-8f0d-3f5b3cda78e5 -> ../dm-12
lrwxrwxrwx. 1 root root 9 May 22 08:49 ids -> ../dm-606
lrwxrwxrwx. 1 root root 9 May 22 08:49 inbox -> ../dm-607
lrwxrwxrwx. 1 root root 9 May 22 08:49 leases -> ../dm-605
lrwxrwxrwx. 1 root root 9 May 22 08:49 master -> ../dm-608
lrwxrwxrwx. 1 root root 9 May 22 08:49 metadata -> ../dm-603
lrwxrwxrwx. 1 root root 9 May 22 08:49 outbox -> ../dm-604

$ lvs | grep 05a88a94
  05a88a94-5bd6-4698-be69-39e78c84e1a5 12f68692-2a5a-4e48-af5e-4679bca7fd44 -wi-ao---- 14.12g
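To tie this back to the command under discussion: what we would
ultimately like to issue for such a disk is something along these lines
(an untested sketch; "node-vda" and the 10G threshold are made-up
values):

  -> { "execute": "block-set-write-threshold",
       "arguments": { "node-name": "node-vda",
                      "write-threshold": 10737418240 } }
  <- { "return": {} }

and then, once the guest writes past that offset, QEMU emits a one-shot
BLOCK_WRITE_THRESHOLD event, roughly:

  <- { "event": "BLOCK_WRITE_THRESHOLD",
       "data": { "node-name": "node-vda",
                 "amount-exceeded": 65536,
                 "write-threshold": 10737418240 } }

at which point oVirt would extend the LV and re-arm the threshold.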
> then indeed, "vda" is the local qcow2 file, and "vda[1]" is the
> backing file on the LV storage.
>
> Normally, you only care about the write threshold at the active layer
> (the local file, with name "vda"), because that is the only image that
> will normally be allocating sectors. But in the case of active commit,
> where you are taking the thin-provisioned local file and writing its
> clusters back into the backing LV, the action of commit can allocate
> sectors in the backing file.

Right.

> Thus, libvirt wants to let you set a
> write-threshold on both parts of the backing chain (the active
> wrapper, and the LV backing file), where the event could fire on
> either node first. The existing libvirt virConnectDomainGetAllStats()
> can already be used to poll allocation growth (the block.N.allocation
> statistic in libvirt, or 'virtual-size' in QMP's 'ImageInfo'), but the
> event would let you drop polling.

Yes, exactly the intent.

> However, while starting to code the libvirt side of things, I've hit a
> couple of snags with interacting with the qemu design. First, the
> 'block-set-write-threshold' command is allowed to set a threshold by
> 'node-name' (any BDS, whether active or backing),

Yes, this emerged during the review of my patch. I first took the
simplest approach (probably simplistic, in retrospect), but IIRC it was
pointed out that setting by node-name is the most flexible approach,
hence it was required. See:

http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02503.html
http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02580.html
http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02831.html

> but libvirt is not yet
> setting 'node-name' for backing files (so even though libvirt knows
> how to resolve "vda[1]" to the backing chain,

I had vague memories of this, hence my clumsy and poorly worded
question earlier about how to resolve 'vda[1]' :\

> it does not yet have a way to
> tell qemu to set the threshold on that BDS until libvirt starts naming
> all nodes). Second, querying for the current threshold value is only
> possible in struct 'BlockDeviceInfo', which is reported as the
> top-level of each disk in 'query-block', and also for
> 'query-named-block-nodes'. However, when it comes to collecting
> block.N.allocation, libvirt is instead getting information from the
> sub-struct 'ImageInfo', which is reported recursively for
> BlockDeviceInfo in 'query-block' but not reported for
> 'query-named-block-nodes'.

IIRC 'query-named-block-nodes' was the preferred way to extract this
information (see also
http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02944.html )

> So it is that much harder to
> call 'query-named-block-nodes' and then correlate that information
> back into the tree of information for anything but the active image.
> So it may be a while before thresholds on "vda[1]" actually work for
> block commit; my initial implementation will just focus on the active
> image "vda".
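To make the mismatch concrete, here is roughly how the two queries lay
out their data (a heavily abbreviated, hand-written sketch with made-up
node names, not captured output). 'query-named-block-nodes' returns a
flat list of BlockDeviceInfo, one entry per node, each carrying its own
"write_threshold":

  -> { "execute": "query-named-block-nodes" }
  <- { "return": [
         { "node-name": "node-vda",
           "write_threshold": 10737418240, ... },
         { "node-name": "node-vda-backing",
           "write_threshold": 0, ... } ] }

while 'query-block' nests everything under the drive, with the
allocation data living in the recursive 'ImageInfo' sub-struct:

  -> { "execute": "query-block" }
  <- { "return": [
         { "device": "drive-virtio-disk0",
           "inserted": {
             "write_threshold": 10737418240,
             "image": { "filename": "...",
                        "backing-image": { ... } } } } ] }

so the caller has to correlate the flat list against the per-drive tree
by itself.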
> I'm wondering if qemu can make it easier by duplicating threshold
> information into 'ImageInfo' rather than just 'BlockDeviceInfo', so
> that a single call to 'query-block' rather than a second call to
> 'query-named-block-nodes' can scrape the threshold information for
> every BDS in the chain.

I think I simply didn't explore this option back then; I had a vague
feeling that it was better not to duplicate information, but I can't
recall a solid reason on my side.

Best,

--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani