[adding qemu]

On 05/21/2015 07:49 AM, Francesco Romani wrote:
> (CCing Nir, from the oVirt/RHEV storage team)
>
> ----- Original Message -----
>> From: "Eric Blake" <eblake@xxxxxxxxxx>
>> To: "Peter Krempa" <pkrempa@xxxxxxxxxx>
>> Cc: libvir-list@xxxxxxxxxx
>> Sent: Tuesday, May 19, 2015 2:42:16 PM
>> Subject: Re: RFC: exposing qemu's block-set-write-threshold
>
> Hi, sorry for joining pretty late.
> Let me add a few notes from the perspective of a consumer (although
> very interested! :) of the API.

Since you wrote the qemu side with the intent of being the end user, I
welcome the feedback. (qemu commit e2462113, for those joining the
conversation)

> I read the thread and I'm pretty sure this will be a silly question,
> but I want to make sure I am on the same page and not somehow confused
> by the terminology.
>
> Let's consider the simplest of the situations we face in oVirt:
>
> (thin provisioned qcow2 disk on LV)
>
> vda=[format=qcow2] -> lv=[path=/dev/mapper/$UUID]
>
> Isn't the LV here the 'backing file' (actually, backing block device)
> of the disk?

Restating what you wrote in libvirt terminology, I think this means
that you have a <disk> where:

  <driver> is qcow2
  <source> is a local file name
  <target> names vda
  <backingStore index='1'> describes the backing LV:
    <format> is also qcow2 (polling allocation growth in order to
      resize on demand only makes sense for the qcow2 format)
    <source> is /dev/mapper/$UUID

Then indeed, "vda" is the local qcow2 file, and "vda[1]" is the backing
file on the LV storage. Normally, you only care about the write
threshold at the active layer (the local file, with name "vda"),
because that is the only image that will normally be allocating
sectors. But in the case of active commit, where you are taking the
thin-provisioned local file and writing its clusters back into the
backing LV, the act of committing can allocate sectors in the backing
file.
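Spelled out as domain XML, the layout above might look like the sketch
below. This is illustrative only: the local file path is a made-up
placeholder, and $UUID stands in for the real device-mapper name, just
as in your example.

```xml
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <!-- active layer: thin-provisioned local qcow2 file ("vda") -->
  <source file='/var/lib/libvirt/images/vda-active.qcow2'/>
  <!-- backing element ("vda[1]"): qcow2 data on an LV -->
  <backingStore type='block' index='1'>
    <format type='qcow2'/>
    <source dev='/dev/mapper/$UUID'/>
    <backingStore/>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>
```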
Thus, libvirt wants to let you set a write threshold on both parts of
the backing chain (the active wrapper, and the LV backing file), where
the event could fire on either node first. The existing libvirt
virConnectGetAllDomainStats() can already be used to poll allocation
growth (the block.N.allocation statistic in libvirt, or 'virtual-size'
in QMP's 'ImageInfo'), but the event would let you drop polling.

However, while starting to code the libvirt side of things, I've hit a
couple of snags in interacting with the qemu design.

First, the 'block-set-write-threshold' command allows setting a
threshold by 'node-name' (any BDS, whether active or backing), but
libvirt is not yet setting 'node-name' for backing files. So even
though libvirt knows how to resolve "vda[1]" within the backing chain,
it has no way to tell qemu to set the threshold on that BDS until
libvirt starts naming all nodes.

Second, querying the current threshold value is only possible via the
struct 'BlockDeviceInfo', which is reported as the top level of each
disk in 'query-block', and also for 'query-named-block-nodes'.
However, when collecting block.N.allocation, libvirt instead gets its
information from the sub-struct 'ImageInfo', which is reported
recursively for each BlockDeviceInfo in 'query-block' but is not
reported for 'query-named-block-nodes'. So it is that much harder to
call 'query-named-block-nodes' and then correlate that information
back into the tree of information for anything but the active image.

So it may be a while before thresholds on "vda[1]" actually work for
block commit; my initial implementation will just focus on the active
image "vda". I'm wondering if qemu could make this easier by
duplicating the threshold information into 'ImageInfo' rather than
just 'BlockDeviceInfo', so that a single call to 'query-block', rather
than a second call to 'query-named-block-nodes', could scrape the
threshold information for every BDS in the chain.
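To make the moving parts concrete, here is a minimal consumer-side
sketch in Python of the workflow this thread is about: arming qemu's
write threshold on one node, and re-arming it after the management
layer has extended the LV (the threshold fires once and must be set
again). The QMP command shape follows 'block-set-write-threshold'
('node-name' plus 'write-threshold' in bytes); the node name and the
512 MiB headroom policy are purely illustrative, not from the original
message.

```python
def set_threshold_cmd(node_name, threshold_bytes):
    """Build the QMP command that arms the write threshold on one BDS.

    block-set-write-threshold takes the node name of the BDS and an
    absolute byte offset; qemu emits an event when a write crosses it.
    """
    return {
        "execute": "block-set-write-threshold",
        "arguments": {
            "node-name": node_name,
            "write-threshold": threshold_bytes,
        },
    }


def next_threshold(lv_size, headroom=512 * 1024 * 1024):
    """Illustrative re-arm policy: place the threshold 'headroom' bytes
    below the end of the LV, so the event fires with room to extend the
    LV before the guest can hit a real ENOSPC."""
    return max(lv_size - headroom, 0)


# Example: after extending the LV to 10 GiB, re-arm 512 MiB below its
# end on a hypothetical node name.
cmd = set_threshold_cmd("drive-virtio-disk0", next_threshold(10 * 1024**3))
```

The same helper would serve for the backing node once libvirt names all
nodes: only the node-name argument changes, not the command shape.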
Then again, I know there is work underway to refactor qemu block
throttling to allow throttle values on backing images that differ from
the active image; and throttling is currently reported in
'BlockDeviceInfo'. So I'm not sure yet whether adding redundant
information to 'ImageInfo' would help anything, or get in the way of
the throttle group work.

>>> Since virDomainSetBlockIoTune operates on disk-level and the event will
>>> need to be registered on a backing-chain element level, using
>>> virDomainSetBlockIoTune won't be a good choice, IMO.
>>
>> Oh, I wasn't even thinking about backing-chain levels. You are right -
>> it is entirely possible for someone to want to set a threshold reporting
>> for "vda[1]" (the backing element of "vda") - of course, it would only
>> kick in during a blockcommit to vda[1], but it is still important to
>> allow. But the BlockIoTune XML is not (yet) geared for backing images.
>
> Agreed about SetBlockIoTune not looking the best choice yet.

Except that SetBlockIoTune is the one libvirt API currently using the
throttle information in 'BlockDeviceInfo', and it will soon have to be
extended to manage throttle groups and throttle information on backing
files anyway.

>>> As for the @disk parameter it will need to take the target with the
>>> index argument since I know that oVirt is using the same approach also
>>> for backing chain sub-elements hosted on LVM when doing snapshot
>>> merging via the block job APIs.
>
> That's correct

Hopefully I can get the initial support for "vda" events in first; then
we can tackle the additional work to add "vda[1]" events.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list