Re: libvirt-devaddr: a new library for device address assignment

Laine Stump <laine@xxxxxxxxx> · Thu, 19 Mar 2020 15:00:09 -0400

TL;DR - I'm not as anti-XML as the proposal seems to be, but also not 
pro-XML. I also (after thinking about it) understand the advantage of 
putting this in a separate library. So yeah, let's do it!

On 3/13/20 6:47 AM, Daniel P. Berrangé wrote:
On Fri, Mar 13, 2020 at 11:23:44AM +0200, Dan Kenigsberg wrote:
On Wed, 4 Mar 2020, 14:51 Daniel P. Berrangé, <berrange@xxxxxxxxxx> wrote:
We've been doing a lot of refactoring of code in recent times, and also
have plans for significant infrastructure changes. We still need to
spend time delivering interesting features to users / applications.
This mail is to introduce an idea for a solution to a specific
area where applications have had long term pain with libvirt's current
"mechanism, not policy" approach - device addressing. This is a way
for us to show brand new ideas & approaches for what the libvirt
project can deliver in terms of management APIs.

To set expectations straight: I have written no code for this yet,
merely identified the gap & conceptual solution.

The device addressing problem
=============================

One of the key jobs libvirt does when processing a new domain XML
configuration is to assign addresses to all devices that are present.
This involves adding various device controllers (PCI bridges, PCI root
ports, IDE/SCSI buses, USB controllers, etc) if they are not already
present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each
device so they are associated with controllers. When libvirt spawns a
QEMU guest, it will pass full address information to QEMU.

Libvirt, as a general rule, aims to avoid defining and implementing
policy around expansion of guest configuration / defaults, however, it
is inescapable in the case of device addressing due to the need to
guarantee a stable hardware ABI to make live migration and save/restore
to disk work.  The policy that libvirt has implemented for device
addressing is, as much as possible, the same as the addressing scheme
QEMU would apply itself.

While libvirt succeeds in its goal of providing a stable hardware ABI,
the addressing scheme used is not well suited to all deployment
scenarios of QEMU. This is an inevitable result of having a specific
assignment policy implemented in libvirt which has to trade off mutually
incompatible use cases/goals.

When the libvirt addressing policy is not sufficient, management
applications are forced to take on address assignment themselves,
which is a massive non-trivial job with many subtle problems to
consider.

Places where libvirt's addressing is insufficient for PCI include

  * Setting up multiple guest NUMA nodes and associating devices to
    specific nodes
  * Pre-emptive creation of extra PCIe root ports, to allow for later
    device hotplug on PCIe topologies
  * Determining whether to place a device on a PCI or PCIe bridge
  * Controlling whether a device is placed into a hotpluggable slot
  * Controlling whether a PCIe root port supports hotplug or not
  * Determining whether to place all devices on distinct slots or
    buses, vs grouping them all into functions on the same slot
  * Ability to expand the device addressing without being on the
    hypervisor host
(I don't understand the last bullet point)
I'm not sure if this is still the case, but at some point in time
there was a desire from KubeVirt to be able to expand the users'
configuration when loaded in KubeVirt, filling in various defaults
for devices. This would run when the end user YAML/JSON config
was first posted to the k8s API for storage, some arbitrary amount
of time later the config gets chosen to run on a virtualization
host at which point it is turned into libvirt domain XML.

If I recall the discussion properly, the context was that we wanted
all the stuff like PCI addresses, MAC addresses, and exact machinetype
to be "backfilled" from libvirt into the Kubevirt config so that kubevirt
would remember it, but for them that's a one-way street. So having all these things
set by a separate API (even in a separate library) would definitely be 
an advantage for them, as long as all the same info was available at 
that time (e.g. you really need to know the machinetypes supported by 
the specific qemu that is going to be used in order to set the exact 
machinetype)

Libvirt wishes to avoid implementing many different address assignment
policies. It also wishes to keep the domain XML as a representation
of the virtual hardware, not add a bunch of properties to it which
merely serve as tunable input parameters for device addressing
algorithms.

There is thus a dilemma here. Management applications increasingly
need fine grained control over device addressing, while libvirt
doesn't want to expose fine grained policy controls via the XML.

The new libvirt-devaddr API
===========================

The way out of this is to define a brand new virt management API
which tackles this specific problem in a way that addresses all the
problems mgmt apps have with device addressing and explicitly
provides a variety of policy impls with tunable behaviour.

By "new API", I actually mean an entirely new library, completely
distinct from libvirt.so, or anything else we've delivered so
far.

I was at first against the idea of a completely separate library, since 
each new library means a new package to be maintained and installed. 
However, I do see the advantage of being completely disconnected from 
libvirt, since there may be scenarios where libvirt isn't needed (maybe 
libvirt is on a different host, or maybe something else (libvirt-ng?
:-P) is being used). Keeping this separate means it can be used in other
scenarios. So now I agree with this.

The closest we've come to delivering something at this kind
of conceptual level, would be the abortive attempt we made with
"libvirt-builder" to deliver a policy-driven API instead of mechanism
based. This proposal is still quite different from that attempt.

At a high level

  * The new API is "libvirt-devaddr" - short for "libvirt device addressing"

It's more than just device addresses though. (On the other hand, a name 
is just a name, so...)

  * As input it will take

    1. The guest CPU architecture and machine type

To repeat the point above - do we expect libvirt-devaddr to provide the 
exact machinetype? If so, what will be the mechanism for telling it 
exactly which machinetypes are supported? Will it need to replicate all 
of libvirt's qemu capabilities code? (and would that really work if, 
say, libvirt-devaddr is being used on a machine different from the 
machine where the virtual machine will eventually be run?)

    2. A list of global tunables specifying desired behaviour of the
       address assignment policy
    3. A minimal list of devices needed in the virtual machine, with
       optional addresses and optional per-device tunables to override
       the global tunables

  * As output it will emit

    1. fully expanded list of devices needed in the virtual machine,
       with addressing information sufficient to ensure stable hardware ABI
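
As a rough illustration of the inputs listed above, here is a minimal Go sketch. No libvirt-devaddr code exists, so every name in it (`Tunables`, `Device`, `Request`, `Effective`) is an assumption made for the example, and the two tunables shown are only loosely based on the pain points listed earlier. The one behaviour it demonstrates is per-device tunables overriding the global ones.

```go
package main

import "fmt"

// All names here are hypothetical; libvirt-devaddr has no published API.

// Tunables are policy knobs. A nil pointer means "not set", so an
// explicit false can be told apart from an unset value.
type Tunables struct {
	Hotpluggable *bool // place the device in a hotpluggable slot?
	NUMANode     *int  // guest NUMA node to associate the device with
}

// Device is one entry in the minimal input list.
type Device struct {
	Kind     string   // e.g. "virtio-net", "virtio-blk"
	Address  string   // optional; empty means "please assign one"
	Tunables Tunables // optional per-device overrides
}

// Request bundles the three inputs: arch/machine, global tunables,
// and the minimal device list.
type Request struct {
	Arch    string
	Machine string
	Global  Tunables
	Devices []Device
}

// Effective returns the tunables that apply to dev: a per-device
// setting wins, otherwise the global setting is used.
func (r Request) Effective(dev Device) Tunables {
	out := r.Global
	if dev.Tunables.Hotpluggable != nil {
		out.Hotpluggable = dev.Tunables.Hotpluggable
	}
	if dev.Tunables.NUMANode != nil {
		out.NUMANode = dev.Tunables.NUMANode
	}
	return out
}

func boolPtr(b bool) *bool { return &b }

func main() {
	req := Request{
		Arch:    "x86_64",
		Machine: "q35",
		Global:  Tunables{Hotpluggable: boolPtr(true)},
		Devices: []Device{
			{Kind: "virtio-net"},
			{Kind: "virtio-blk", Tunables: Tunables{Hotpluggable: boolPtr(false)}},
		},
	}
	for _, d := range req.Devices {
		eff := req.Effective(d)
		fmt.Printf("%s hotpluggable=%v\n", d.Kind, *eff.Hotpluggable)
	}
}
```

Using pointers for the tunables is just one way to tell "unset" apart from "explicitly false"; a real design might prefer option functions or a map.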

I know you already know it and it's implied in what you say, but just to 
make sure it's clear to anybody else, the "expanded list of devices" 
will also include all PCI (and SCSI and SATA and whatever) controllers 
needed for the entire hierarchy. (Or maybe you said that and I missed 
it. Wouldn't surprise me)

This means that the library will need to know which types of which 
controllers are supported for the machinetype being requested (and of 
course what is supported by each controller). Is it going to query qemu? 
Which qemu - the one on the host where libvirt-devaddr is being called I 
suppose, but that won't necessarily be the same as the host where the 
guest will eventually run.

Will libvirt-devaddr care about things all the way to the level of which 
type of pcie-root-port to use (for example)?

And what about all the odd attributes of various controllers that 
libvirt sets to a default value and then stores in the XML (chassis id, 
etc)? I guess we need to take care of all those as well.

Initially the API would implement something that behaves the same
way as libvirt's current address assignment API.

The intended usage would be

  * Mgmt application makes a minimal list of devices they want in
    their guest
  * List of devices is fed into libvirt-devaddr API
  * Mgmt application gets back a full list of devices & addresses
  * Mgmt application writes a libvirt XML doc using this full list &
    addresses
  * Mgmt application creates the guest in libvirt

IOW, this new "libvirt-devaddr" API is intended to be used prior to
creating the XML that is used by libvirt. The API could also be used
prior to needing to hotplug a new device to an existing guest.

So everything returned from the original call would need to be kept 
around in that form (or the application would need to be able to 
reproduce it on demand), and that's then fed into the API. I guess this 
could just be the same API - similar to how libvirt acts now, it would 
accept any address info provided, and then assign it wherever it was 
omitted.
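
A minimal Go sketch of that "keep what was provided, assign what was omitted" behaviour follows. It assumes a single flat bus and treats slot 0 as "unassigned"; the names `Device` and `AssignSlots` are hypothetical and not part of any existing API. A real implementation would deal with full PCI addresses and a controller hierarchy.

```go
package main

import "fmt"

// Hypothetical types for illustration only.

// Device carries an optional slot on a single flat bus.
// Slot 0 is treated as "not assigned" in this sketch.
type Device struct {
	Name string
	Slot int
}

const maxSlot = 31 // a conventional PCI bus has slots 0-31

// AssignSlots keeps every slot the caller supplied and gives each
// remaining device the lowest free slot. It fails if a supplied slot
// is out of range, if two devices claim one slot, or if the bus is full.
func AssignSlots(devs []Device) ([]Device, error) {
	used := make(map[int]string)
	for _, d := range devs {
		if d.Slot == 0 {
			continue
		}
		if d.Slot < 1 || d.Slot > maxSlot {
			return nil, fmt.Errorf("%s: slot %d out of range", d.Name, d.Slot)
		}
		if other, ok := used[d.Slot]; ok {
			return nil, fmt.Errorf("slot %d claimed by both %s and %s", d.Slot, other, d.Name)
		}
		used[d.Slot] = d.Name
	}

	out := make([]Device, len(devs))
	copy(out, devs)
	next := 1
	for i := range out {
		if out[i].Slot != 0 {
			continue // caller-provided address is preserved
		}
		for next <= maxSlot {
			if _, taken := used[next]; !taken {
				break
			}
			next++
		}
		if next > maxSlot {
			return nil, fmt.Errorf("no free slot for %s", out[i].Name)
		}
		out[i].Slot = next
		used[next] = out[i].Name
	}
	return out, nil
}

func main() {
	// Initial definition: one device pinned, one left to the library.
	guest, err := AssignSlots([]Device{{Name: "net0"}, {Name: "disk0", Slot: 3}})
	if err != nil {
		panic(err)
	}
	fmt.Println(guest)

	// Hotplug: feed the previous result back in plus the new device.
	guest, err = AssignSlots(append(guest, Device{Name: "net1"}))
	if err != nil {
		panic(err)
	}
	fmt.Println(guest)
}
```

Because already-assigned devices pass through unchanged, the same call can serve both the initial expansion and a later hotplug, provided the application stores the earlier result.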

This API is intended to be a deliverable of the libvirt project, but
it would be completely independent of the current libvirt API. Most
especially note that it would NOT use the domain XML in any way.
This gives applications maximum flexibility in how they consume this
functionality, not trying to force a way to build domain XML.

I was originally going to argue in favor of using the same XML, since we 
otherwise have to convert back and forth. But during the extra long time 
I've taken to think about it, I think I agree that this isn't important, 
especially if the chosen format is as simple as possible.

This procedure forces Mgmt to learn a new language to describe device
placement. Mgmt (or should I just say "we"?) currently expresses the
"minimal list of devices" in XML form and passes it to libvirt. Here we
are asked to pass it once to libvirt-devaddr, parse its output, and
feed it as XML to libvirt.
I'm not necessarily suggesting we even need a document format at the
core API level. I could easily see the API working in terms of a
list of Go structs, with tunables being normal method parameters.
A JSON format could be an optional way to serialize the Go structs,
but if the app were written in Go the JSON may not be needed at all.
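
To make the "Go structs first, JSON optional" idea concrete, here is a small sketch using the standard `encoding/json` package. The `Device` type, its field names, and the `Encode`/`Decode` helpers are assumptions for illustration; nothing here reflects an agreed format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Device is a hypothetical record; the field names are illustrative.
type Device struct {
	Kind    string `json:"kind"`
	Address string `json:"address,omitempty"` // left out until assigned
}

// Encode serializes the device list for callers not written in Go.
func Encode(devs []Device) ([]byte, error) {
	return json.Marshal(devs)
}

// Decode is the inverse of Encode.
func Decode(data []byte) ([]Device, error) {
	var devs []Device
	if err := json.Unmarshal(data, &devs); err != nil {
		return nil, err
	}
	return devs, nil
}

func main() {
	// A Go application can use the structs directly.
	devs := []Device{
		{Kind: "virtio-net", Address: "0000:01:00.0"},
		{Kind: "virtio-blk"},
	}

	// Anything else can exchange the same data as JSON.
	data, err := Encode(devs)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))

	back, err := Decode(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(back), back[0].Kind)
}
```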

"Using JSON when we eventually need XML is just using XML with extra 
steps". Or something like that. Is JSON really that much simpler than XML?

Anyway, since we aren't saddled with the precondition that "everything 
must be stable and backward compatible", there's freedom to experiment, 
so I guess it's not really necessary to spend too much time debating and 
trying to make the "definite 100% sure best decision". We can just pick 
something and try it. If it works out, great; if it doesn't then we pick 
something else :-)

I believe it would be easier to use the domxml as the base language
for the new library, too. libvirt-devaddr would accept it with various
hints (expressed as its own extension to the XML?) such as "place all
of these devices in the same NUMA node", "keep on root bus" or
"separate these two chattering devices to their own bus". The output
of libvirt-devaddr would be a domxml with <devices> filled with
controllers and addresses, readily available for consumption by
libvirt.
I don't believe that using the libvirt domain XML is a good idea for
this as it unnecessarily constrains the usage scenarios. Most management
applications do not use the domain XML as their canonical internal storage
format. KubeVirt has its JSON/YAML schema for k8s API, OpenStack/RHEV just
store metadata in their DB, others vary again. Some of these applications
benefit from being able to expand device topology/addressing, a long time
before they get anywhere near use of domain XML - the latter only matters
when you come to instantiate a VM on a particular host.

This explains why it's not necessary to use XML. But I don't see use of 
XML as "unnecessarily constraining" the usage scenarios. Does it make 
the code (on either side) unnecessarily inefficient? Does it require 
pulling in libraries that applications otherwise wouldn't need? Is the
required code too complex?

We could of course have a convenience method which optionally generates
a domain XML template from the output list of devices, if someone believes
that's useful to standardize on, but I don't think the domain XML should
be the core format.

I would also like this library to be usable for scenarios in which libvirt
is not involved at all. One of the strange things about the QEMU driver
in libvirt compared to the other hypervisor drivers is that it is missing
an intermediate API layer. In other drivers the hypervisor platform itself
provides a full management API layer, and libvirt merely maps the libvirt
APIs to the underlying mgmt API or data formats. IOW, libvirt is just a
mapping layer.

When you're just a "mapping layer", and you're expected to transparently 
map in both directions, it gets problematic. Especially when there are 
multiple ways of describing the same setup, or options supported at one 
end that are ignored/not supported at the other. Not sure why I'm 
replying to this point, just when I hear "mapping layer" I think about 
the fact that netcf was never able to deal with the many different ways 
that debian interfaces files could be written, or ignore but leave in 
place extra ifcfg options it didn't support (that's just a couple that 
come to mind, and we shouldn't derail this conversation to talk about 
them :-/)

QEMU though only really provides a few low level building blocks, alongside
other building blocks you have to pull in from Linux. It doesn't even provide
a configuration file. Libvirt pulls all these pieces together to form the
complete QEMU management API, as well as mapping everything onto the libvirt
domain XML & APIs. I think there is scope & interest/demand to look at
creating an intermediate layer that provides a full management layer for
QEMU, such that libvirt can eventually become just a mapping layer for
QEMU. In such a scenario the libvirt-devaddr library is still very useful
but you don't want it using the libvirt domain XML, as that's not likely
to be the format in use.

My opinion would be that it's not necessary for libvirt domain XML (or a
subset) to be the format, but that it also shouldn't necessarily be avoided
(unless the alternative is better in some quantifiable way).

Anyway, in the end I think my opinion is we should push ahead and think 
about consequences of the specifics later, after some experimenting. I'd 
love to help if there's a place for it. I'm just not sure where/how I 
could contribute, especially since I have only about 4 hours worth of 
golang knowledge :-) (certainly not against getting more though!)