On Fri, Mar 13, 2020 at 12:47 PM Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote:
>
> On Fri, Mar 13, 2020 at 11:23:44AM +0200, Dan Kenigsberg wrote:
> > On Wed, 4 Mar 2020, 14:51 Daniel P. Berrangé, <berrange@xxxxxxxxxx> wrote:
> > >
> > > We've been doing a lot of refactoring of code in recent times, and also
> > > have plans for significant infrastructure changes. We still need to
> > > spend time delivering interesting features to users / applications.
> > > This mail is to introduce an idea for a solution to a specific
> > > area applications have had long term pain with libvirt's current
> > > "mechanism, not policy" approach - device addressing. This is a way
> > > for us to show brand new ideas & approaches for what the libvirt
> > > project can deliver in terms of management APIs.
> > >
> > > To set expectations straight: I have written no code for this yet,
> > > merely identified the gap & conceptual solution.
> > >
> > >
> > > The device addressing problem
> > > =============================
> > >
> > > One of the key jobs libvirt does when processing a new domain XML
> > > configuration is to assign addresses to all devices that are present.
> > > This involves adding various device controllers (PCI bridges, PCI root
> > > ports, IDE/SCSI buses, USB controllers, etc) if they are not already
> > > present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each
> > > device so they are associated with controllers. When libvirt spawns a
> > > QEMU guest, it will pass full address information to QEMU.
> > >
> > > Libvirt, as a general rule, aims to avoid defining and implementing
> > > policy around expansion of guest configuration / defaults, however, it
> > > is inescapable in the case of device addressing due to the need to
> > > guarantee a stable hardware ABI to make live migration and save/restore
> > > to disk work.
> > > The policy that libvirt has implemented for device
> > > addressing is, as much as possible, the same as the addressing scheme
> > > QEMU would apply itself.
> > >
> > > While libvirt succeeds in its goal of providing a stable hardware ABI,
> > > the addressing scheme used is not well suited to all deployment
> > > scenarios of QEMU. This is an inevitable result of having a specific
> > > assignment policy implemented in libvirt which has to trade off mutually
> > > incompatible use cases/goals.
> > >
> > > When the libvirt addressing policy is not sufficient, management
> > > applications are forced to take on address assignment themselves,
> > > which is a massive non-trivial job with many subtle problems to
> > > consider.
> > >
> > > Places where libvirt's addressing is insufficient for PCI include
> > >
> > > * Setting up multiple guest NUMA nodes and associating devices to
> > >   specific nodes
> > > * Pre-emptive creation of extra PCIe root ports, to allow for later
> > >   device hotplug on PCIe topologies
> > > * Determining whether to place a device on a PCI or PCIe bridge
> > > * Controlling whether a device is placed into a hotpluggable slot
> > > * Controlling whether a PCIe root port supports hotplug or not
> > > * Determining whether to place all devices on distinct slots or
> > >   buses, vs grouping them all into functions on the same slot
> > > * Ability to expand the device addressing without being on the
> > >   hypervisor host
> >
> > (I don't understand the last bullet point)
>
> I'm not sure if this is still the case, but at some point in time
> there was a desire from KubeVirt to be able to expand the users'
> configuration when loaded in KubeVirt, filling in various defaults
> for devices. This would run when the end user YAML/JSON config
> was first posted to the k8s API for storage, some arbitrary amount
> of time later the config gets chosen to run on a virtualization
> host at which point it is turned into libvirt domain XML.
Ah, I did not hear about this before, but I see why something like
this would be useful even without libvirt-devaddr. Having something
like virDomainDryRunXML() would have eliminated old race conditions we
had in oVirt.

>
> > > Libvirt wishes to avoid implementing many different address assignment
> > > policies. It also wishes to keep the domain XML as a representation
> > > of the virtual hardware, not add a bunch of properties to it which
> > > merely serve as tunable input parameters for device addressing
> > > algorithms.
> > >
> > > There is thus a dilemma here. Management applications increasingly
> > > need fine grained control over device addressing, while libvirt
> > > doesn't want to expose fine grained policy controls via the XML.
> > >
> > >
> > > The new libvirt-devaddr API
> > > ===========================
> > >
> > > The way out of this is to define a brand new virt management API
> > > which tackles this specific problem in a way that addresses all the
> > > problems mgmt apps have with device addressing and explicitly
> > > provides a variety of policy impls with tunable behaviour.
> > >
> > > By "new API", I actually mean an entirely new library, completely
> > > distinct from libvirt.so, or anything else we've delivered so
> > > far. The closest we've come to delivering something at this kind
> > > of conceptual level, would be the abortive attempt we made with
> > > "libvirt-builder" to deliver a policy-driven API instead of mechanism
> > > based. This proposal is still quite different from that attempt.
> > >
> > > At a high level
> > >
> > > * The new API is "libvirt-devaddr" - short for "libvirt device addressing"
> > >
> > > * As input it will take
> > >
> > >   1. The guest CPU architecture and machine type
> > >   2. A list of global tunables specifying desired behaviour of the
> > >      address assignment policy
> > >   3.
> > >      A minimal list of devices needed in the virtual machine, with
> > >      optional addresses and optional per-device tunables to override
> > >      the global tunables
> > >
> > > * As output it will emit
> > >
> > >   1. A fully expanded list of devices needed in the virtual machine,
> > >      with addressing information sufficient to ensure a stable
> > >      hardware ABI
> > >
> > > Initially the API would implement something that behaves the same
> > > way as libvirt's current address assignment API.
> > >
> > > The intended usage would be
> > >
> > > * Mgmt application makes a minimal list of devices they want in
> > >   their guest
> > > * List of devices is fed into libvirt-devaddr API
> > > * Mgmt application gets back a full list of devices & addresses
> > > * Mgmt application writes a libvirt XML doc using this full list &
> > >   addresses
> > > * Mgmt application creates the guest in libvirt
> > >
> > > IOW, this new "libvirt-devaddr" API is intended to be used prior to
> > > creating the XML that is used by libvirt. The API could also be used
> > > prior to needing to hotplug a new device to an existing guest.
> > > This API is intended to be a deliverable of the libvirt project, but
> > > it would be completely independent of the current libvirt API. Most
> > > especially note that it would NOT use the domain XML in any way.
> > > This gives applications maximum flexibility in how they consume this
> > > functionality, not trying to force a way to build domain XML.
> >
> > This procedure forces Mgmt to learn a new language to describe device
> > placement. Mgmt (or should I just say "we"?) currently expresses the
> > "minimal list of devices" in XML form and passes it to libvirt. Here we
> > are asked to pass it once to libvirt-devaddr, parse its output, and
> > feed it as XML to libvirt.
>
> I'm not necessarily suggesting we even need a document format at the
> core API level.
> I could easily see the API working in terms of a
> list of Go structs, with tunables being normal method parameters.
> A JSON format could be an optional way to serialize the Go structs,
> but if the app were written in Go the JSON may not be needed at all.
>
> > I believe it would be easier to use the domxml as the base language
> > for the new library, too. libvirt-devaddr would accept it with various
> > hints (expressed as its own extension to the XML?) such as "place all
> > of these devices in the same NUMA node", "keep on root bus" or
> > "separate these two chattering devices to their own bus". The output
> > of libvirt-devaddr would be a domxml with <devices> filled with
> > controllers and addresses, readily available for consumption by
> > libvirt.
>
> I don't believe that using the libvirt domain XML is a good idea for
> this as it unnecessarily constrains the usage scenarios. Most management
> applications do not use the domain XML as their canonical internal storage
> format. KubeVirt has its JSON/YAML schema for k8s API, OpenStack/RHEV just
> store metadata in their DB, others vary again. Some of these applications
> benefit from being able to expand device topology/addressing, a long time
> before they get anywhere near use of domain XML - the latter only matters
> when you come to instantiate a VM on a particular host.

Nevertheless, your suggested Go struct would become a third
representation of virtual devices, on top of domxml and the
Mgmt-canonical one. Maybe I'm just overconservative. Let us ask
kubevirt-dev what would be their preferable form to consume this
suggested API.

>
> We could of course have a convenience method which optionally generates
> a domain XML template from the output list of devices, if someone believes
> that's useful to standardize on, but I don't think the domain XML should
> be the core format.
>
> I would also like this library to be usable for scenarios in which libvirt
> is not involved at all.
> One of the strange things about the QEMU driver
> in libvirt compared to the other hypervisor drivers is that it is missing
> an intermediate API layer. In other drivers the hypervisor platform itself
> provides a full management API layer, and libvirt merely maps the libvirt
> APIs to the underlying mgmt API or data formats. IOW, libvirt is just a
> mapping layer.
>
> QEMU though only really provides a few low level building blocks, alongside
> other building blocks you have to pull in from Linux. It doesn't even provide
> a configuration file. Libvirt pulls all these pieces together to form the
> complete management QEMU API, as well as mapping everything onto the libvirt
> domain XML & APIs. I think there is scope & interest/demand to look at
> creating an intermediate layer that provides a full management layer for
> QEMU, such that libvirt can eventually become just a mapping layer for
> QEMU. In such a scenario the libvirt-devaddr library is still very useful
> but you don't want it using the libvirt domain XML, as that's not likely
> to be the format in use.
>
>
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>
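
To make the "list of Go structs, with tunables being normal method
parameters" idea a bit more concrete, here is a rough sketch of what a
consumer might see. No code exists for libvirt-devaddr, so every name
below (Device, PCIAddress, Tunables, SpareRootPorts, Expand) is made up
for illustration, and the placement policy is a toy one, not libvirt's:
on x86_64/q35 only, each device without an address gets its own
pcie-root-port on bus 0, devices that arrive with an address are passed
through unchanged, and the requested number of spare root ports is
appended for later hotplug.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// PCIAddress is a simplified PCI address (hypothetical: bus and slot only).
type PCIAddress struct {
	Bus  int `json:"bus"`
	Slot int `json:"slot"`
}

// Device is one entry in the device list; Addr stays nil until assigned.
type Device struct {
	Kind  string      `json:"kind"`
	Model string      `json:"model,omitempty"`
	Addr  *PCIAddress `json:"addr,omitempty"`
}

// Tunables stands in for the "global tunables" input (hypothetical field).
type Tunables struct {
	SpareRootPorts int // empty PCIe root ports kept for later hotplug
}

// Expand returns the fully addressed device list, including any
// controllers it had to add. Toy policy, see the assumptions above.
func Expand(arch, machine string, t Tunables, devs []Device) ([]Device, error) {
	if arch != "x86_64" || machine != "q35" {
		return nil, fmt.Errorf("unsupported target %s/%s", arch, machine)
	}
	usedSlot := map[int]bool{0: true} // bus 0 slots; slot 0 is the host bridge
	nextBus := 1
	for _, d := range devs { // reserve what the caller already fixed
		if d.Addr == nil {
			continue
		}
		if d.Addr.Bus == 0 {
			usedSlot[d.Addr.Slot] = true
		}
		if d.Addr.Bus >= nextBus {
			nextBus = d.Addr.Bus + 1
		}
	}
	var out []Device
	// addPort appends a root port in the first free bus 0 slot and
	// returns the number of the new bus behind it.
	addPort := func() (int, error) {
		for slot := 1; slot < 32; slot++ {
			if !usedSlot[slot] {
				usedSlot[slot] = true
				out = append(out, Device{Kind: "controller", Model: "pcie-root-port",
					Addr: &PCIAddress{Bus: 0, Slot: slot}})
				nextBus++
				return nextBus - 1, nil
			}
		}
		return 0, errors.New("no free slot on bus 0")
	}
	for _, d := range devs { // d is a copy, the caller's slice is untouched
		if d.Addr == nil {
			bus, err := addPort()
			if err != nil {
				return nil, err
			}
			d.Addr = &PCIAddress{Bus: bus, Slot: 0}
		}
		out = append(out, d)
	}
	for i := 0; i < t.SpareRootPorts; i++ {
		if _, err := addPort(); err != nil {
			return nil, err
		}
	}
	return out, nil
}

func main() {
	devs := []Device{
		{Kind: "disk", Model: "virtio-blk"},
		{Kind: "nic", Model: "virtio-net"},
	}
	full, err := Expand("x86_64", "q35", Tunables{SpareRootPorts: 1}, devs)
	if err != nil {
		panic(err)
	}
	// JSON is only an optional serialization of the structs.
	js, _ := json.MarshalIndent(full, "", "  ")
	fmt.Println(string(js))
}
```

A Go application would consume the returned slice directly and only
produce domain XML (or whatever format it uses) at the point where the
guest is instantiated; the JSON step in main() is there just to show
that a document format can be layered on top rather than being the core
of the API.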