On 5/4/20 5:15 PM, Christophe de Dinechin wrote:
>
>
>> On 4 Mar 2020, at 13:50, Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote:
>>
>> We've been doing a lot of refactoring of code in recent times, and also
>> have plans for significant infrastructure changes. We still need to
>> spend time delivering interesting features to users / applications.
>> This mail is to introduce an idea for a solution to a specific
>> area applications have had long term pain with libvirt's current
>> "mechanism, not policy" approach - device addressing. This is a way
>> for us to show brand new ideas & approaches for what the libvirt
>> project can deliver in terms of management APIs.
>>
>> To set expectations straight: I have written no code for this yet,
>> merely identified the gap & conceptual solution.
>>
>>
>> The device addressing problem
>> =============================
>>
>> One of the key jobs libvirt does when processing a new domain XML
>> configuration is to assign addresses to all devices that are present.
>> This involves adding various device controllers (PCI bridges, PCI root
>> ports, IDE/SCSI buses, USB controllers, etc) if they are not already
>> present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each
>> device so they are associated with controllers. When libvirt spawns a
>> QEMU guest, it will pass full address information to QEMU.
>>
>> Libvirt, as a general rule, aims to avoid defining and implementing
>> policy around expansion of guest configuration / defaults, however, it
>> is inescapable in the case of device addressing due to the need to
>> guarantee a stable hardware ABI to make live migration and save/restore
>> to disk work. The policy that libvirt has implemented for device
>> addressing is, as much as possible, the same as the addressing scheme
>> QEMU would apply itself.
>>
>> While libvirt succeeds in its goal of providing a stable hardware ABI,
>> the addressing scheme used is not well suited to all deployment
>> scenarios of QEMU.
>> This is an inevitable result of having a specific
>> assignment policy implemented in libvirt which has to trade off mutually
>> incompatible use cases/goals.
>>
>> When the libvirt addressing policy is not sufficient, management
>> applications are forced to take on address assignment themselves,
>> which is a massive non-trivial job with many subtle problems to
>> consider.
>>
>> Places where libvirt's addressing is insufficient for PCI include
>>
>>  * Setting up multiple guest NUMA nodes and associating devices to
>>    specific nodes
>>  * Pre-emptive creation of extra PCIe root ports, to allow for later
>>    device hotplug on PCIe topologies
>>  * Determining whether to place a device on a PCI or PCIe bridge
>>  * Controlling whether a device is placed into a hotpluggable slot
>>  * Controlling whether a PCIe root port supports hotplug or not
>>  * Determining whether to place all devices on distinct slots or
>>    buses, vs grouping them all into functions on the same slot
>>  * Ability to expand the device addressing without being on the
>>    hypervisor host
>>
>> Libvirt wishes to avoid implementing many different address assignment
>> policies. It also wishes to keep the domain XML as a representation
>> of the virtual hardware, not add a bunch of properties to it which
>> merely serve as tunable input parameters for device addressing
>> algorithms.
>>
>> There is thus a dilemma here. Management applications increasingly
>> need fine grained control over device addressing, while libvirt
>> doesn't want to expose fine grained policy controls via the XML.
>>
>>
>> The new libvirt-devaddr API
>> ===========================
>>
>> The way out of this is to define a brand new virt management API
>> which tackles this specific problem in a way that addresses all the
>> problems mgmt apps have with device addressing and explicitly
>> provides a variety of policy impls with tunable behaviour.
>>
>> By "new API", I actually mean an entirely new library, completely
>> distinct from libvirt.so, or anything else we've delivered so
>> far. The closest we've come to delivering something at this kind
>> of conceptual level, would be the abortive attempt we made with
>> "libvirt-builder" to deliver a policy-driven API instead of mechanism
>> based. This proposal is still quite different from that attempt.
>>
>> At a high level
>>
>>  * The new API is "libvirt-devaddr" - short for "libvirt device addressing"
>>
>>  * As input it will take
>>
>>     1. The guest CPU architecture and machine type
>>     2. A list of global tunables specifying desired behaviour of the
>>        address assignment policy
>>     3. A minimal list of devices needed in the virtual machine, with
>>        optional addresses and optional per-device tunables to override
>>        the global tunables
>>
>>  * As output it will emit
>>
>>     1. fully expanded list of devices needed in the virtual machine,
>>        with addressing information sufficient to ensure stable hardware ABI
>>
>> Initially the API would implement something that behaves the same
>> way as libvirt's current address assignment API.
>>
>> The intended usage would be
>>
>>  * Mgmt application makes a minimal list of devices they want in
>>    their guest
>>  * List of devices is fed into libvirt-devaddr API
>>  * Mgmt application gets back a full list of devices & addresses
>>  * Mgmt application writes a libvirt XML doc using this full list &
>>    addresses
>>  * Mgmt application creates the guest in libvirt
>
> +Adrian, +Andrea, +Michal
>
> It dawned on me that kata may provide an additional “borderline”
> usage model for this new API. Specifically, it might be a case where
> the tunables may be “relayed” through kata-runtime, but really
> originate from OpenShift.
>

The OCI Device specification is mknod-based [1] and carries no
bus-specific information, so I think all of the logic would be
implemented by kata-runtime.
However, the Device Plugin specifies an ENV variable with the host PCI
address.

> Also, what about in-guest device naming / assignment?

This is a problem because the ENV var will not match the guest's device
address. I don't see a way around this without having a deterministic
way of addressing devices and modifying/complementing that higher level
information.

>
> Adrian, do you think that the iommu group issues you ran into
> could help Dan validate that the new library has all the input it
> needs to make a sane choice in that case?
>

I don't think the iommu group problem would require interaction with
the library. Kata agent was just mknod-ing the devices. Fixed in [2].

> Do you think that it would be possible to call the library twice
> with different tunables in order to get the host and guest device
> names?
>

I don't think I fully understand your proposal. Once qemu is called
with a specific set of device addresses, what could possibly be done in
the guest?

In order to be able to consume the devices, the application would need
to know the host->guest address mappings. Whether that mapping is
exposed via kata-agent, an ENV var or other means is yet to be
discussed.

With regard to the library itself, I think it would alleviate some of
the logic currently being implemented in kata-runtime, which includes
things like:

- Determining whether the device's BAR size is small enough for it to
  be hot-plugged in a pci bridge
- Determining whether the machine type supports hotplugging on the root
  bus, or whether root-ports need to be pre-allocated.

Related work: [4] [5] and associated PRs

>>
>> IOW, this new "libvirt-devaddr" API is intended to be used prior to
>> creating the XML that is used by libvirt. The API could also be used
>> prior to needing to hotplug a new device to an existing guest.
>> This API is intended to be a deliverable of the libvirt project, but
>> it would be completely independent of the current libvirt API.
>> Most especially note that it would NOT use the domain XML in any way.
>> This gives applications maximum flexibility in how they consume this
>> functionality, not trying to force a way to build domain XML.
>>
>>
>> It would have greater freedom in its API design, making different
>> choices from libvirt.so on topics such as programming language (C vs
>> Go vs Python etc), API stability timeframe (forever stable vs sometimes
>> changing API), data formats (structs, vs YAML/JSON vs XML etc), and of
>> course the conceptual approach (policy vs mechanism)
>>
>> The expectation is that this new API would be most likely to be
>> consumed by KubeVirt, OpenStack, Kata, as the list of problems shown
>> earlier is directly based on issues seen working with KubeVirt &
>> OpenStack in particular. It is not limited to these applications and
>> is broadly useful as a conceptual thing.
>>
>> It would be a goal that this API should also be used by libvirt
>> itself to replace its current internal device addressing impl.
>> Essentially the new API should be seen as a way to expose/extract
>> the current libvirt internal algorithm, making it available to
>> applications in a flexible manner. I don't anticipate actually copying
>> the current addressing code in libvirt as-is, but it would certainly
>> serve as reference for the kind of logic we need to implement, so you
>> might consider it a "port" or "rewrite" in some very rough sense.
>>
>> I think this new API concept is a good way for the project to make a start
>> in using Go for libvirt. The functionality covered has a clearly defined
>> scope limit, making it practical to deliver a real impl in a reasonably
>> short time frame. Extracting this will provide a real world benefit to
>> our application consumers, solving many long standing problems they have
>> with libvirt, and thus justify the effort in doing this work in libvirt
>> in a non-C language.
>> The main question mark would be about how we might
>> make this functionality available to Python apps if we chose Go. It is
>> possible to expose a C API from Go, and we would need this to consume it
>> from libvirt. There is then the need to manually write a Python API binding
>> which is tedious work.
>>
>> Regards,
>> Daniel
>> --
>> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
>> |: https://libvirt.org -o- https://fstop138.berrange.com :|
>> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>>
>

[1] https://github.com/opencontainers/runtime-spec/blob/2a060269036678148a707a92eeec6d2f8ee50553/specs-go/config.go#L378
[2] https://github.com/kata-containers/runtime/pull/2550/commits/4d2574a7230e5a1bd45302f91f2701e50a6e57f2
[3] https://github.com/kata-containers/runtime/issues/115
[4] https://github.com/kata-containers/runtime/issues/2432
[5] https://github.com/kata-containers/runtime/issues/2460
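
To make the input/output contract described in the proposal a bit more
concrete, here is a rough Go sketch of what a caller might see. No code
exists for libvirt-devaddr yet, so everything here is an assumption made
for illustration: the names Tunables, Device, Request, Expand and
FirstFreeSlot are hypothetical, and the "policy" is a toy that hands out
slots on a single flat PCI bus (32 slots, 0-31). It keeps any address the
caller supplied and fills in the rest. A real implementation would also
have to add controllers, root ports and so on, which this does not attempt.

```go
package main

import "fmt"

// NOTE: all types and functions below are hypothetical illustrations.
// They are not part of libvirt or of any published libvirt-devaddr API.

// Tunables holds global knobs for the (toy) assignment policy.
type Tunables struct {
	FirstFreeSlot int // lowest PCI slot the policy may hand out
}

// Device is one entry of the minimal device list.
type Device struct {
	Name string
	Slot *int // nil means "let the policy pick an address"
}

// Request bundles the three inputs named in the proposal:
// architecture + machine type, global tunables, minimal device list.
type Request struct {
	Arch, Machine string
	Tunables      Tunables
	Devices       []Device
}

// Expand returns the device list with every device addressed.
// Caller-supplied slots are kept; the rest get the lowest free slot.
func Expand(req Request) ([]Device, error) {
	used := map[int]bool{}
	// First pass: reserve every slot the caller pinned explicitly.
	for _, d := range req.Devices {
		if d.Slot == nil {
			continue
		}
		if used[*d.Slot] {
			return nil, fmt.Errorf("slot %d requested twice", *d.Slot)
		}
		used[*d.Slot] = true
	}
	// Second pass: fill in the missing addresses in list order.
	out := make([]Device, 0, len(req.Devices))
	next := req.Tunables.FirstFreeSlot
	for _, d := range req.Devices {
		if d.Slot == nil {
			for used[next] {
				next++
			}
			if next > 31 { // a single PCI bus has slots 0-31
				return nil, fmt.Errorf("no free slot for %s", d.Name)
			}
			slot := next
			used[slot] = true
			d.Slot = &slot
		}
		out = append(out, d)
	}
	return out, nil
}
```

A caller would build a Request with its minimal device list, call Expand,
and then write the returned slots into the domain XML it generates. Because
explicit slots are preserved, feeding the output back in as input yields the
same addresses, which is the property the stable hardware ABI depends on.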