Re: Kata needs for device addressing (was Re: libvirt-devaddr: a new library for device address assignment)


On 5/4/20 5:15 PM, Christophe de Dinechin wrote:
> 
> 
>> On 4 Mar 2020, at 13:50, Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote:
>>
>> We've been doing a lot of refactoring of code in recent times, and also
>> have plans for significant infrastructure changes. We still need to
>> spend time delivering interesting features to users / applications.
>> This mail is to introduce an idea for a solution to a specific
>> area where applications have had long term pain with libvirt's current
>> "mechanism, not policy" approach - device addressing. This is a way
>> for us to show brand new ideas & approaches for what the libvirt
>> project can deliver in terms of management APIs.
>>
>> To set expectations straight: I have written no code for this yet,
>> merely identified the gap & conceptual solution.
>>
>>
>> The device addressing problem
>> =============================
>>
>> One of the key jobs libvirt does when processing a new domain XML
>> configuration is to assign addresses to all devices that are present.
>> This involves adding various device controllers (PCI bridges, PCI root
>> ports, IDE/SCSI buses, USB controllers, etc) if they are not already
>> present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each
>> device so they are associated with controllers. When libvirt spawns a
>> QEMU guest, it will pass full address information to QEMU.
>>
>> Libvirt, as a general rule, aims to avoid defining and implementing
>> policy around expansion of guest configuration / defaults, however, it
>> is inescapable in the case of device addressing due to the need to
>> guarantee a stable hardware ABI to make live migration and save/restore
>> to disk work.  The policy that libvirt has implemented for device
>> addressing is, as much as possible, the same as the addressing scheme
>> QEMU would apply itself.
>>
>> While libvirt succeeds in its goal of providing a stable hardware ABI,
>> the addressing scheme used is not well suited to all deployment
>> scenarios of QEMU. This is an inevitable result of having a specific
>> assignment policy implemented in libvirt which has to trade off mutually
>> incompatible use cases/goals.
>>
>> When the libvirt addressing policy has not been sufficient, management
>> applications are forced to take on address assignment themselves,
>> which is a massive non-trivial job with many subtle problems to
>> consider.
>>
>> Places where libvirt's addressing is insufficient for PCI include
>>
>> * Setting up multiple guest NUMA nodes and associating devices to
>>   specific nodes
>> * Pre-emptive creation of extra PCIe root ports, to allow for later
>>   device hotplug on PCIe topologies
>> * Determining whether to place a device on a PCI or PCIe bridge
>> * Controlling whether a device is placed into a hotpluggable slot
>> * Controlling whether a PCIe root port supports hotplug or not
>> * Determining whether to place all devices on distinct slots or
>>   buses, vs grouping them all into functions on the same slot
>> * Ability to expand the device addressing without being on the
>>   hypervisor host
>>
>> Libvirt wishes to avoid implementing many different address assignment
>> policies. It also wishes to keep the domain XML as a representation
>> of the virtual hardware, not add a bunch of properties to it which
>> merely serve as tunable input parameters for device addressing
>> algorithms.
>>
>> There is thus a dilemma here. Management applications increasingly
>> need fine grained control over device addressing, while libvirt
>> doesn't want to expose fine grained policy controls via the XML.
>>
>>
>> The new libvirt-devaddr API
>> ===========================
>>
>> The way out of this is to define a brand new virt management API
>> which tackles this specific problem in a way that addresses all the
>> problems mgmt apps have with device addressing and explicitly
>> provides a variety of policy impls with tunable behaviour.
>>
>> By "new API", I actually mean an entirely new library, completely
>> distinct from libvirt.so, or anything else we've delivered so
>> far. The closest we've come to delivering something at this kind
>> of conceptual level, would be the abortive attempt we made with
>> "libvirt-builder" to deliver a policy-driven API instead of mechanism
>> based. This proposal is still quite different from that attempt.
>>
>> At a high level
>>
>> * The new API is "libvirt-devaddr" - short for "libvirt device addressing"
>>
>> * As input it will take
>>
>>   1. The guest CPU architecture and machine type
>>   2. A list of global tunables specifying desired behaviour of the
>>      address assignment policy
>>   3. A minimal list of devices needed in the virtual machine, with
>>      optional addresses and optional per-device tunables to override
>>      the global tunables
>>
>> * As output it will emit
>>
>>   1. fully expanded list of devices needed in the virtual machine,
>>      with addressing information sufficient to ensure stable hardware ABI
>>
>> Initially the API would implement something that behaves the same
>> way as libvirt's current address assignment API.
>>
>> The intended usage would be
>>
>> * Mgmt application makes a minimal list of devices they want in
>>   their guest
>> * List of devices is fed into libvirt-devaddr API
>> * Mgmt application gets back a full list of devices & addresses
>> * Mgmt application writes a libvirt XML doc using this full list &
>>   addresses
>> * Mgmt application creates the guest in libvirt
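
To make the workflow above concrete, here is a minimal Python sketch of what a
management application might do. It is not the proposed API (no code exists
for it yet): `Device`, `assign_addresses` and `address_xml` are hypothetical
names, and the policy is a toy one that puts one device per slot on bus 0.
Only the `<address type='pci' .../>` element follows libvirt's existing domain
XML format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """One guest device; `address` stays None until one is assigned."""
    name: str
    kind: str                      # e.g. "virtio-net", "virtio-blk"
    address: Optional[str] = None  # PCI address as "domain:bus:slot.function"


def assign_addresses(devices, first_slot=1):
    """Return a new list in which every device has a PCI address.

    Toy policy (an assumption, not libvirt's algorithm): one device per
    slot on bus 0, keeping any address the caller already supplied.
    """
    used = {d.address for d in devices if d.address is not None}
    slot = first_slot
    result = []
    for dev in devices:
        if dev.address is not None:
            result.append(dev)
            continue
        # Skip slots that are already taken by caller-supplied addresses.
        while f"0000:00:{slot:02x}.0" in used:
            slot += 1
        if slot > 31:  # a conventional PCI bus has slots 0..31
            raise ValueError("no free slots left on bus 0")
        addr = f"0000:00:{slot:02x}.0"
        used.add(addr)
        result.append(Device(dev.name, dev.kind, addr))
    return result


def address_xml(address):
    """Render an address as a libvirt domain XML <address> element."""
    domain, bus, rest = address.split(":")
    slot, function = rest.split(".")
    return (f"<address type='pci' domain='0x{domain}' bus='0x{bus}' "
            f"slot='0x{slot}' function='0x{function}'/>")


if __name__ == "__main__":
    minimal = [
        Device("net0", "virtio-net"),
        Device("disk0", "virtio-blk", "0000:00:01.0"),  # pinned by the app
        Device("disk1", "virtio-blk"),
    ]
    for dev in assign_addresses(minimal):
        print(dev.name, address_xml(dev.address))
```

The application would then embed each `<address>` element in the matching
device of the domain XML it hands to libvirt.
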
> 
> +Adrian, +Andrea, +Michal
> 
> It dawned on me that kata may provide an additional “borderline”
> usage model for this new API. Specifically, it might be a case where
> the tunables may be “relayed” through kata-runtime, but really
> originate from OpenShift.
> 
The OCI Device specification is mknod-based [1] and carries no bus-specific
information, so I think all of the logic would have to be implemented by
kata-runtime.
However, the Device Plugin specifies an ENV variable with the host PCI address.

> Also, what about in-guest device naming / assignment?
This is a problem because the ENV var will not match the guest's device address.
I don't see a way around this without having a deterministic way of addressing
devices and modifying/complementing that higher level information.

> 
> Adrian, do you think that the iommu group issues you ran into
> could help Dan validate that the new library has all the input it
> needs to make a sane choice in that case?
> 
I don't think the iommu group problem would require interaction with the
library. Kata agent was just mknod-ing the devices. Fixed in [2].

> Do you think that it would be possible to call the library twice
> with different tunables in order to get the host and guest device
> names?
I don't think I fully understand your proposal. Once qemu is called with a
specific set of device addresses, what could possibly be done in the guest?

In order to be able to consume the devices, the application would need to know
the host->guest address mappings. Whether that mapping is exposed via
kata-agent, ENV var or other means, is yet to be discussed.
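
As one possible illustration of the ENV var route, the sketch below rewrites
host PCI addresses in a container's environment to their guest equivalents.
Everything here is an assumption for the sake of the example: the function
name `translate_env`, the variable name `EXAMPLE_PCI_DEVICE`, the
comma-separated value format, and the idea that the runtime already holds a
host->guest mapping.

```python
def translate_env(env, host_to_guest):
    """Return a copy of `env` with host PCI addresses swapped for guest ones.

    Values are treated as comma-separated lists (an assumption); any item
    that is not a known host address is passed through unchanged.
    """
    translated = {}
    for key, value in env.items():
        items = value.split(",")
        translated[key] = ",".join(host_to_guest.get(i, i) for i in items)
    return translated


if __name__ == "__main__":
    # Hypothetical mapping recorded when the device was attached to the guest.
    mapping = {"0000:3b:00.0": "0000:00:05.0"}
    env = {"EXAMPLE_PCI_DEVICE": "0000:3b:00.0", "PATH": "/usr/bin"}
    print(translate_env(env, mapping))
```

This only works if the guest addresses are known deterministically before the
workload starts, which is the point made above.
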

WRT the library itself, I think it would alleviate some of the logic
currently being implemented in kata-runtime that includes things like:
- Determining whether the device's BAR size is small enough for it to be
  hot-plugged into a PCI bridge
- Determining whether the machine type supports hotplugging on the root bus, or
  whether root ports need to be pre-allocated.
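
A rough sketch of the two checks above, written as a single decision
function. This is not the kata-runtime code: the function name, the
`bridge_bar_limit` threshold and the returned labels are all hypothetical, and
the ordering of the checks is an assumption.

```python
def choose_attachment(bar_size, bridge_bar_limit, root_bus_hotplug,
                      free_root_ports):
    """Pick where to hot-plug a PCI device (illustrative policy only).

    bar_size          largest BAR of the device, in bytes
    bridge_bar_limit  largest BAR a PCI bridge is assumed to accommodate
    root_bus_hotplug  True if the machine type can hotplug on the root bus
    free_root_ports   number of pre-allocated, still unused root ports
    """
    if bar_size <= bridge_bar_limit:
        return "pci-bridge"
    if root_bus_hotplug:
        return "root-bus"
    if free_root_ports > 0:
        return "root-port"
    raise RuntimeError("no pre-allocated root port left for this device")


if __name__ == "__main__":
    MiB = 1024 * 1024
    # The 4 MiB limit is a made-up value for the example.
    print(choose_attachment(1 * MiB, 4 * MiB, False, 0))
    print(choose_attachment(256 * MiB, 4 * MiB, False, 2))
```
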

Related work: [4] [5] and associated PRs

>>
>> IOW, this new "libvirt-devaddr" API is intended to be used prior to
>> creating the XML that is used by libvirt. The API could also be used
>> prior to needing to hotplug a new device to an existing guest.
>> This API is intended to be a deliverable of the libvirt project, but
>> it would be completely independent of the current libvirt API. Most
>> especially note that it would NOT use the domain XML in any way.
>> This gives applications maximum flexibility in how they consume this
>> functionality, not trying to force a way to build domain XML.
>>
>>
>> It would have greater freedom in its API design, making different
>> choices from libvirt.so on topics such as programming language (C vs
>> Go vs Python etc), API stability timeframe (forever stable vs sometimes
>> changing API), data formats (structs, vs YAML/JSON vs XML etc), and of
>> course the conceptual approach (policy vs mechanism)
>>
>> The expectation is that this new API would be most likely to be
>> consumed by KubeVirt, OpenStack, Kata, as the list of problems shown
>> earlier is directly based on issues seen working with KubeVirt &
>> OpenStack in particular. It is not limited to these applications and
>> is broadly useful as a conceptual thing.
>>
>> It would be a goal that this API should also be used by libvirt
>> itself to replace its current internal device addressing impl.
>> Essentially the new API should be seen as a way to expose/extract
>> the current libvirt internal algorithm, making it available to
>> applications in a flexible manner. I don't anticipate actually copying
>> the current addressing code in libvirt as-is, but it would certainly
>> serve as reference for the kind of logic we need to implement, so you
>> might consider it a "port" or "rewrite" in some very rough sense.
>>
>> I think this new API concept is a good way for the project to make a start
>> in using Go for libvirt. The functionality covered has a clearly defined
>> scope limit, making it practical to deliver a real impl in a reasonably
>> short time frame. Extracting this will provide a real world benefit to
>> our application consumers, solving many long standing problems they have
>> with libvirt, and thus justify the effort in doing this work in libvirt
>> in a non-C language. The main question mark would be about how we might
>> make this functionality available to Python apps if we chose Go. It is
>> possible to expose a C API from Go, and we would need this to consume it
>> from libvirt. There is then the need to manually write a Python API binding
>> which is tedious work.
>>
>> Regards,
>> Daniel
>> -- 
>> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
>> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
>> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>>
> 

[1]
https://github.com/opencontainers/runtime-spec/blob/2a060269036678148a707a92eeec6d2f8ee50553/specs-go/config.go#L378
[2]
https://github.com/kata-containers/runtime/pull/2550/commits/4d2574a7230e5a1bd45302f91f2701e50a6e57f2
[3] https://github.com/kata-containers/runtime/issues/115
[4] https://github.com/kata-containers/runtime/issues/2432
[5] https://github.com/kata-containers/runtime/issues/2460





