The problem(s)
==============

While a hypervisor agnostic API is useful for some users, it is completely irrelevant, and potentially even painful, for other users. We made some concessions to this when we introduced hypervisor specific XML namespaces and the option for hypervisor specific add-on APIs. We tell apps these are all unsupported for production usage though. IOW, have this pony, but you can never play with it.

The hypervisor agnostic API approach inevitably took us in a direction where libvirt (or something below it) is in charge of managing the QEMU process lifecycle. We can't expose the concept of process management up to the client application because many hypervisors don't present virtual machines as UNIX processes, or merely have processes as a secondary concept, eg with Xen a QEMU process is just subservient to the main Xen guest domain concept.

Essentially libvirt expects the application to treat the hypervisor / compute host as a black box and just rely on libvirt APIs for managing the virtual machines, because that is the only way to provide a hypervisor agnostic view of a compute host. This approach also gives parity of functionality regardless of whether the management app is on a remote machine, vs colocated locally with libvirtd.

Most of the large scale management applications have ended up with a model where they have a component on each compute host talking to libvirt locally over a UNIX socket, with TCP based access only really used for live migration. Thus the management apps have rarely considered the Linux OS to truly be a black box when dealing with KVM. To some degree they all peer inside the box, and wish to take advantage of some of the concepts Linux exposes to integrate with the hypervisor.

The inability to directly associate a client with the lifecycle of a single QEMU process has long been a source of frustration to libguestfs. The level of indirection forced by use of libvirtd does not map well to how libguestfs wants to use QEMU. Essentially libguestfs isn't trying to use QEMU in a system management scenario, but rather to utilize it as an embedded technology component. As a result, libguestfs still has its own non-libvirt based way of spawning QEMU which is often used in preference to its libvirt based implementation. Other apps like libvirt-sandbox have faced the same problem.

When systemd came into existence, finally providing good mechanisms for process management on Linux machines, we found a tension between what libvirt wants to do and what systemd wants to do. The best we've managed is a compromise where libvirt spawns the guest, but then registers it with systemd (illustrated below). Users can't directly spawn QEMU guests with systemd and then manage them with libvirt.

We've not seen people seriously try to manage QEMU guests directly with systemd, but it is fair to say that the combination of systemd and docker has taken away most potential users of libvirt's LXC driver, as apps managing containers don't want to treat the host as a black box; they want more direct control. The failure to get adoption of the LXC driver serves as a cautionary tale for what could happen to use of the libvirt QEMU driver in future.

More recently the increasing interest in use of containers is raising new interesting architectures for the management of processes. In particular the Kubernetes project can be considered to provide cluster-wide management of processes, aka k8s is systemd for data centers. Again there is interest in using Kubernetes to manage QEMU guests across the data center.
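To make the current compromise concrete, this is roughly what the arrangement looks like on a systemd host today: libvirtd forks QEMU itself and only afterwards registers the guest with systemd-machined, so the guest shows up as a machine with its own scope cgroup, but systemd never spawned the process and has no say in its lifecycle. The output below is indicative only; names, columns and exact layout vary by host and systemd version.

  # The guest registered by libvirtd appears as a machine ...
  $ machinectl list
  MACHINE        CLASS  SERVICE
  qemu-1-demo    vm     libvirt-qemu

  # ... and lives in a scope cgroup created on libvirtd's behalf
  $ systemd-cgls /machine.slice
  Control group /machine.slice:
  └─machine-qemu\x2d1\x2ddemo.scope
    └─12345 /usr/bin/qemu-system-x86_64 -name guest=demo ...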
The KubeVirt project is attempting to bridge the conflicting world views of libvirt and Kubernetes to build a KVM management system to eventually replace both oVirt and OpenStack. The need to have libvirtd spawn the QEMU processes is causing severe complications for the KubeVirt architecture, causing them to seriously consider not using libvirt for managing KVM. This issue is a major blocking item for KubeVirt, to the extent that they may well have to abandon use of libvirt to get the process startup & resource integration model they need.

On Linux, as far as hypervisor technology is concerned, KVM has won the battles and the war. OpenStack user surveys have consistently put KVM/QEMU on top, with at least one order of magnitude higher usage than any other technology. Amazon was always the major reference for usage of Xen in public cloud, and even they appear to be about to pivot to KVM. IOW, while providing a hypervisor agnostic management API is still a core competency of libvirt, we need to embrace the reality that KVM is the de facto standard on Linux and better enable people to take advantage of its unique features, because that is where most of our userbase is.

A second example of the limitations of the purely hypervisor agnostic approach to libvirt is the way our API design is fully synchronous. An application calling a libvirt API blocks until its execution is complete. This approach was originally driven by the need to integrate directly with various Xen backend APIs which were also mostly synchronous in design. Later we added other hypervisor targets which also exposed synchronous APIs. In parallel though, we added the libvirtd daemon for running stateful hypervisor drivers like QEMU, LXC, UML, and now Xen. We speak to this over an RPC system that can handle arbitrarily overlapping asynchronous requests, but then force it into our synchronous public API. For applications which only care about using KVM, the ability to use an asynchronous API could be very interesting, as it would no longer force them to spawn large numbers of threads to get parallel API execution.

The Solution(s)
===============

Currently our long term public stability promise just covers the XML format and library API. To enable more interesting usage of hypervisor specific concepts, it is important to consider how to provide other options beyond just the current API and XML formats. IOW, I'm not talking about making QMP command passthrough or CLI arg passthrough fully supported features, as libvirt's API & XML abstraction has clear value there. Rather, I'm thinking about more architectural level changes.

In particular I want to try to break down the black box model of the host, to make it possible to exploit KVM's key distinguishing feature, which is that the guest is just a normal process. An application that knows how to spawn & reap processes should be able to launch KVM as if it were just another normal process. This implies that the application needs the option to handle the fork+exec of KVM itself, instead of libvirt, if it so wishes.

I would anticipate a standalone process "libvirt-qemu" that an application can spawn, providing a normal domain XML file via the command line or stdin. It would then connect to libvirtd to register its existence and claim ownership of the guest name + UUID. Assuming that succeeds, "libvirt-qemu" would directly spawn QEMU. In this manner, the QEMU process automatically inherits all the characteristics of the application that invoked the "libvirt-qemu" binary. This means it shares the user / group ID, the security context, the cgroup placement, the set of kernel namespaces, etc. Libvirt would honour these characteristics by default, but would also have the ability to further refine them. For example, it would honour the initial process CPU pinning, but could still further pin individual QEMU threads.
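To make the idea concrete, here is a sketch of how an application might invoke such a shim. Nothing below exists today: the "libvirt-qemu" binary comes from this proposal, and its arguments, paths and properties are purely illustrative.

  # Hypothetical: spawn a guest directly. The QEMU process would inherit
  # the caller's uid/gid, security context, cgroup and namespaces, while
  # the shim registers the guest name + UUID with libvirtd.
  $ libvirt-qemu /path/to/demo.xml

  # The same idea driven by systemd: the guest would live in the transient
  # unit's cgroup and be reaped by systemd, yet still be visible to the
  # normal libvirt APIs for ongoing management.
  $ systemd-run --unit=demo-vm -p CPUAffinity=0-3 \
        /usr/bin/libvirt-qemu /etc/libvirt/qemu/demo.xml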
In the initial implementation I would anticipate that libvirtd still retains control over pretty much every other aspect of ongoing QEMU management, ie libvirtd still owns the monitor connection. This means there would be some assumptions / limitations in functionality in the short term, eg it might be assumed that while libvirtd & libvirt-qemu can be in different mount namespaces, they must none the less be able to see the same underlying storage in their respective namespaces. The next mail in this series, however, takes things further and moves actual driver functionality into libvirt-qemu, at which point the limitations around namespaces would be largely eliminated.

This design would solve the single biggest problem with managing QEMU from apps like libguestfs, systemd and KubeVirt. To avoid having two divergent launch paths, when libvirtd itself launches a QEMU process, it would have to use the same "libvirt-qemu" shim to do so. This would ensure functional equivalence regardless of whether the management app used the hypervisor agnostic API, or instead used the QEMU specific approach of running "libvirt-qemu".

We made a crude attempt previously to allow apps to run their own QEMU and have it managed by libvirt, via the virDomainQemuAttach API. That API design was impossible to ever consider fully supported, because the mgmt application was still in charge of designing the QEMU command line arguments, and it is impractical for libvirt to cope with an arbitrary set of args. With the new proposal, we're still using the current libvirt code for converting XML into QEMU args, so we have a predictable configuration for QEMU. Thus the new approach can provide a fully supported way for applications to spawn QEMU.

This concept of a "libvirt-qemu" shim is not all that far away from the current "libvirt-lxc" shim we have. With this in mind, it would also be desirable to make that a fully supported way to spawn LXC processes, which can then be managed by libvirt. This would make the libvirt LXC driver more interesting for people who wish to run containers (though it is admittedly too late to really recapture any significant usage from other container technologies).

As mentioned earlier, if an application is only concerned with managing KVM (or other stateful drivers running inside libvirtd), we have scope to expose a fully asynchronous management API to applications. Such an undertaking would effectively mean creating an entirely new libvirt client library to expose the asynchronous design, and we obviously have to keep the current library around long term regardless. Creating a new library would also involve creating new language bindings, which just adds to the work.

Rather than undertake this massive amount of extra work, I think it is worth considering declaring the RPC protocol to be a fully supported interface for applications to consume. There are already projects which have re-implemented the libvirt client API directly on top of the RPC protocol, bypassing libvirt.so. We have always strongly discouraged this, but none the less it has happened. As we have to maintain strong protocol compatibility on the RPC layer, it is effectively a stable API already. We cannot ever change it in an incompatible manner without breaking our own client library implementation. So declaring it a formally supported interface for libvirt would not really involve any significant extra work on our part, just acknowledgement of the existing reality. It would perhaps involve some documentation work to assist developers wishing to consume it though.

We would also have to outline the caveats of taking such an approach, which principally involve losing the ability to use the stateless hypervisor drivers which all live in the libvirt library. This is not a real issue though, because the people building on top of the RPC protocol only care about KVM.
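For anyone unfamiliar with the wire format, the sketch below shows the approximate shape of the header that prefixes every libvirt RPC packet. The authoritative XDR definitions live in src/rpc/virnetprotocol.x and src/remote/remote_protocol.x in the libvirt source tree; the struct and field names here are paraphrased, not copied. The point to note is the serial field: it lets a client have many calls in flight and match replies arriving out of order, ie the asynchrony is already there on the wire even though libvirt.so hides it behind a synchronous API.

  #include <stdint.h>

  /* Rough sketch only - consult virnetprotocol.x for the real XDR types */
  struct virNetMessageHeaderSketch {
      uint32_t prog;    /* which protocol program, eg the remote driver    */
      uint32_t vers;    /* protocol version                                */
      int32_t  proc;    /* which procedure is being invoked                */
      int32_t  type;    /* call, reply, async event or stream packet       */
      uint32_t serial;  /* matches a reply to its call, which is what      */
                        /* makes overlapping asynchronous requests work    */
      int32_t  status;  /* ok / error / continue (used by streams)         */
  };
  /* On the wire each packet is a length word, then this header XDR
   * encoded, then the XDR encoded payload for the procedure. */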
Another example where exposing a KVM specific model might help is wrt live migration, specifically the initial launch of QEMU on the target host. Our libvirt migration API doesn't give the application direct control over this part, which has caused apps like OpenStack to jump through considerable hoops when doing live migration. So just as an application should be able to launch the initial QEMU process, it should be able to directly launch it ready for incoming migration, and then trigger live migration to use this pre-launched virtual machine.

In general the concept is that although the primary libvirt.so API will still consider the virt host to be a black box, below this libvirt should not be afraid to open up the black box to applications, exposing hypervisor specific details as fully supported concepts. Applications can opt in to using this, or continue to solely use the hypervisor agnostic API, as best fits their needs.

Regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|