Hi, this is a proposal for introducing a new family of APIs in libvirt, with the goal of improving integration with management applications. KubeVirt is intended to be the primary consumer of these APIs.

Background
----------

KubeVirt makes it possible to run VMs on a Kubernetes cluster, side by side with containers. It does so by running QEMU and libvirtd themselves inside a container. The architecture is explained in more detail at

  https://kubevirt.io/user-guide/architecture/

but for the purpose of this discussion we only need to keep in mind two components:

  * virt-launcher
      - runs in the same container as QEMU and libvirtd
      - one instance per VM

  * virt-handler
      - runs in a separate container
      - one instance per node

Conceptually, these two components roughly map to QEMU processes and libvirtd respectively.

From a security perspective, there is a strong push in Kubernetes to run workloads under unprivileged user accounts and without additional capabilities. Again, this is similar to how libvirtd itself runs as root while the QEMU processes it starts run under the unprivileged "qemu" account.

KubeVirt has been working towards the goal of running VMs as completely unprivileged workloads and has made excellent progress so far. Some of the operations needed for running a VM, however, inherently require elevated privilege. In KubeVirt, this conundrum is solved by having virt-handler (a privileged component) take care of those operations, making it possible for virt-launcher (as well as QEMU and libvirtd) to run in an unprivileged context.

Examples
--------

Here are a few examples of how KubeVirt has been able to reduce the privilege required by virt-launcher by selectively handing over responsibilities to virt-handler:

  * Remove SYS_RESOURCE capability from launcher pod
    https://github.com/kubevirt/kubevirt/pull/2584

  * Drop SYS_RESOURCE capability
    https://github.com/kubevirt/kubevirt/pull/5558

  * Housekeeping cgroup
    https://github.com/kubevirt/kubevirt/pull/8233

  * Real time VMs fail to change vCPU scheduler and priority in
    non-root deployments
    https://github.com/kubevirt/kubevirt/pull/8750

  * virt-launcher: Drop SYS_PTRACE capability
    https://github.com/kubevirt/kubevirt/pull/8842

The pattern we can see is that, initially, libvirt just assumes that it can perform a certain privileged operation. This fails in the context of KubeVirt, where libvirtd runs with significantly reduced privileges. As a consequence, libvirt is patched to be more resilient to such lack of privilege: for example, instead of attempting to create a file and erroring out due to lack of permissions, it will first check whether the file already exists and, if it does, assume that it has been prepared ahead of time by an external entity.

Limitations
-----------

This approach works fine, but only for the privileged operations that would be performed by libvirt before the VM starts running.

Looking at the "housekeeping cgroup" PR in particular, we notice that the VM is initially created in the paused state: this is necessary in order to create a point in time in which all the VM threads already exist but, crucially, none of the vCPUs have started running yet. This is the only opportunity to move threads across cgroups without invalidating the expectations of a real time workload.
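For illustration, here is a minimal sketch (in C, against the existing public libvirt API) of the cold-boot sequence this relies on; moveHousekeepingThreads() is a hypothetical stand-in for the cgroup manipulation that virt-handler performs, and error reporting is omitted:

  #include <libvirt/libvirt.h>

  static int
  startVMWithHousekeepingCgroup(virConnectPtr conn, const char *xml)
  {
      virDomainPtr dom;

      /* start the VM in the paused state: all threads (vCPUs
       * included) exist at this point, but no guest code has
       * run yet */
      dom = virDomainCreateXML(conn, xml, VIR_DOMAIN_START_PAUSED);
      if (!dom)
          return -1;

      /* this is the window in which a privileged component
       * (virt-handler in the KubeVirt case) can move the emulator
       * threads into the housekeeping cgroup without disturbing
       * the real time vCPUs */
      if (moveHousekeepingThreads(dom) < 0)  /* hypothetical helper */
          goto error;

      /* only now are the vCPUs allowed to start running */
      if (virDomainResume(dom) < 0)
          goto error;

      virDomainFree(dom);
      return 0;

   error:
      virDomainFree(dom);
      return -1;
  }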
When it comes to live migration, however, there is no way to create similar conditions, since the VM is running on the destination host right out of the gate. As a consequence, live migration has to be blocked when the housekeeping cgroup is in use, which is an unfortunate limitation.

Moreover, there's an overall sense of fragility surrounding these interactions: both KubeVirt and, to some extent, libvirt need to be acutely aware of what the other component is going to do, but there is never an explicit handover, and the whole thing only works if everything happens to be done in exactly the right order and with exactly the right timing.

Proposal
--------

In order to address the issues outlined above, I propose that we introduce a new set of APIs in libvirt. These APIs would expose some of the inner workings of libvirt, and as such would come with *massively reduced* stability guarantees compared to the rest of our public API.

The idea is that applications such as KubeVirt, which track libvirt fairly closely and stay pinned to specific versions, would be able to adapt to changes in these APIs relatively painlessly. More traditional management applications such as virt-manager would simply not opt into using the new APIs and maintain the status quo.

Using memlock as an example, the new API could look like

  typedef int (*virInternalSetMaxMemLockHandler)(pid_t pid,
                                                 unsigned long long bytes);

  int virInternalSetProcessSetMaxMemLockHandler(virConnectPtr conn,
                                                virInternalSetMaxMemLockHandler handler);

The application-provided handler would be responsible for performing the privileged operation (in this case raising the memlock limit for a process). For KubeVirt, virt-launcher would have to pass the baton to virt-handler.

If such a handler is installed, libvirt would invoke it (and likely go through some sanity checks afterwards); if not, it would attempt to perform the privileged operation itself, as it does today.

This would make the interaction between libvirt and the management application explicit rather than implicit. Not having to stick to our usual API stability guarantees would make it possible to be more liberal in exposing the internals of libvirt as interaction points.

Scope
-----

I think we should initially limit the new APIs to the scenarios that have already been identified, then gradually expand the scope as needed. In other words, we shouldn't comb through the codebase looking for potential adopters. Since the intended consumers of these APIs are those that can adopt a new libvirt release fairly quickly, this shouldn't be a problem.

Once the pattern has been established, we might consider introducing support for it at the same time as a new feature that could benefit from it is added.

Caveats
-------

libvirt is all about API stability, so introducing an API that is unstable *by design* is completely uncharted territory.

To ensure that the new APIs are 100% opt-in, we could define them in a separate <libvirt/libvirt-internal.h> header. Furthermore, we could have a separate libvirt-internal.so shared library for the symbols and a corresponding libvirt-internal.pc pkg-config file. We could even go as far as requiring a preprocessor symbol such as VIR_INTERNAL_UNSTABLE_API_OPT_IN to be defined before the entry points are visible to the compiler. Whatever the mechanism, we would need to make sure that it's usable from language bindings as well.

Internal APIs are expected not only to come and go, but also to change semantics between versions. We should make sure that such changes are clearly exposed to the user, for example by requiring them to pass a version number to the function and erroring out immediately if the value doesn't match our expectations; a rough sketch of what this could look like for the adopting application follows.
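Putting the pieces together, an adopting application might end up writing something like the sketch below, which combines the memlock handler from the Proposal section with the opt-in and versioning mechanisms described above. None of this exists today: the header name, the VIR_INTERNAL_API_VERSION constant, the extra version parameter and the askVirtHandlerToRaiseMemlock() helper are all hypothetical.

  /* explicitly opt in to the unstable API before including the
   * (hypothetical) header that declares it */
  #define VIR_INTERNAL_UNSTABLE_API_OPT_IN
  #include <libvirt/libvirt-internal.h>

  /* handler invoked by libvirt whenever the memlock limit for a
   * QEMU process needs to be raised; in the KubeVirt case it would
   * hand the request over to the privileged virt-handler */
  static int
  memLockHandler(pid_t pid, unsigned long long bytes)
  {
      return askVirtHandlerToRaiseMemlock(pid, bytes);
  }

  static int
  installHandler(virConnectPtr conn)
  {
      /* the application states which revision of the internal API
       * it was built against; on a mismatch libvirt errors out
       * immediately instead of misbehaving at some later point */
      return virInternalSetProcessSetMaxMemLockHandler(conn,
                                                       VIR_INTERNAL_API_VERSION,
                                                       memLockHandler);
  }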
KubeVirt has a massive suite of functional tests, so this kind of change would immediately be spotted when a new version of libvirt is imported, with no risk of an incompatibility lingering in the codebase until it affects users.

Disclaimer
----------

This proposal is intentionally vague on several of the details. Before attempting to nail those down, I want to gather feedback on the high-level idea, both from the libvirt and the KubeVirt side.

Credits
-------

Thanks to Michal and Martin for helping shape and polish the idea from its initial rough state.

-- 
Andrea Bolognani / Red Hat / Virtualization