On Sun, Mar 26, 2023 at 03:57:00PM +0300, Itamar Holder wrote:
> Hey all,
>
> I'm Itamar Holder, a Kubevirt developer.
> Lately we came across a problem w.r.t. properly supporting VMs with
> dedicated CPUs on Kubernetes. The full details can be seen in this PR
> <https://github.com/kubevirt/kubevirt/pull/8869> [1], but to make a very
> long story short, we would like to use two different containers in the
> virt-launcher pod that is responsible to run a VM:
>
>    - "Managerial container": would be allocated with a shared cpuset.
>      Would run all of the virtualization infrastructure, such as libvirtd
>      and its dependencies.
>    - "Emulator container": would be allocated with a dedicated cpuset.
>      Would run the qemu process.
>
> There are many reasons for choosing this design, but in short, the main
> reasons are that it's impossible to allocate both shared and dedicated
> cpus to a single container, and that it would allow finer-grained control
> and isolation for the different containers.
>
> Since there is no way to start the qemu process in a different container,
> I tried to start qemu in the "managerial" container, then move it into
> the "emulator" container. This fails however, since libvirt uses
> sched_setaffinity to pin the vcpus into the dedicated cpuset, which is
> not allocated to the managerial container, resulting in an EINVAL error.

What do you mean by 'move it'? Containers are a collection of kernel
namespaces, combined with cgroups placement. It isn't possible for an
external helper to change the namespaces of a process, so I'm presuming
you just mean that you tried to move the cgroups placement?

In theory, when spawning QEMU, libvirt ought to be able to place QEMU
into pretty much any cpuset cgroup and/or CPU affinity that is supported
by the system, even if this is completely distinct from what libvirtd
itself is running under. What is it about the multi-container-in-one-pod
approach that prevents you from being able to tell libvirt the desired
CPU placement?

I wonder, though, whether QEMU-level granularity is really the right
approach here. QEMU has various threads: vCPU threads, which I presume
are what you want to give dedicated resources to, but also I/O threads
and various emulator-related threads (migration, the QMP monitor, and
other misc stuff). If you move the entire QEMU process into a dedicated
CPU container, either these extra emulator threads will compete with the
vCPU threads, or you'll need to reserve extra host CPUs per VM, which
gets pretty wasteful - e.g. a 1 vCPU guest needs 2 host CPUs reserved.
OpenStack took this approach initially, but the inefficient hardware
utilization pushed it towards having a pool of shared CPUs for emulator
threads and dedicated CPUs for vCPU threads.

Expanding on the question of non-vCPU emulator threads, one way of
looking at the system is to consider that libvirtd is a conceptual part
of QEMU that merely happens to run in a separate process instead of a
separate thread. IOW, libvirtd is simply a few more non-vCPU emulator
thread(s), and as such any CPU placement done for non-vCPU emulator
threads should be done likewise for libvirtd threads. Trying to separate
non-vCPU threads from libvirtd threads is not a necessary goal.
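As a rough illustration of that split - purely a sketch, where the
domain name "demo-guest", the 8-CPU host and the particular CPU numbers
are made up - the existing public API already lets a client give the
vCPU threads and the emulator threads distinct host CPU placement, and
the same thing can be expressed up front with <vcpupin>/<emulatorpin>
under <cputune> in the domain XML:

  /* Sketch only: names and CPU numbers are illustrative, error
   * handling is trimmed. */
  #include <string.h>
  #include <libvirt/libvirt.h>

  int main(void)
  {
      virConnectPtr conn = virConnectOpen("qemu:///system");
      if (!conn)
          return 1;

      virDomainPtr dom = virDomainLookupByName(conn, "demo-guest");
      if (!dom) {
          virConnectClose(conn);
          return 1;
      }

      enum { HOST_CPUS = 8 };              /* assume an 8-CPU host */
      int maplen = VIR_CPU_MAPLEN(HOST_CPUS);
      unsigned char vcpumap[VIR_CPU_MAPLEN(HOST_CPUS)];
      unsigned char emumap[VIR_CPU_MAPLEN(HOST_CPUS)];
      memset(vcpumap, 0, maplen);
      memset(emumap, 0, maplen);

      VIR_USE_CPU(vcpumap, 3);             /* vCPU 0 -> dedicated host CPU 3 */
      VIR_USE_CPU(emumap, 0);              /* emulator threads -> shared CPUs 0-1 */
      VIR_USE_CPU(emumap, 1);

      virDomainPinVcpu(dom, 0, vcpumap, maplen);
      virDomainPinEmulator(dom, emumap, maplen, VIR_DOMAIN_AFFECT_LIVE);

      virDomainFree(dom);
      virConnectClose(conn);
      return 0;
  }

The point being that the placement granularity libvirt exposes is
per-thread-group, not per-process, so a shared/dedicated split does not
by itself require splitting QEMU away from libvirtd.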
> Therefore, I thought about discussing a new approach - introducing a
> small shim that could communicate with libvirtd in order to start and
> control the qemu process that would run on a different container.
>
> As I see it, the main workflow could be described as follows:
>
>    - The emulator container would start with the shim.
>    - libvirtd, running in the managerial container, would ask for some
>      information from the target, e.g. cpuset.
>    - libvirtd would create the domain xml and would transfer to the shim
>      everything needed in order to launch the guest.
>    - The shim, running in the emulator container, would run the
>      qemu-process.

The startup interaction between libvirt and QEMU is pretty complicated
code, and we have changed it reasonably often; I foresee a need to keep
changing it in the future in potentially quite significant/disruptive
ways. If we permit use of an external shim as described, that is likely
to constrain our ability to make changes to our startup process in the
future, which will have an impact on our ability to maintain libvirt.

With regards,
Daniel

-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|