On Sun, Mar 26, 2023 at 03:57:00PM +0300, Itamar Holder wrote:
> Hey all,
>
> I'm Itamar Holder, a Kubevirt developer.
> Lately we came across a problem w.r.t. properly supporting VMs with
> dedicated CPUs on Kubernetes. The full details can be seen in this PR
> <https://github.com/kubevirt/kubevirt/pull/8869> [1], but to make a very
> long story short, we would like to use two different containers in the
> virt-launcher pod that is responsible to run a VM:
>
>    - "Managerial container": would be allocated with a shared cpuset.
>      Would run all of the virtualization infrastructure, such as libvirtd
>      and its dependencies.
>    - "Emulator container": would be allocated with a dedicated cpuset.
>      Would run the qemu process.
>
> There are many reasons for choosing this design, but in short, the main
> reasons are that it's impossible to allocate both shared and dedicated
> cpus to a single container, and that it would allow finer-grained control
> and isolation for the different containers.
>
> Since there is no way to start the qemu process in a different container,
> I tried to start qemu in the "managerial" container, then move it into
> the "emulator" container. This fails however, since libvirt uses
> sched_setaffinity to pin the vcpus into the dedicated cpuset, which is
> not allocated to the managerial container, resulting in an EINVAL error.

What do you mean by 'move it'? Containers are a collection of kernel
namespaces, combined with cgroups placement. It isn't possible for an
external helper to change the namespaces of a process, so I'm presuming
you just mean that you tried to move the cgroups placement?

In theory, when spawning QEMU, libvirt ought to be able to place QEMU
into pretty much any cpuset cgroup and/or CPU affinity that is supported
by the system, even if this is completely distinct from what libvirtd
itself is running under. What is it about the multi-container-in-one-pod
approach that prevents you from being able to tell libvirt the desired
CPU placement?

I wonder, though, whether QEMU-level granularity is really the right
approach here. QEMU has various threads: vCPU threads, which I presume
are what you want to give dedicated resources to, but also I/O threads
and various emulator-related threads (migration, the QMP monitor, and
other misc stuff). If you move the entire QEMU process into a dedicated
CPU container, either these extra emulator threads will compete with the
vCPU threads, or you'll need to reserve extra host CPUs per VM, which
gets pretty wasteful - e.g. a 1 vCPU guest needs 2 host CPUs reserved.
OpenStack took this approach initially, but the inefficient hardware
utilization pushed it towards having a pool of shared CPUs for emulator
threads and dedicated CPUs for vCPU threads.

Expanding on the question of non-vCPU emulator threads, one way of
looking at the system is to consider that libvirtd is a conceptual part
of QEMU that merely happens to run in a separate process instead of a
separate thread. IOW, libvirtd is simply a few more non-vCPU emulator
thread(s), and as such any CPU placement done for non-vCPU emulator
threads should be done likewise for libvirtd threads. Trying to separate
non-vCPU threads from libvirtd threads is not a necessary goal.
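As a rough illustration of that split - purely a sketch, where the
domain name "demo-guest", the 8-CPU host and the particular CPU numbers
are made up - the existing public API already lets a client give the
vCPU threads and the emulator threads distinct host CPU placement, and
the same thing can be expressed up front with <vcpupin>/<emulatorpin>
under <cputune> in the domain XML:

  /* Sketch only: names and CPU numbers are illustrative, error
   * handling is trimmed. */
  #include <string.h>
  #include <libvirt/libvirt.h>

  int main(void)
  {
      virConnectPtr conn = virConnectOpen("qemu:///system");
      if (!conn)
          return 1;

      virDomainPtr dom = virDomainLookupByName(conn, "demo-guest");
      if (!dom) {
          virConnectClose(conn);
          return 1;
      }

      enum { HOST_CPUS = 8 };              /* assume an 8-CPU host */
      int maplen = VIR_CPU_MAPLEN(HOST_CPUS);
      unsigned char vcpumap[VIR_CPU_MAPLEN(HOST_CPUS)];
      unsigned char emumap[VIR_CPU_MAPLEN(HOST_CPUS)];
      memset(vcpumap, 0, maplen);
      memset(emumap, 0, maplen);

      VIR_USE_CPU(vcpumap, 3);             /* vCPU 0 -> dedicated host CPU 3 */
      VIR_USE_CPU(emumap, 0);              /* emulator threads -> shared CPUs 0-1 */
      VIR_USE_CPU(emumap, 1);

      virDomainPinVcpu(dom, 0, vcpumap, maplen);
      virDomainPinEmulator(dom, emumap, maplen, VIR_DOMAIN_AFFECT_LIVE);

      virDomainFree(dom);
      virConnectClose(conn);
      return 0;
  }

The point being that the placement granularity libvirt exposes is
per-thread-group, not per-process, so a shared/dedicated split does not
by itself require splitting QEMU away from libvirtd.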
> Therefore, I thought about discussing a new approach - introducing a
> small shim that could communicate with libvirtd in order to start and
> control the qemu process that would run on a different container.
>
> As I see it, the main workflow could be described as follows:
>
>    - The emulator container would start with the shim.
>    - libvirtd, running in the managerial container, would ask for some
>      information from the target, e.g. cpuset.
>    - libvirtd would create the domain xml and would transfer to the shim
>      everything needed in order to launch the guest.
>    - The shim, running in the emulator container, would run the
>      qemu-process.

The startup interaction between libvirt and QEMU is pretty complicated
code, and we have changed it reasonably often; I foresee a need to keep
changing it in the future in potentially quite significant/disruptive
ways. If we permit use of an external shim as described, that is likely
to constrain our ability to make changes to our startup process in the
future, which will have an impact on our ability to maintain libvirt.

With regards,
Daniel

-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|