Hi all,

KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes API resources. In this setup, libvirtd runs inside an unprivileged pod, with some host mounts and capabilities added to the pod as needed by libvirtd and other services.

One of the capabilities libvirtd requires for successful startup inside a pod is SYS_RESOURCE. This capability is used to adjust the RLIMIT_MEMLOCK ulimit value depending on the devices attached to the managed guest, both on startup and during hotplug. AFAIU, the memory is locked to avoid guest pages being pushed out of RAM into swap.

In the KubeVirt world, several of libvirtd's assumptions do not apply:

1. In Kubernetes environments, swap is usually disabled. (E.g., kubeadm, the official deployment tool, won't even initialize a cluster until you disable it.) This is documented in lots of places, for example: https://docs.platform9.com/kubernetes/disabling-swap-kubernetes-node/ (note: while these are vendor docs, it is nevertheless a well-known community recommendation).

2. Hotplug is not supported; a domain definition is stable throughout its whole lifetime.

We are working on a series of patches that would remove the need for the SYS_RESOURCE capability from the pod running libvirtd: https://github.com/kubevirt/kubevirt/pull/2584

We achieve this by having another, *privileged* component set RLIMIT_MEMLOCK for the libvirtd process using the prlimit() syscall, with a value higher than the final value libvirtd later sets with setrlimit(). [The Linux kernel allows lowering the value without the capability.] Since the formula to calculate the actual MEMLOCK value is embedded in libvirt and is not simple to reproduce externally, we pick the upper limit for the libvirtd process quite conservatively, even though ideally we would use exactly the same value as libvirtd does. The estimation code is here: https://github.com/kubevirt/kubevirt/pull/2584/files#diff-6edccf5f0d11c09e7025d4fae3fa6dc6

While the solution works, there are some drawbacks:

1. the value we use for prlimit() is not exactly equal to the final value used by libvirtd;

2. we are doing all this work in an environment that, with swap disabled, is not prone to the issues memory locking protects against in the first place.

I believe we would benefit from one of the following features on the libvirt side (or both):

a) expose the memory lock value calculated by libvirtd through the libvirt API, so that we can use it when calling prlimit() on the libvirtd process;

b) allow disabling setrlimit() calls via a libvirtd config file knob or the domain definition.

Do you think it would be acceptable to have one of these enhancements in libvirtd, or perhaps both, for degenerate cases like KubeVirt?

Thanks for your attention,
Ihar
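
P.S. For illustration, the privileged helper's prlimit() call boils down to something like the following (a simplified Go sketch using golang.org/x/sys/unix; the limit value and argument handling here are made up for the example, the real code and the conservative estimation logic are in the PR linked above):

// prlimit sketch: illustrative only, not the actual KubeVirt code.
// A privileged helper raises RLIMIT_MEMLOCK on an already-running libvirtd
// process; libvirtd can then lower the limit itself with setrlimit()
// without needing CAP_SYS_RESOURCE.
package main

import (
	"fmt"
	"os"
	"strconv"

	"golang.org/x/sys/unix"
)

func main() {
	// PID of the libvirtd process, passed on the command line (hypothetical).
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "usage: prlimit-sketch <libvirtd-pid>")
		os.Exit(1)
	}

	// Conservative upper bound; purely illustrative value. The real formula
	// (guest memory plus overhead for attached devices) lives inside libvirt.
	const memlockBytes = 8 * 1024 * 1024 * 1024 // 8 GiB

	limit := unix.Rlimit{Cur: memlockBytes, Max: memlockBytes}
	// prlimit(2) sets the limit on another process; CAP_SYS_RESOURCE is
	// required in the *caller* (the privileged component), not in libvirtd.
	if err := unix.Prlimit(pid, unix.RLIMIT_MEMLOCK, &limit, nil); err != nil {
		fmt.Fprintln(os.Stderr, "prlimit failed:", err)
		os.Exit(1)
	}
}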