Re: RLIMIT_MEMLOCK in container environment

On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
> On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote:
>> On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
>>> Hi all,
>>>
>>> KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes
>>> API resources. In this case, libvirtd is running inside an
>>> unprivileged pod, with some host mounts / capabilities added to the
>>> pod, needed by libvirtd and other services.
>>>
>>> One of the capabilities libvirtd requires for successful startup
>>> inside a pod is SYS_RESOURCE. This capability is used to adjust
>>> RLIMIT_MEMLOCK ulimit value depending on devices attached to the
>>> managed guest, both on startup and during hotplug. AFAIU the need to
>>> lock the memory is to avoid pages being pushed out from RAM into swap.


I recall successfully testing GPU assignment from an unprivileged libvirtd several years ago by setting, in advance, a high enough ulimit for the uid used to run libvirtd. I think we check whether the current setting is high enough, and don't try to set it unless we think we need to.

If I understand you correctly, you're saying that in your case it's okay for the memlock limit to be lower than we try to set it to, because swap is disabled anyway, is that correct?
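The "check first, only set if needed" behaviour described above can be sketched roughly as follows. This is an illustration in Python, not libvirt's actual code; the function names and the 1 GiB figure are made up for the example.

```python
import resource


def limit_is_sufficient(current: int, required: int) -> bool:
    """Return True if `current` (bytes) already covers `required` (bytes)."""
    if current == resource.RLIM_INFINITY:
        return True   # an unlimited setting always suffices
    if required == resource.RLIM_INFINITY:
        return False  # a finite limit never covers "unlimited"
    return current >= required


def memlock_needs_raise(required: int) -> bool:
    """Return True only if this process's soft RLIMIT_MEMLOCK is below `required`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return not limit_is_sufficient(soft, required)


if __name__ == "__main__":
    wanted = 1 << 30  # hypothetical requirement: 1 GiB
    if memlock_needs_raise(wanted):
        print("limit too low; a setrlimit() call would be needed")
    else:
        print("pre-set limit is already high enough; no setrlimit() call")
```

If the limit was raised in advance for the uid (for example through the usual ulimit configuration), the second branch is taken and no privileged call is required.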


>> Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's
>> something in the XML that requires it - one of
>
> You are right, sorry. We add SYS_RESOURCE only for particular domains.
>
>>   - hard limit memory value is present
>>   - host PCI device passthrough is requested
>
> We are using passthrough

(If you want to make Alex happy, use the term "VFIO device assignment" rather than passthrough :-).)

> to pass SR-IOV NIC VFs into guests. We also
> plan to do the same for GPUs in the near future.
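For anyone who wants to predict from the outside whether a given domain falls into one of the two cases listed above, here is a rough Python sketch that inspects the domain XML. It only covers the two conditions named in this thread (a <memtune><hard_limit> element and a PCI <hostdev>); libvirt's real logic may consider more than that, and the function name is invented for the example.

```python
import xml.etree.ElementTree as ET


def domain_may_need_memlock(domain_xml: str) -> bool:
    """Heuristic: True if the domain XML has a hard memory limit or a PCI hostdev."""
    root = ET.fromstring(domain_xml)
    # <memtune><hard_limit unit='KiB'>...</hard_limit></memtune>
    has_hard_limit = root.find("memtune/hard_limit") is not None
    # <devices><hostdev mode='subsystem' type='pci' ...>
    has_pci_hostdev = root.find("devices/hostdev[@type='pci']") is not None
    return has_hard_limit or has_pci_hostdev


PLAIN = "<domain type='kvm'><name>a</name><devices/></domain>"
WITH_VF = (
    "<domain type='kvm'><name>b</name><devices>"
    "<hostdev mode='subsystem' type='pci' managed='yes'>"
    "<source><address domain='0x0000' bus='0x03' slot='0x10' function='0x0'/></source>"
    "</hostdev></devices></domain>"
)

if __name__ == "__main__":
    print(domain_may_need_memlock(PLAIN))
    print(domain_may_need_memlock(WITH_VF))
```

Note that this sketch does not look at <interface type='hostdev'>, which would need its own check.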

>>> I believe we would benefit from one of the following features on
>>> libvirt side (or both):
>>>
>>> a) expose the memory lock value calculated by libvirtd through
>>> libvirt ABI so that we can use it when calling prlimit() on libvirtd
>>> process;
>>> b) allow to disable setrlimit() calls via libvirtd config file knob
>>> or domain definition.
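To make option (a) concrete: a privileged helper outside the pod could apply the limit to the libvirtd process itself, so libvirtd would no longer need SYS_RESOURCE. A minimal Python sketch, assuming Linux and Python's resource.prlimit(). How the helper would learn the required value is exactly the missing piece, so limit_bytes here is a hypothetical input and the function names are invented.

```python
import os
import resource


def read_memlock(pid: int):
    """Return the (soft, hard) RLIMIT_MEMLOCK of process `pid`."""
    return resource.prlimit(pid, resource.RLIMIT_MEMLOCK)


def set_memlock(pid: int, limit_bytes: int):
    """Set soft and hard RLIMIT_MEMLOCK of `pid`; returns the previous pair.

    Raising another process's limit needs CAP_SYS_RESOURCE in the caller,
    otherwise PermissionError is raised.
    """
    return resource.prlimit(pid, resource.RLIMIT_MEMLOCK, (limit_bytes, limit_bytes))


if __name__ == "__main__":
    # Reading our own limit needs no privileges.
    print(read_memlock(os.getpid()))
```

The helper, not libvirtd, would then be the only component holding the capability.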

(b) sounds much more reasonable, as long as qemu doesn't complain (I don't know whether or not it checks).

Slightly related to this - I'm currently working on patches to avoid making any ioctl calls that would fail in an unprivileged libvirtd when using tap/macvtap devices. ATM, I'm doing this by adding an attribute "unmanaged='yes'" to the interface <target> element. The idea is that if someone sets unmanaged='yes', they're stating that the caller (i.e. kubevirt) is responsible for all device setup, and that libvirt should just use it without further setup. A similar approach could be applied to hostdev devices - if unmanaged is set, we assume that the caller has done everything to make the associated device usable.
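For illustration only, this is roughly what a caller would generate under the proposal described above. The unmanaged attribute is the work-in-progress idea from this message, not merged libvirt syntax, and the interface type 'ethernet' and the helper name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_unmanaged_interface(dev_name: str) -> str:
    """Build an <interface> whose <target> carries the proposed unmanaged='yes'."""
    # type='ethernet' is an assumption; the message only says tap/macvtap devices.
    iface = ET.Element("interface", {"type": "ethernet"})
    # The caller (e.g. kubevirt) is expected to have created `dev_name` already.
    ET.SubElement(iface, "target", {"dev": dev_name, "unmanaged": "yes"})
    return ET.tostring(iface, encoding="unicode")


if __name__ == "__main__":
    print(build_unmanaged_interface("blah"))
```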

(Of course this all makes me realize the inanity of adding a <target dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have <hostdev managed='yes'> and <interface type='hostdev' managed='yes'>. So to prevent setting the locklimit for hostdev, would we make a new setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I *hate* trying to make config consistent :-/)

(Alternately, we could just let the attempt to set the lock limit fail gracefully and allow the guest to continue.)
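The graceful-failure alternative amounts to something like the Python sketch below. It is not libvirt code; the function name, the logger name, and the injectable setter parameter (there only so the behaviour can be exercised without privileges) are all invented for the example.

```python
import logging
import resource

log = logging.getLogger("memlock-sketch")


def set_memlock_or_warn(required: int, setter=resource.setrlimit) -> bool:
    """Try to set RLIMIT_MEMLOCK; on failure log a warning and carry on.

    Soft and hard limits are set to the same value for simplicity.
    Returns True if the limit was set, False if the attempt was refused.
    """
    try:
        setter(resource.RLIMIT_MEMLOCK, (required, required))
    except (ValueError, OSError) as exc:
        # Without CAP_SYS_RESOURCE, raising the hard limit is refused.
        log.warning("could not set memlock limit to %d bytes: %s", required, exc)
        return False
    return True


if __name__ == "__main__":
    ok = set_memlock_or_warn(1 << 30)  # hypothetical 1 GiB requirement
    print("limit set" if ok else "continuing with the existing limit")
```

Whether continuing is safe depends on the pre-set limit really being high enough for the guest, which is why the failure is logged rather than silently ignored.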

BTW, I'm guessing that you use <hostdev> to assign the SR-IOV VFs rather than <interface type='hostdev'>, correct? The latter would require that you have enough capabilities to set MAC addresses on the VFs (that's the entire point of using <interface type='hostdev'> instead of plain <hostdev>).

_______________________________________________
libvirt-users mailing list
libvirt-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvirt-users



