On Fri, Sep 23, 2022 at 11:40:51AM -0400, Laine Stump wrote:

> It's been a few years, but my recollection is that before starting a
> libvirtd that will run a guest with a vfio device, a privileged process
> needs to
>
> 1) increase the locked memory limit for the user that will be running qemu
> (eg. by adding a file with the increased limit to /etc/security/limits.d)
>
> 2) bind the device to the vfio-pci driver, and
>
> 3) chown /dev/vfio/$iommu_group to the user running qemu.

Here is what is going on to resolve this:

1) iommufd internally supports two ways to account ulimits, the vfio
   way and the io_uring way. Each FD operates in its own mode.

   When /dev/iommu is opened the FD defaults to the io_uring way; when
   /dev/vfio/vfio is opened it uses the VFIO way. This means
   /dev/vfio/vfio is not a symlink. Instead, there is now a new kconfig
   option that makes iommufd provide that miscdev directly.

2) There is an ioctl, IOMMU_OPTION_RLIMIT_MODE, which allows a
   privileged user to query/set which mode the FD will run in.

   The idea is that libvirt will open iommufd, and its first action
   will be to set vfio compat mode. It will then pass the fd to qemu,
   and qemu will operate in the correct sandbox.

3) We are working on a cgroup for FOLL_LONGTERM. It is a big job, but
   it should prove a comprehensive resolution to this problem across
   the kernel and improve the qemu sandbox security.

   Still TBD, but most likely, once the cgroup supports this, libvirt
   would set the rlimit to unlimited, then set new mlock and
   FOLL_LONGTERM cgroup limits to create the sandbox.

Jason
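
For illustration only, the flow in item 2 (open the iommufd, set the
accounting mode, hand the fd to another process) could be sketched
roughly as below. This is not libvirt or qemu code. The helper names
set_vfio_rlimit_mode(), send_fd() and recv_fd() are hypothetical. The
sketch assumes the option is set through the IOMMU_OPTION ioctl and
struct iommu_option from <linux/iommufd.h>, as in the uAPI that was later
merged upstream, and that the value 1 selects the per-process
(VFIO-compatible) accounting. It also assumes the fd travels over a UNIX
socket with SCM_RIGHTS, which is just one common way to pass an fd.

```c
/* Sketch only: helper names are hypothetical, not libvirt or qemu code. */
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Use the real uAPI header when the installed kernel headers ship it. */
#if defined(__has_include)
#if __has_include(<linux/iommufd.h>)
#include <linux/iommufd.h>
#endif
#endif

/*
 * Ask for VFIO-style rlimit accounting on an iommufd. Assumes the uAPI as
 * merged upstream; the caller needs privilege. Returns 0, or -1 with errno.
 */
static int set_vfio_rlimit_mode(int iommufd)
{
#ifdef IOMMU_OPTION
	struct iommu_option opt;

	memset(&opt, 0, sizeof(opt));
	opt.size = sizeof(opt);
	opt.option_id = IOMMU_OPTION_RLIMIT_MODE;
	opt.op = IOMMU_OPTION_OP_SET;
	opt.val64 = 1; /* assumed: 1 = per-process accounting, as VFIO does */
	return ioctl(iommufd, IOMMU_OPTION, &opt);
#else
	(void)iommufd;
	errno = ENOSYS; /* headers too old to know about iommufd */
	return -1;
#endif
}

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send one fd over a connected AF_UNIX socket. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 0; /* at least one byte of real data must accompany it */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(). Returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd = -1;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

A privileged manager would open /dev/iommu, call set_vfio_rlimit_mode()
on the new fd before anything else, and then send_fd() it to the
unprivileged process, which picks it up with recv_fd(). Because the mode
is a property of the FD, it stays in effect for the receiver.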