Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface

Jason Gunthorpe <jgg@xxxxxxxxxx> · Fri, 21 Oct 2022 16:56:47 -0300

On Fri, Sep 23, 2022 at 11:40:51AM -0400, Laine Stump wrote:
> It's been a few years, but my recollection is that before starting a
> libvirtd that will run a guest with a vfio device, a privileged process
> needs to
> 
> 1) increase the locked memory limit for the user that will be running qemu
> (eg. by adding a file with the increased limit to /etc/security/limits.d)
> 
> 2) bind the device to the vfio-pci driver, and
> 
> 3) chown /dev/vfio/$iommu_group to the user running qemu.

Here is what is going on to resolve this:

1) iommufd internally supports two ways to account locked memory against
   the rlimit: the VFIO way and the io_uring way. Each FD operates in its
   own mode.

   When /dev/iommu is opened, the FD defaults to the io_uring way; when
   /dev/vfio/vfio is opened, it uses the VFIO way. This means
   /dev/vfio/vfio cannot simply be a symlink to /dev/iommu, so there is
   now a new kconfig option that makes iommufd directly provide the
   /dev/vfio/vfio miscdev.

2) There is an ioctl, IOMMU_OPTION_RLIMIT_MODE, that allows a
   privileged user to query or set the mode the FD runs in.

   The idea is that libvirt will open iommufd, set VFIO compat mode as
   its first action, and then pass the FD to qemu, so that qemu operates
   in the correct sandbox.

3) We are working on a cgroup for FOLL_LONGTERM pins. It is a big job,
   but it should provide a comprehensive resolution to this problem
   across the kernel and improve qemu sandbox security.

   This is still TBD, but most likely, once the cgroup supports this,
   libvirt would set the rlimit to unlimited and then set the new mlock
   and FOLL_LONGTERM cgroup limits to create the sandbox.
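
The libvirt-side flow from item 2 (open the iommufd, switch it to VFIO
compat accounting, then hand the FD to qemu) can be sketched in C as
follows. This is a sketch under stated assumptions: the mode switch
assumes the uapi as later merged upstream (the IOMMU_OPTION ioctl with
option_id IOMMU_OPTION_RLIMIT_MODE, where val64 = 1 selects
process-based, VFIO-style accounting) and is compiled only when the
installed <linux/iommufd.h> provides it. The helper names
set_vfio_compat_mode(), send_fd() and recv_fd() are hypothetical; the
FD hand-off itself is ordinary SCM_RIGHTS passing over an AF_UNIX
socket.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Build the iommufd part only if the installed uapi header provides it. */
#if defined(__has_include)
#if __has_include(<linux/iommufd.h>)
#include <linux/iommufd.h>
#include <sys/ioctl.h>
#endif
#endif

/*
 * Privileged (libvirt) side: switch a freshly opened /dev/iommu FD to
 * VFIO-style, per-process accounting. Returns 0 or a negative errno.
 */
static int set_vfio_compat_mode(int iommufd)
{
#ifdef IOMMU_OPTION
	struct iommu_option opt = {
		.size = sizeof(opt),
		.option_id = IOMMU_OPTION_RLIMIT_MODE,
		.op = IOMMU_OPTION_OP_SET,
		.object_id = 0, /* not tied to any iommufd object */
		.val64 = 1,     /* assumed: 1 = process-based, 0 = user-based */
	};
	return ioctl(iommufd, IOMMU_OPTION, &opt) ? -errno : 0;
#else
	(void)iommufd;
	return -ENOSYS; /* header too old: option not available */
#endif
}

/* Send one FD over a connected AF_UNIX socket using SCM_RIGHTS. */
static int send_fd(int sock, int fd)
{
	char byte = 0; /* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align; /* forces correct alignment of buf */
	} u;
	memset(&u, 0, sizeof(u));
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *c = CMSG_FIRSTHDR(&msg);

	assert(fd >= 0);
	c->cmsg_level = SOL_SOCKET;
	c->cmsg_type = SCM_RIGHTS;
	c->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(c), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Unprivileged (qemu) side: receive the FD; returns it, or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
	if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
		return -1;
	int fd;
	memcpy(&fd, CMSG_DATA(c), sizeof(int));
	return fd;
}
```

In the real flow the FD would come from open("/dev/iommu", O_RDWR) in
libvirt and recv_fd() would run in qemu; both ends can be exercised in a
single process with socketpair(). Setting the mode before handing the FD
over means qemu itself never needs the privilege the option requires.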

Jason