On 11/14/22 9:23 AM, Jason Gunthorpe wrote:
> On Thu, Nov 10, 2022 at 10:01:13PM -0500, Matthew Rosato wrote:
>> On 11/7/22 7:52 PM, Jason Gunthorpe wrote:
>>> This series provides an alternative container layer for VFIO implemented
>>> using iommufd. This is optional; if CONFIG_IOMMUFD is not set then it will
>>> not be compiled in.
>>>
>>> At this point iommufd can be injected by passing an iommufd FD to
>>> VFIO_GROUP_SET_CONTAINER, which will use the VFIO compat layer in iommufd
>>> to obtain the compat IOAS and then connect up all the VFIO drivers as
>>> appropriate.
>>>
>>> This is a temporary stopping point; a following series will provide a way
>>> to directly open a VFIO device FD and directly connect it to IOMMUFD using
>>> native ioctls that can expose the IOMMUFD features like hwpt, future
>>> vPASID and dynamic attachment.
>>>
>>> This series, in compat mode, has passed all the qemu tests we have
>>> available, including the test suites for the Intel GVT mdev. Aside from
>>> the temporary limitation with P2P memory this is believed to be fully
>>> compatible with VFIO.
>>
>> AFAICT there is no equivalent means to specify
>> vfio_iommu_type1.dma_entry_limit when using iommufd; looks like
>> we'll just always get the default 65535.
>
> No, there is no arbitrary limit on iommufd

Yeah, that's what I suspected. But FWIW, userspace checks the advertised
limit via VFIO_IOMMU_GET_INFO / VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL, and this
is still being advertised as 65535 when using iommufd. I don't think there
is a defined way to return 'ignore this value'.

This should go away later when we bind to iommufd directly, since QEMU
would no longer be sharing the type1 codepath in userspace.

>
>> Was this because you envision the limit being not applicable for
>> iommufd (limits will be enforced via either means and eventually we
>> won't want to) or was it an oversight?
>
> The limit here is primarily about limiting userspace abuse of the
> interface.
>
> iommufd is using GFP_KERNEL_ACCOUNT, which shifts the responsibility to
> cgroups, which is similar to how KVM works.
>
> So, for a VM sandbox you'd set a cgroup limit and if a hostile
> userspace in the sandbox decides to try to OOM the system it will hit
> that limit, regardless of which kernel APIs it tries to abuse.
>
> This work is not entirely complete as we also need the iommu driver to
> use GFP_KERNEL_ACCOUNT for allocations connected to the iommu_domain,
> particularly for allocations of the IO page tables themselves - which
> can be quite big.
>
> Jason
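
FWIW, the userspace check I'm referring to looks roughly like the sketch
below: walk the VFIO_IOMMU_GET_INFO capability chain looking for
VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL. This is not QEMU's actual code -- the
container fd is assumed to be an open, initialized VFIO container and
error handling is elided:

    /*
     * Minimal sketch: query the advertised DMA mapping limit by walking
     * the VFIO_IOMMU_GET_INFO capability chain.
     */
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static unsigned int query_dma_avail(int container_fd)
    {
            struct vfio_iommu_type1_info *info;
            struct vfio_info_cap_header *hdr;
            unsigned int avail = 0;
            __u32 argsz = sizeof(*info);

            /* First call only learns the argsz needed for the cap chain */
            info = calloc(1, argsz);
            info->argsz = argsz;
            ioctl(container_fd, VFIO_IOMMU_GET_INFO, info);

            argsz = info->argsz;
            info = realloc(info, argsz);
            memset(info, 0, argsz);
            info->argsz = argsz;
            ioctl(container_fd, VFIO_IOMMU_GET_INFO, info);

            if (!(info->flags & VFIO_IOMMU_INFO_CAPS) || !info->cap_offset)
                    goto out;

            /* Walk the capability chain for the DMA_AVAIL capability */
            hdr = (struct vfio_info_cap_header *)((char *)info + info->cap_offset);
            for (;;) {
                    if (hdr->id == VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL) {
                            struct vfio_iommu_type1_info_dma_avail *cap =
                                    (struct vfio_iommu_type1_info_dma_avail *)hdr;
                            avail = cap->avail;
                            break;
                    }
                    if (!hdr->next)
                            break;
                    hdr = (struct vfio_info_cap_header *)((char *)info + hdr->next);
            }
    out:
            free(info);
            return avail;   /* still reports 65535 on the iommufd compat path */
    }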
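
And for anyone following along, the accounting Jason describes comes down
to the allocation flag: GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT,
so the memory is charged to the calling task's memory cgroup rather than
bounded by a module parameter like type1's dma_entry_limit. A made-up
kernel-side sketch of the pattern (the structure and function names here
are hypothetical, not iommufd's):

    #include <linux/slab.h>

    struct example_mapping {
            struct list_head node;
            unsigned long iova;
            unsigned long length;
    };

    static struct example_mapping *example_mapping_alloc(void)
    {
            /*
             * Charged to current's memcg; a hostile userspace hits its
             * cgroup limit instead of exhausting system memory.
             */
            return kzalloc(sizeof(struct example_mapping), GFP_KERNEL_ACCOUNT);
    }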