VFIO_IOMMU_TYPE1 inefficiencies for certain use cases

David Stevens <stevensd@xxxxxxxxxxxx> · Fri, 18 Feb 2022 19:17:12 +0900

Hi all,

I'm working on a consumer virtualization project that uses VFIO for
passthrough devices. However, the way it's using them is a little
unusual, and that results in some pretty significant inefficiencies in
the vfio_iommu_type1 implementation. Before going ahead and trying to
address the problems myself, I'm hoping to get some guidance about
what sort of changes might be able to be merged upstream.

The usage pattern that is not well supported by VFIO is many small,
dynamic mappings. We have this pattern because we are using
virtio-iommu to isolate some untrusted passthrough devices within the
guest, and also because for the rest of the passthrough devices we're
using coIOMMU [1] to support overcommit of memory in the host by not
pinning all of the guest's memory. Both of these rely on using
VFIO_IOMMU_MAP_DMA at the page granularity. This results in a lot of
metadata overhead from the struct vfio_dma. At 80 bytes of metadata
per page (actually 128 due to rounding in kmalloc), 1-2% of total
system memory can end up being used for VFIO metadata.

First, is this sort of use case something that upstream wants to address?

If it's something that is worth addressing, here are two possible
approaches I've thought of. I haven't implemented either yet, so there
might be details I'm missing, or the API changes or maintenance costs
might not be acceptable. Both are a little bit different from
VFIO_TYPE1_IOMMU, so they would probably require at least a
VFIO_TYPE1v3_IOMMU type.

1) Add an alternative xarray implementation for vfio_iommu.dma_list
that maintains the iova -> vaddr mapping at the page granularity. Most
of the metadata in struct vfio_dma can be packed into the extra bits
in the vaddr. The two exceptions are vfio_dma.task and
vfio_dma.pfn_list. The lack of space for vfio_dma.task could be
addressed by requiring that all mappings have the same task. Without
vfio_dma.pfn_list, we would lose the refcount maintained by struct
vfio_pfn, which means every call to vfio_iommu_type1_pin_pages would
require re-pinning the page. This might be a bit more inefficient,
although it seems like it should be okay from a correctness
standpoint.

One downside of this approach is that it is only more memory efficient
than the rbtree if the mapping is quite dense, since a struct xa_node
is quite a bit larger than a struct vfio_dma. This would help the most
problematic coIOMMU cases, but it would still leave certain
virtio-iommu cases unaddressed. Also, although most of the struct
vfio_dma metadata could be packed into the xarray today, that might no
longer be the case if more metadata was added in the future.

2) A second alternative would be to drop the VFIO metadata altogether
and basically directly expose the iommu APIs (with the required
locking/validation). This would be incompatible with mediated devices,
and it wouldn't be able to support the various bells and whistles of
the VFIO api. However, I think the core mapping/unmapping logic could
still be shared between the normal struct vfio_dma tree and this
approach. Personally, I'm a little more in favor of this one, since it
completely avoids VFIO memory overhead in both of my use cases.

Do either of those approaches sound like something that might work? If
neither is okay, are there any suggestions for approaches to take?

Thanks,
David

[1] https://www.usenix.org/conference/atc20/presentation/tian