Re: [RFC PATCH kernel] vfio-pci: Allow mapping of MSI-X table to userspace

Alex Williamson <alex.williamson@xxxxxxxxxx> · Thu, 29 Jun 2017 14:42:09 -0600

On Fri, 30 Jun 2017 00:50:57 +1000
Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:

> At the moment VFIO does not allow mapping MSIX BAR to the userspace
> and provides information about what can be mapped via a PCI region
> capability. This ensures that the userspace does not get access
> to MSIX vector table and cannot break MSIX setup. For most platforms
> the userspace also chooses not to map MSIX as it has to emulate it for
> fully virtualized guests.
> 
> This creates a problem when an MSIX BAR is not aligned to a system
> page size and in order to prevent mapping, chunks before and/or after
> the MSIX BAR can't also be mapped to the userspace. If these chunks
> happen to be used very often (for example, Emulex LightPulse FC),
> we see half of expected performance.
> 
> This enables mapping of MSIX BARs to the userspace by advertising
> these BARs as a single chunk (as any other BAR). The userspace
> still is expected not to mmap an MSIX BAR if it want to emulate it
> (QEMU still has this code).
> 
> This does not break protection as at the moment for the Type1 IOMMU,
> vfio_iommu_type1_attach_group() hook fails if MSIX remapping
> is not enabled and this ensures that even ill-behaved device cannot
> harm the host. For sPAPR IOMMU, the hardware always provides such
> isolation by default. The only case when such mapping protection
> is rather needed is allow_unsafe_interrupts but as its name suggests,
> it is not safe anyway.

I don't think allow_unsafe_interrupts is sufficiently un-safe to
entirely disregard it like this either.  It's currently an opt-in, but
we don't consider it unsafe enough that we taint the kernel as a result
of using it.  Allowing user write access to the MSI-X vector table when
there's no IOMMU protection upgrades an MSI injection attack from
somewhat involved to almost trivially easy.  So nak to that.

> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>
> 
> ---
> 
> This is to replace 6 patches from Yongji
> "[RESEND PATCH v2 0/6] vfio-pci: Add support for mmapping MSI-X table"
> 
> Previously we tried to add a MSI_REMAP flag to a PCI bus and set it
> via various means for various platforms. It relied on
> MSI_FLAG_IRQ_REMAPPING from msi_domain_info which never made it to
> the upstream. Instead Type1 IOMMU learned to check remapping
> capabilities and refuse to work and it seems like a good idea to use
> this.

Type1 has always known how to do this, but interrupt remapping is a
property on the IOMMU, not the bus, which is why Yongji was trying to
create a PCI bus property to expose to the device.

> How much do we care about allow_unsafe_interrupts? If we care enough,
> I'll need a way to pass the msi_remap flag from
> vfio_iommu_type1_attach_group() down to vfio_pci_bar_rw/vfio_pci_mmap,
> it does not seem easy to make it nice.

As above, I think you're abusing an opt-in security concern as an
excuse to allow an almost trivially easy exploit.  I also prefer
Yongji's approach of exposing the capability through bus properties
rather than creating a back channel between vfio modules as you seem to
suggest.

> What else did I miss?
> 
> Please comment. Thanks.
> ---
>  drivers/vfio/pci/vfio_pci.c      | 41 ++++------------------------------------
>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 -----
>  2 files changed, 4 insertions(+), 42 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 324c52e3a1a4..222bc07b9e41 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -559,15 +559,8 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>  				struct vfio_info_cap *caps)
>  {
>  	struct vfio_region_info_cap_sparse_mmap *sparse;
> -	size_t end, size;
> -	int nr_areas = 2, i = 0, ret;
> -
> -	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
> -
> -	/* If MSI-X table is aligned to the start or end, only one area */
> -	if (((vdev->msix_offset & PAGE_MASK) == 0) ||
> -	    (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
> -		nr_areas = 1;
> +	size_t size;
> +	int nr_areas = 1, i = 0, ret;
>  
>  	size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
>  
> @@ -577,18 +570,8 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>  
>  	sparse->nr_areas = nr_areas;
>  
> -	if (vdev->msix_offset & PAGE_MASK) {
> -		sparse->areas[i].offset = 0;
> -		sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
> -		i++;
> -	}
> -
> -	if (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) < end) {
> -		sparse->areas[i].offset = PAGE_ALIGN(vdev->msix_offset +
> -						     vdev->msix_size);
> -		sparse->areas[i].size = end - sparse->areas[i].offset;
> -		i++;
> -	}
> +	sparse->areas[i].offset = 0;
> +	sparse->areas[i].size = pci_resource_len(vdev->pdev, vdev->msix_bar);
>  
>  	ret = vfio_info_add_capability(caps, VFIO_REGION_INFO_CAP_SPARSE_MMAP,
>  				       sparse);

This doesn't make any sense to me, why even expose it as a sparse mmap
capability if there's nothing sparse about it?

> @@ -1115,22 +1098,6 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  	if (req_start + req_len > phys_len)
>  		return -EINVAL;
>  
> -	if (index == vdev->msix_bar) {
> -		/*
> -		 * Disallow mmaps overlapping the MSI-X table; users don't
> -		 * get to touch this directly.  We could find somewhere
> -		 * else to map the overlap, but page granularity is only
> -		 * a recommendation, not a requirement, so the user needs
> -		 * to know which bits are real.  Requiring them to mmap
> -		 * around the table makes that clear.
> -		 */
> -
> -		/* If neither entirely above nor below, then it overlaps */
> -		if (!(req_start >= vdev->msix_offset + vdev->msix_size ||
> -		      req_start + req_len <= vdev->msix_offset))
> -			return -EINVAL;
> -	}
> -
>  	/*
>  	 * Even though we don't make use of the barmap for the mmap,
>  	 * we need to request the region and the barmap tracks that.
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 357243d76f10..916b72c5c4fc 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -164,11 +164,6 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
>  	} else
>  		io = vdev->barmap[bar];
>  
> -	if (bar == vdev->msix_bar) {
> -		x_start = vdev->msix_offset;
> -		x_end = vdev->msix_offset + vdev->msix_size;
> -	}
> -
>  	done = do_io_rw(io, buf, pos, count, x_start, x_end, iswrite);
>  
>  	if (done >= 0)

Nope, I prefer Yongji's approach, sorry.  Thanks,

Alex