Re: [PATCH kernel] vfio-pci: Allow mapping MSIX BAR

Alexey Kardashevskiy <aik@xxxxxxxxx> · Tue, 12 Dec 2017 13:00:48 +1100

On 12/12/17 11:09, Alex Williamson wrote:
> On Fri,  8 Dec 2017 15:18:29 +1100
> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
> 
>> By default VFIO disables mapping of MSIX BAR to the userspace as
>> the userspace may program it in a way allowing spurious interrupts;
>> instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl.
>> In order to eliminate guessing from the userspace about what is
>> mmapable, VFIO also advertises a sparse list of regions allowed to mmap.
>>
>> This works fine as long as the system page size equals to the MSIX
>> alignment requirement which is 4KB. However with a bigger page size
>> the existing code prohibits mapping non-MSIX parts of a page with MSIX
>> structures so these parts have to be emulated via slow reads/writes on
>> a VFIO device fd. If these emulated bits are accessed often, this has
>> serious impact on performance.
>>
>> This allows mmap of the entire BAR containing MSIX vector table.
>>
>> This removes the sparse capability for PCI devices as it becomes useless.
>>
>> As the userspace needs to know for sure whether mmapping of the MSIX
>> vector containing data can succeed, this adds a new capability -
>> VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - which explicitly tells the userspace
>> that the entire BAR can be mmapped.
>>
>> This does not touch the MSIX mangling in the BAR read/write handlers as
>> we are doing this just to enable direct access to non MSIX registers.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>
>> ---
>>  include/uapi/linux/vfio.h   | 10 +++++++
>>  drivers/vfio/pci/vfio_pci.c | 65 +++++++--------------------------------------
>>  drivers/vfio/vfio.c         |  9 +++++++
>>  3 files changed, 29 insertions(+), 55 deletions(-)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index ae46105..0fb4407 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -300,6 +300,16 @@ struct vfio_region_info_cap_type {
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>  
>> +/*
>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>> + * which allows direct access to non-MSIX registers which happened to be within
>> + * the same system page.
>> + *
>> + * Even though the userspace gets direct access to the MSIX data, the existing
>> + * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
>> + */
>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
>> +
>>  /**
>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>>   *				    struct vfio_irq_info)
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index a98681d..9bcc1e1 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -565,46 +565,16 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
>>  	return walk.ret;
>>  }
>>  
>> -static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>> -				struct vfio_info_cap *caps)
>> +static int msix_sparse_msix_mmappable_cap(struct vfio_pci_device *vdev,
>> +		struct vfio_info_cap *caps)
>>  {
>> -	struct vfio_region_info_cap_sparse_mmap *sparse;
>> -	size_t end, size;
>> -	int nr_areas = 2, i = 0, ret;
>> +	struct vfio_info_cap_header header = {
>> +		.id = VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
>> +		.version = 1
>> +	};
>>  
>> -	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
>> -
>> -	/* If MSI-X table is aligned to the start or end, only one area */
>> -	if (((vdev->msix_offset & PAGE_MASK) == 0) ||
>> -	    (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
>> -		nr_areas = 1;
>> -
>> -	size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
>> -
>> -	sparse = kzalloc(size, GFP_KERNEL);
>> -	if (!sparse)
>> -		return -ENOMEM;
>> -
>> -	sparse->nr_areas = nr_areas;
>> -
>> -	if (vdev->msix_offset & PAGE_MASK) {
>> -		sparse->areas[i].offset = 0;
>> -		sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
>> -		i++;
>> -	}
>> -
>> -	if (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) < end) {
>> -		sparse->areas[i].offset = PAGE_ALIGN(vdev->msix_offset +
>> -						     vdev->msix_size);
>> -		sparse->areas[i].size = end - sparse->areas[i].offset;
>> -		i++;
>> -	}
>> -
>> -	ret = vfio_info_add_capability(caps, VFIO_REGION_INFO_CAP_SPARSE_MMAP,
>> -				       sparse);
>> -	kfree(sparse);
>> -
>> -	return ret;
>> +	return vfio_info_add_capability(caps,
>> +			VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, &header);
>>  }
>>  
>>  int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
>> @@ -695,7 +665,8 @@ static long vfio_pci_ioctl(void *device_data,
>>  			if (vdev->bar_mmap_supported[info.index]) {
>>  				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
>>  				if (info.index == vdev->msix_bar) {
>> -					ret = msix_sparse_mmap_cap(vdev, &caps);
>> +					ret = msix_sparse_msix_mmappable_cap(
>> +							vdev, &caps);
>>  					if (ret)
>>  						return ret;
>>  				}
>> @@ -1125,22 +1096,6 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>>  	if (req_start + req_len > phys_len)
>>  		return -EINVAL;
>>  
>> -	if (index == vdev->msix_bar) {
>> -		/*
>> -		 * Disallow mmaps overlapping the MSI-X table; users don't
>> -		 * get to touch this directly.  We could find somewhere
>> -		 * else to map the overlap, but page granularity is only
>> -		 * a recommendation, not a requirement, so the user needs
>> -		 * to know which bits are real.  Requiring them to mmap
>> -		 * around the table makes that clear.
>> -		 */
>> -
>> -		/* If neither entirely above nor below, then it overlaps */
>> -		if (!(req_start >= vdev->msix_offset + vdev->msix_size ||
>> -		      req_start + req_len <= vdev->msix_offset))
>> -			return -EINVAL;
>> -	}
>> -
>>  	/*
>>  	 * Even though we don't make use of the barmap for the mmap,
>>  	 * we need to request the region and the barmap tracks that.
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index f5a86f6..376b4e7 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -1898,6 +1898,7 @@ int vfio_info_add_capability(struct vfio_info_cap *caps, int cap_type_id,
>>  			     void *cap_type)
>>  {
>>  	int ret = -EINVAL;
>> +	struct vfio_info_cap_header *header, *cap;
>>  
>>  	if (!cap_type)
>>  		return 0;
>> @@ -1910,6 +1911,14 @@ int vfio_info_add_capability(struct vfio_info_cap *caps, int cap_type_id,
>>  	case VFIO_REGION_INFO_CAP_TYPE:
>>  		ret = region_type_cap(caps, cap_type);
>>  		break;
>> +
>> +	case VFIO_REGION_INFO_CAP_MSIX_MAPPABLE:
>> +		cap = cap_type;
>> +		header = vfio_info_cap_add(caps, sizeof(*header), cap->id,
>> +				cap->version);
>> +		if (IS_ERR(header))
>> +			return PTR_ERR(header);
>> +		return 0;
>>  	}
>>  
>>  	return ret;
> 
> Why does this capability need to use the interface differently than the
> existing callers?  The existing users pass a third arg with the data
> structure, minus the header, filled in. 

The third arg points to the entire capability including the
vfio_info_cap_header, why minus (I read "minus" as if
vfio_region_info_cap_type did not have header embedded into it)? The header
is not initialized though.

> Logically, for a capability
> which is only a header, the third arg could be NULL.  We could then
> have a function:
> 
> int header_cap(struct vfio_info_cap *caps,
> 	       int cap_type_id, int cap_type_ver)

This is what I miss in the existing design - the caller of
vfio_info_add_capability passes the entire capability structure including
the id and version fields but it also passes the id separately and hard
codes the version to 1 in all current handlers, and overwrites the header
values. It should either use the passed header or take a version explicitly
imho, my try was to use the header but yeah, the cumbersome one.

Why was the version left out in the first place? Or a header not used (I
assume this is because we still to have initialize "next" so it would be
incomplete anyway so just forget that there is a header)?

> which does what you're doing here but plays nicely and sets ret and
> breaks out to the common return.  Please also post the QEMU changes so
> we can see the end to end changes.  Thanks,

I will post it but since we are simply enabling mmap, it is not much to
look at there...

-- 
Alexey