Re: [PATCH v8 3/6] vfio iommu: Add support for mediated devices

On 10/12/2016 3:36 AM, Alex Williamson wrote:
> On Tue, 11 Oct 2016 01:58:34 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>
...


>> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
>> +{
>> +	struct vfio_device *device;
>> +	struct vfio_group *group;
>> +	int ret;
>> +
>> +	device = vfio_device_get_from_dev(dev);
>
> Note how this does dev->iommu_group->vfio_group->vfio_device and then
> we back out one level to get the vfio_group, it's not a terribly
> lightweight path.  Perhaps we should have:
>
> struct vfio_group *vfio_group_get_from_dev(struct device *dev)
> {
>         struct iommu_group *iommu_group;
>         struct vfio_group *group;
>
>         iommu_group = iommu_group_get(dev);
>         if (!iommu_group)
>                 return NULL;
>
>         group = vfio_group_get_from_iommu(iommu_group);
> 	iommu_group_put(iommu_group);
>
> 	return group;
> }
>
> vfio_device_get_from_dev() would make use of this.
>
> Then create a separate:
>
> static int vfio_group_add_container_user(struct vfio_group *group)
> {
>
>> +	if (!atomic_inc_not_zero(&group->container_users)) {
> 		return -EINVAL;
>> +	}
>> +
>> +	if (group->noiommu) {
>> +		atomic_dec(&group->container_users);
> 		return -EPERM;
>> +	}
>> +
>> +	if (!group->container->iommu_driver ||
>> +	    !vfio_group_viable(group)) {
>> +		atomic_dec(&group->container_users);
> 		return -EINVAL;
>> +	}
>> +
> 	return 0;
> }
>
> vfio_group_get_external_user() would be updated to use this.  In fact,
> creating these two functions and updating the existing code to use
> these should be a separate patch.
>

Ok. I'll update.
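For that separate patch, vfio_group_get_external_user() would then reduce to
something like this (untested sketch; assumes the existing vfio_group_fops
and vfio_group_get()):

struct vfio_group *vfio_group_get_external_user(struct file *filep)
{
	struct vfio_group *group = filep->private_data;
	int ret;

	if (filep->f_op != &vfio_group_fops)
		return ERR_PTR(-EINVAL);

	/* take a container_users reference, with noiommu/viability checks */
	ret = vfio_group_add_container_user(group);
	if (ret)
		return ERR_PTR(ret);

	vfio_group_get(group);

	return group;
}

and vfio_device_get_from_dev() would call vfio_group_get_from_dev() instead
of open-coding the iommu_group lookup.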


> Note that your version did not hold a group reference while doing the
> pin/unpin operations below, which seems like a bug.
>

container->group_lock is held for pin/unpin. I think we then don't need to
hold a reference to the group, because groups are attached and detached
while holding this lock, right?


>> +
>> +err_ret:
>> +	vfio_device_put(device);
>> +	return ERR_PTR(ret);
>> +}
>> +
>> +/*
>> + * Pin a set of guest PFNs and return their associated host PFNs for local
>> + * domain only.
>> + * @dev [in] : device
>> + * @user_pfn [in]: array of user/guest PFNs
>> + * @npage [in]: count of array elements
>> + * @prot [in] : protection flags
>> + * @phys_pfn[out] : array of host PFNs
>> + */
>> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
>> +		    long npage, int prot, unsigned long *phys_pfn)
>> +{
>> +	struct vfio_container *container;
>> +	struct vfio_group *group;
>> +	struct vfio_iommu_driver *driver;
>> +	ssize_t ret = -EINVAL;
>> +
>> +	if (!dev || !user_pfn || !phys_pfn)
>> +		return -EINVAL;
>> +
>> +	group = vfio_group_from_dev(dev);
>> +	if (IS_ERR(group))
>> +		return PTR_ERR(group);
>
> As suggested above:
>
> 	group = vfio_group_get_from_dev(dev);
> 	if (!group)
> 		return -ENODEV;
>
> 	ret = vfio_group_add_container_user(group);
> 	if (ret) {
> 		vfio_group_put(group);
> 		return ret;
> 	}
>

Ok.


>> +
>> +	container = group->container;
>> +	if (IS_ERR(container))
>> +		return PTR_ERR(container);
>> +
>> +	down_read(&container->group_lock);
>> +
>> +	driver = container->iommu_driver;
>> +	if (likely(driver && driver->ops->pin_pages))
>> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
>> +					     npage, prot, phys_pfn);
>> +
>> +	up_read(&container->group_lock);
>> +	vfio_group_try_dissolve_container(group);
>
> Even if you're considering that the container_user reference holds the
> driver, I think we need a group reference throughout this and this
> should end with a vfio_group_put(group);
>

As I mentioned above, container->group_lock is held here.

...

>> +
>> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
>> +					 long npage)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain = NULL;
>> +	long unlocked = 0;
>> +	int i;
>> +
>> +	if (!iommu || !pfn)
>> +		return -EINVAL;
>> +
>
> We need iommu->lock here, right?
>

Oh, yes.

>> +	domain = iommu->local_domain;
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_pfn *p;
>> +
>> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +		/* verify if pfn exist in pfn_list */
>> +		p = vfio_find_pfn(domain, pfn[i]);
>> +		if (p)
>> +			unlocked += vfio_unpin_pfn(domain, p, true);
>> +
>> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>
> We hold this mutex outside the loop in the pin unwind case, why is it
> different here?
>

pin_unwind is the error path, so it should be done in one go. This is not an
error case; holding the lock for a long time could block other threads when
multiple threads are unpinning concurrently.



>> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
>> +			    size_t map_size)
>> +{
>> +	dma_addr_t iova = dma->iova;
>> +	unsigned long vaddr = dma->vaddr;
>> +	size_t size = map_size, dma_size = 0;
>> +	long npage;
>> +	unsigned long pfn;
>> +	int ret = 0;
>> +
>> +	while (size) {
>> +		/* Pin a contiguous chunk of memory */
>> +		npage = __vfio_pin_pages_remote(vaddr + dma_size,
>> +						size >> PAGE_SHIFT, dma->prot,
>> +						&pfn);
>> +		if (npage <= 0) {
>> +			WARN_ON(!npage);
>> +			ret = (int)npage;
>> +			break;
>> +		}
>> +
>> +		/* Map it! */
>> +		ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
>> +				     dma->prot);
>> +		if (ret) {
>> +			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
>> +			break;
>> +		}
>> +
>> +		size -= npage << PAGE_SHIFT;
>> +		dma_size += npage << PAGE_SHIFT;
>> +	}
>> +
>> +	if (ret)
>> +		vfio_remove_dma(iommu, dma);
>
>
> There's a bug introduced here, vfio_remove_dma() needs dma->size to be
> accurate to the point of failure, it's not updated until the success
> branch below, so it's never going to unmap/unpin anything.
>

Oops, yes. I'll fix this.
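Roughly like this (untested), updating dma->size as each chunk is mapped so
that vfio_remove_dma() can unwind up to the point of failure:

static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
			    size_t map_size)
{
	dma_addr_t iova = dma->iova;
	unsigned long vaddr = dma->vaddr;
	size_t size = map_size;
	long npage;
	unsigned long pfn;
	int ret = 0;

	while (size) {
		/* Pin a contiguous chunk of memory */
		npage = __vfio_pin_pages_remote(vaddr + dma->size,
						size >> PAGE_SHIFT, dma->prot,
						&pfn);
		if (npage <= 0) {
			WARN_ON(!npage);
			ret = (int)npage;
			break;
		}

		/* Map it! */
		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
				     dma->prot);
		if (ret) {
			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
			break;
		}

		size -= npage << PAGE_SHIFT;

		/* keep dma->size current so a failure unwinds correctly */
		dma->size += npage << PAGE_SHIFT;
	}

	if (ret)
		vfio_remove_dma(iommu, dma);
	else {
		dma->iommu_mapped = true;
		vfio_update_accounting(iommu, dma);
	}

	return ret;
}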

>> +	else {
>> +		dma->size = dma_size;
>> +		dma->iommu_mapped = true;
>> +		vfio_update_accounting(iommu, dma);
>
> I'm confused how this works, when called from vfio_dma_do_map() we're
> populating a vfio_dma, that is we're populating part of the iova space
> of the device.  How could we have pinned pfns in the local address
> space that overlap that?  It would be invalid to have such pinned pfns
> since that part of the iova space was not previously mapped.
>
> Another issue is that if there were existing overlaps, userspace would
> need to have locked memory limits sufficient for this temporary double
> accounting.  I'm not sure how they'd come up with heuristics to handle
> that since we're potentially looking at the bulk of VM memory in a
> single vfio_dma entry.
>

I see that when QEMU boots a VM where a vGPU device is attached first and a
pass-through device is attached later, pinning and iommu_map() are skipped
on the first call to vfio_dma_do_map(). When the pass-through device is
attached, all mappings are unmapped and vfio_dma_do_map() is called again.
At that point an IOMMU-capable domain is present, so pinning and iommu_map()
are done for all of system memory. If the vendor driver pins any pages in
between these two device attaches, the accounting for those pages should be
updated.


>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  			   struct vfio_iommu_type1_dma_map *map)
>>  {
>>  	dma_addr_t iova = map->iova;
>>  	unsigned long vaddr = map->vaddr;
>>  	size_t size = map->size;
>> -	long npage;
>>  	int ret = 0, prot = 0;
>>  	uint64_t mask;
>>  	struct vfio_dma *dma;
>> -	unsigned long pfn;
>>
>>  	/* Verify that none of our __u64 fields overflow */
>>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>> @@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	/* Insert zero-sized and grow as we map chunks of it */
>>  	vfio_link_dma(iommu, dma);
>>
>> -	while (size) {
>> -		/* Pin a contiguous chunk of memory */
>> -		npage = vfio_pin_pages(vaddr + dma->size,
>> -				       size >> PAGE_SHIFT, prot, &pfn);
>> -		if (npage <= 0) {
>> -			WARN_ON(!npage);
>> -			ret = (int)npage;
>> -			break;
>> -		}
>> -
>> -		/* Map it! */
>> -		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
>> -		if (ret) {
>> -			vfio_unpin_pages(pfn, npage, prot, true);
>> -			break;
>> -		}
>> -
>> -		size -= npage << PAGE_SHIFT;
>> -		dma->size += npage << PAGE_SHIFT;
>> -	}
>> -
>> -	if (ret)
>> -		vfio_remove_dma(iommu, dma);
>> +	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
>> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
>> +		dma->size = size;
>> +	else
>> +		ret = vfio_pin_map_dma(iommu, dma, size);
>>
>>  	mutex_unlock(&iommu->lock);
>>  	return ret;
>> @@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>  	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>>  	n = rb_first(&iommu->dma_list);
>>
>> -	/* If there's not a domain, there better not be any mappings */
>> -	if (WARN_ON(n && !d))
>> -		return -EINVAL;
>> -
>>  	for (; n; n = rb_next(n)) {
>>  		struct vfio_dma *dma;
>>  		dma_addr_t iova;
>> @@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>  		iova = dma->iova;
>>
>>  		while (iova < dma->iova + dma->size) {
>> -			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
>> +			phys_addr_t phys;
>>  			size_t size;
>>
>> -			if (WARN_ON(!phys)) {
>> -				iova += PAGE_SIZE;
>> -				continue;
>> -			}
>> +			if (dma->iommu_mapped) {
>> +				phys = iommu_iova_to_phys(d->domain, iova);
>> +
>> +				if (WARN_ON(!phys)) {
>> +					iova += PAGE_SIZE;
>> +					continue;
>> +				}
>>
>> -			size = PAGE_SIZE;
>> +				size = PAGE_SIZE;
>>
>> -			while (iova + size < dma->iova + dma->size &&
>> -			       phys + size == iommu_iova_to_phys(d->domain,
>> +				while (iova + size < dma->iova + dma->size &&
>> +				    phys + size == iommu_iova_to_phys(d->domain,
>>  								 iova + size))
>> -				size += PAGE_SIZE;
>> +					size += PAGE_SIZE;
>> +			} else {
>> +				unsigned long pfn;
>> +				unsigned long vaddr = dma->vaddr +
>> +						     (iova - dma->iova);
>> +				size_t n = dma->iova + dma->size - iova;
>> +				long npage;
>> +
>> +				npage = __vfio_pin_pages_remote(vaddr,
>> +								n >> PAGE_SHIFT,
>> +								dma->prot,
>> +								&pfn);
>> +				if (npage <= 0) {
>> +					WARN_ON(!npage);
>> +					ret = (int)npage;
>> +					return ret;
>> +				}
>> +
>> +				phys = pfn << PAGE_SHIFT;
>> +				size = npage << PAGE_SHIFT;
>> +			}
>>
>>  			ret = iommu_map(domain->domain, iova, phys,
>>  					size, dma->prot | domain->prot);
>> @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>
>>  			iova += size;
>>  		}
>> +
>> +		if (!dma->iommu_mapped) {
>> +			dma->iommu_mapped = true;
>> +			vfio_update_accounting(iommu, dma);
>> +		}
>
> This is the case where we potentially have pinned pfns and we've added
> an iommu mapped device and need to adjust accounting.  But we've fully
> pinned and accounted the entire iommu mapped space while still holding
> the accounting for any pfn mapped space.  So for a time, assuming some
> pfn pinned pages, we have duplicate accounting.  How does userspace
> deal with that?  For instance, if I'm using an mdev device where the
> vendor driver has pinned 512MB of guest memory, then I hot-add an
> assigned NIC and the entire VM address space gets pinned, that pinning
> will fail unless my locked memory limits are at least 512MB in excess
> of my VM size.  Additionally, the user doesn't know how much memory the
> vendor driver is going to pin, it might be the whole VM address space,
> so the user would need 2x the locked memory limits.
>

Is RLIMIT_MEMLOCK really set that low? But I get your point. I'll update
__vfio_pin_pages_remote() to check, within __vfio_pin_pages_remote() itself,
whether a page being pinned has already been accounted.
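Roughly, with a hypothetical helper (names are illustrative only) that looks
up the local domain's pfn_list before charging a page against the locked
memory limit:

/*
 * Return true if the vendor driver already pinned (and accounted) this pfn
 * through the local domain, so __vfio_pin_pages_remote() should not charge
 * it against RLIMIT_MEMLOCK a second time.
 */
static bool vfio_pfn_already_accounted(struct vfio_domain *domain,
				       unsigned long pfn)
{
	struct vfio_pfn *p;

	if (!domain)
		return false;

	mutex_lock(&domain->local_addr_space->pfn_list_lock);
	p = vfio_find_pfn(domain, pfn);
	mutex_unlock(&domain->local_addr_space->pfn_list_lock);

	return p != NULL;
}

__vfio_pin_pages_remote() would need the local domain (or the iommu) passed
in to do this lookup.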


>>  	}
>>
>>  	return 0;
>> @@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>>  	__free_pages(pages, order);
>>  }
>>
>> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
>> +				   struct iommu_group *iommu_group)
>> +{
>> +	struct vfio_group *g;
>> +
>> +	list_for_each_entry(g, &domain->group_list, next) {
>> +		if (g->iommu_group == iommu_group)
>> +			return g;
>> +	}
>> +
>> +	return NULL;
>> +}
>
> It would make review easier if changes like splitting this into a
> separate function with no functional change on the calling path could
> be a separate patch.
>

OK.

Thanks,
Kirti
