On 6/22/2016 9:16 AM, Alex Williamson wrote:
> On Mon, 20 Jun 2016 22:01:48 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>
>>
>>  struct vfio_iommu {
>>      struct list_head    domain_list;
>> +    struct vfio_domain  *mediated_domain;
>
> I'm not really a fan of how this is so often used to special case the
> code...
>
>>      struct mutex        lock;
>>      struct rb_root      dma_list;
>>      bool                v2;
>> @@ -67,6 +69,13 @@ struct vfio_domain {
>>      struct list_head    group_list;
>>      int                 prot;       /* IOMMU_CACHE */
>>      bool                fgsp;       /* Fine-grained super pages */
>> +
>> +    /* Domain for mediated device which is without physical IOMMU */
>> +    bool                mediated_device;
>
> But sometimes we use this to special case the code and other times we
> use domain_list being empty.  I thought the argument against pulling
> code out to a shared file was that this approach could be made
> maintainable.
>

Functions that take struct vfio_domain *domain as an argument and are
intended to operate on that single domain check if
(domain->mediated_device), e.g. map_try_harder(), vfio_iommu_replay() and
vfio_test_domain_fgsp(). These checks could be removed, but then it would
be the caller's responsibility to make sure these functions are never
called for the mediated_domain.

Functions that take struct vfio_iommu *iommu as an argument and traverse
domain_list, either to find a domain or to operate on each domain in the
list, check if (list_empty(&iommu->domain_list)), e.g. vfio_unmap_unpin(),
vfio_iommu_map() and vfio_dma_do_map().

>> +
>> +    struct mm_struct    *mm;
>> +    struct rb_root      pfn_list;       /* pinned Host pfn list */
>> +    struct mutex        pfn_list_lock;  /* mutex for pfn_list */
>
> Seems like we could reduce overhead for the existing use cases by just
> adding a pointer here and making these last 3 entries part of the
> structure that gets pointed to.  Existence of the pointer would replace
> @mediated_device.
>

Ok. (A rough sketch of such a layout is included further below in this
reply.)

>>  };
>>
>>  struct vfio_dma {
>> @@ -79,10 +88,26 @@ struct vfio_dma {
>>
>>  struct vfio_group {
>>      struct iommu_group  *iommu_group;
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>
> Where does CONFIG_MDEV_MODULE come from?
>
> Plus, all the #ifdefs... <cringe>
>

The MDEV config option is tristate; when it is selected as a module,
CONFIG_MDEV_MODULE is set in include/generated/autoconf.h.

The symbols mdev_bus_type, mdev_get_device_by_group() and
mdev_put_device() are only available when the MDEV option is selected as
built-in or modular. If MDEV is not selected, the vfio_iommu_type1 module
should still work for direct device assignment. Without these #ifdefs,
vfio_iommu_type1 fails to load with undefined symbols when MDEV is not
selected.

>> +    struct mdev_device  *mdev;
>
> This gets set on attach_group where we use the iommu_group to lookup
> the mdev, so why can't we do that on the other paths that make use of
> this?  I think this is just holding a reference.
>

The mdev is retrieved in attach_group for two reasons:
1. To take a reference on the mdev, via mdev_get_device_by_group(), when
its iommu_group is attached. That reference is dropped, via
mdev_put_device(), when its iommu_group is detached. This makes sure the
mdev is not freed until its iommu_group is detached from the container.
2. To save a reference to the iommu_data that the vendor driver later
uses to call vfio_pin_pages() and vfio_unpin_pages(). More details below.
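To make the pointer suggestion above concrete, here is a minimal sketch of
what the reworked vfio_domain could look like. The wrapper struct name
(vfio_mediated_state) and the field name (mediated) are made up for
illustration only; the other vfio_domain members are the ones from the
quoted hunk and the existing type1 code.

#include <linux/iommu.h>
#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/mutex.h>
#include <linux/rbtree.h>

/* Hypothetical container for the mediated-only state */
struct vfio_mediated_state {
        struct mm_struct        *mm;            /* mm used for pinning */
        struct rb_root          pfn_list;       /* pinned host pfn list */
        struct mutex            pfn_list_lock;  /* mutex for pfn_list */
};

struct vfio_domain {
        struct iommu_domain     *domain;
        struct list_head        next;
        struct list_head        group_list;
        int                     prot;   /* IOMMU_CACHE */
        bool                    fgsp;   /* Fine-grained super pages */

        /*
         * Non-NULL only for the mediated domain; testing this pointer
         * replaces the @mediated_device flag, and the mediated-only
         * fields no longer enlarge every direct-assignment domain.
         */
        struct vfio_mediated_state      *mediated;
};

Code paths would then test domain->mediated (or iommu->mediated_domain)
instead of the boolean flag.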
>> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>> +                         int prot, unsigned long *pfn)
>>  {
>>      struct page *page[1];
>>      struct vm_area_struct *vma;
>> +    struct mm_struct *local_mm = mm;
>>      int ret = -EFAULT;
>>
>> -    if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
>> +    if (!local_mm && !current->mm)
>> +            return -ENODEV;
>> +
>> +    if (!local_mm)
>> +            local_mm = current->mm;
>
> The above would be much more concise if we just initialized local_mm
> as: mm ? mm : current->mm
>
>> +
>> +    down_read(&local_mm->mmap_sem);
>> +    if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
>> +                              !!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {
>
> Um, the comment for get_user_pages_remote says:
>
> "See also get_user_pages_fast, for performance critical applications."
>
> So what penalty are we imposing on the existing behavior of type1
> here?  Previously we only needed to acquire mmap_sem if
> get_user_pages_fast() didn't work, so the existing use case seems to be
> compromised.
>

Yes. get_user_pages_fast() pins pages from current->mm, but for a
mediated device the mm could be different from current->mm. The penalty
to the existing behavior could be avoided by:

 if (!mm && current->mm)
        get_user_pages_fast();          // take fast path
 else
        get_user_pages_remote();        // take slow path

(A fuller sketch of this split is included at the end of this section.)

>> +long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
>> +                      int prot)
>> +{
>> +    struct vfio_iommu *iommu = iommu_data;
>> +    struct vfio_domain *domain = NULL;
>> +    long unlocked = 0;
>> +    int i;
>> +
>> +    if (!iommu || !pfn)
>> +            return -EINVAL;
>> +
>> +    if (!iommu->mediated_domain)
>> +            return -EINVAL;
>> +
>> +    domain = iommu->mediated_domain;
>
> Again, domain is already validated here.
>
>> +
>> +    for (i = 0; i < npage; i++) {
>> +            struct vfio_pfn *p;
>> +
>> +            /* verify if pfn exist in pfn_list */
>> +            p = vfio_find_pfn(domain, *(pfn + i));
>
> Why are we using array indexing above and array math here?  Were these
> functions written by different people?
>

No, the input argument to vfio_unpin_pages() has always been an array of
pfns to be unpinned.

>> +            if (!p)
>> +                    continue;
>
> Hmm, this seems like more of a bad thing than a continue.
>

The callers of vfio_unpin_pages() are other modules. I feel it's better
to do a sanity check here than to crash.

>>
>>  static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  {
>> @@ -341,6 +580,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>
>>      if (!dma->size)
>>              return;
>> +
>> +    if (list_empty(&iommu->domain_list))
>> +            return;
>
> Huh?  This would be a serious consistency error if this happened for
> the existing use case.
>

This will not happen for the existing use case, i.e. direct device
assignment. This case is only hit when only mediated devices are assigned
and there are no directly assigned devices.
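For clarity, here is a minimal sketch of the fast-path/slow-path split
described above. It is a hedged illustration rather than the final patch:
it reuses the same 4.7-era get_user_pages_fast()/get_user_pages_remote()
calls shown in the quoted hunk, and it omits the PFNMAP/vma fallback that
the real vaddr_get_pfn() also handles.

#include <linux/iommu.h>
#include <linux/mm.h>
#include <linux/sched.h>

static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
                         int prot, unsigned long *pfn)
{
        struct page *page[1];
        int ret;

        if (!mm && !current->mm)
                return -ENODEV;

        if (!mm) {
                /* Existing use case (current->mm): keep the lockless fast path */
                ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
                                          page);
        } else {
                /* Mediated use case: pin from an external mm */
                down_read(&mm->mmap_sem);
                ret = get_user_pages_remote(NULL, mm, vaddr, 1,
                                            !!(prot & IOMMU_WRITE), 0,
                                            page, NULL);
                up_read(&mm->mmap_sem);
        }

        if (ret != 1)
                return -EFAULT;

        *pfn = page_to_pfn(page[0]);
        return 0;
}

This way the existing direct-assignment path never takes mmap_sem unless
the fast path fails, while the mediated path still pins against the saved
mm.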
>> @@ -569,6 +819,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>      uint64_t mask;
>>      struct vfio_dma *dma;
>>      unsigned long pfn;
>> +    struct vfio_domain *domain = NULL;
>>
>>      /* Verify that none of our __u64 fields overflow */
>>      if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>> @@ -611,10 +862,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>      /* Insert zero-sized and grow as we map chunks of it */
>>      vfio_link_dma(iommu, dma);
>>
>> +    /*
>> +     * Skip pin and map if and domain list is empty
>> +     */
>> +    if (list_empty(&iommu->domain_list)) {
>> +            dma->size = size;
>> +            goto map_done;
>> +    }
>
> Again, this would be a serious consistency error for the existing use
> case.  Let's use indicators that are explicit.
>

Why? For the existing use case (i.e. direct device assignment)
domain_list will not be empty; domain_list will only be empty when a
mediated device is assigned and no directly assigned device is present.

>>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>>                                           struct iommu_group *iommu_group)
>>  {
>>      struct vfio_iommu *iommu = iommu_data;
>> -    struct vfio_group *group, *g;
>> +    struct vfio_group *group;
>>      struct vfio_domain *domain, *d;
>>      struct bus_type *bus = NULL;
>>      int ret;
>> @@ -746,14 +1030,21 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>      mutex_lock(&iommu->lock);
>>
>>      list_for_each_entry(d, &iommu->domain_list, next) {
>> -            list_for_each_entry(g, &d->group_list, next) {
>> -                    if (g->iommu_group != iommu_group)
>> -                            continue;
>> +            if (is_iommu_group_present(d, iommu_group)) {
>> +                    mutex_unlock(&iommu->lock);
>> +                    return -EINVAL;
>> +            }
>> +    }
>>
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +    if (iommu->mediated_domain) {
>> +            if (is_iommu_group_present(iommu->mediated_domain,
>> +                                       iommu_group)) {
>>                      mutex_unlock(&iommu->lock);
>>                      return -EINVAL;
>>              }
>>      }
>> +#endif
>>
>>      group = kzalloc(sizeof(*group), GFP_KERNEL);
>>      domain = kzalloc(sizeof(*domain), GFP_KERNEL);
>> @@ -769,6 +1060,36 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>      if (ret)
>>              goto out_free;
>>
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +    if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
>> +            struct mdev_device *mdev = NULL;
>
> Unnecessary initialization.
>
>> +
>> +            mdev = mdev_get_device_by_group(iommu_group);
>> +            if (!mdev)
>> +                    goto out_free;
>> +
>> +            mdev->iommu_data = iommu;
>
> This looks rather sketchy to me, we don't have a mediated driver in
> this series, but presumably the driver blindly calls vfio_pin_pages
> passing mdev->iommu_data and hoping that it's either NULL to generate
> an error or relevant to this iommu backend.  How would we add a second
> mediated driver iommu backend?  We're currently assuming the user
> configured this backend.

If I understand correctly, your question is about two different mediated
devices being assigned to the same container. In that case, the two
mediated devices will have different iommu_groups and both will be added
to the mediated_domain's group_list (iommu->mediated_domain->group_list).

> Should vfio_pin_pages instead have a struct
> device* parameter from which we would lookup the iommu_group and get to
> the vfio_domain?  That's a bit heavy weight, but we need something
> along those lines.
>

There could be multiple mdev devices from the same mediated vendor driver
in one container. In that case, that vendor driver needs a reference to
the container, or to container->iommu_data, to pin and unpin pages.

Similarly, there could be multiple mdev devices from different mediated
vendor drivers in one container; in that case both vendor drivers need a
reference to the container, or to container->iommu_data, to pin and unpin
pages in their drivers.
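As a rough illustration of that flow, here is a hedged sketch of how a
vendor driver might use the container reference. The vfio_unpin_pages()
prototype is the one from the quoted patch; struct my_mdev_state,
my_mdev_unpin_all(), the stored pfn array and the prot value are made up
for this example, and iommu_data is assumed to have been copied from
mdev->iommu_data after the group was attached.

#include <linux/iommu.h>        /* IOMMU_READ, IOMMU_WRITE */
#include <linux/types.h>        /* dma_addr_t */

/* Prototype as posted in the quoted patch (vfio_iommu_type1.c) */
extern long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
                             int prot);

/* Hypothetical per-device bookkeeping kept by a vendor driver */
struct my_mdev_state {
        void            *iommu_data;    /* copied from mdev->iommu_data */
        dma_addr_t      *pinned_pfns;   /* pfns this driver pinned earlier */
        long            npinned;
};

/* Release everything this vendor driver pinned for one mdev */
static void my_mdev_unpin_all(struct my_mdev_state *state)
{
        /* iommu_data is NULL if the mdev's group was never attached */
        if (!state->iommu_data || !state->npinned)
                return;

        vfio_unpin_pages(state->iommu_data, state->pinned_pfns,
                         state->npinned, IOMMU_READ | IOMMU_WRITE);
        state->npinned = 0;
}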
>> +            mdev->iommu_data = iommu;

With the above line, a reference to the container's iommu_data is kept in
the mdev structure when the iommu_group is attached to a container, so
that vendor drivers have the reference they need to pin and unpin pages.

If struct device* is passed as an argument to vfio_pin_pages, then to
find the container for a given struct device *dev we would have to find
the vfio_device/vfio_group from dev, which means traversing
vfio.group_list for every pin and unpin call. This list would be long
when there are many mdev devices in the system. Is there a better way to
find the reference to the container from a struct device*?

>>
>> @@ -930,8 +1289,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>      struct vfio_domain *domain, *domain_tmp;
>>      struct vfio_group *group, *group_tmp;
>>
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +    if (iommu->mediated_domain) {
>> +            domain = iommu->mediated_domain;
>> +            list_for_each_entry_safe(group, group_tmp,
>> +                                     &domain->group_list, next) {
>> +                    if (group->mdev) {
>> +                            group->mdev->iommu_data = NULL;
>> +                            mdev_put_device(group->mdev);
>> +                    }
>> +                    list_del(&group->next);
>> +                    kfree(group);
>> +            }
>> +            vfio_iommu_unpin_api_domain(domain);
>> +            kfree(domain);
>> +            iommu->mediated_domain = NULL;
>> +    }
>> +#endif
>
> I'm not really seeing how this is all that much more maintainable than
> what was proposed previously, has this aspect been worked on since last
> I reviewed this patch?
>

There aren't many changes from v4 to v5 of this patch. Can you be more
specific about your concerns regarding maintainability? I'll definitely
address them.

Thanks,
Kirti