On 6/22/2016 9:16 AM, Alex Williamson wrote:
> On Mon, 20 Jun 2016 22:01:48 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>
>>
>>  struct vfio_iommu {
>>      struct list_head    domain_list;
>> +    struct vfio_domain  *mediated_domain;
>
> I'm not really a fan of how this is so often used to special case the
> code...
>
>>      struct mutex        lock;
>>      struct rb_root      dma_list;
>>      bool                v2;
>> @@ -67,6 +69,13 @@ struct vfio_domain {
>>      struct list_head    group_list;
>>      int                 prot;       /* IOMMU_CACHE */
>>      bool                fgsp;       /* Fine-grained super pages */
>> +
>> +    /* Domain for mediated device which is without physical IOMMU */
>> +    bool                mediated_device;
>
> But sometimes we use this to special case the code and other times we
> use domain_list being empty.  I thought the argument against pulling
> code out to a shared file was that this approach could be made
> maintainable.
>

Functions that take struct vfio_domain *domain as an argument and are
intended to operate on that single domain check if
(domain->mediated_device), e.g. map_try_harder(), vfio_iommu_replay() and
vfio_test_domain_fgsp(). These checks could be removed, but then it would
be the caller's responsibility to make sure these functions are never
called for the mediated_domain.

Functions that take struct vfio_iommu *iommu as an argument and traverse
domain_list, either to find a domain or to operate on each domain in the
list, check if (list_empty(&iommu->domain_list)), e.g. vfio_unmap_unpin(),
vfio_iommu_map() and vfio_dma_do_map().

>> +
>> +    struct mm_struct    *mm;
>> +    struct rb_root      pfn_list;       /* pinned Host pfn list */
>> +    struct mutex        pfn_list_lock;  /* mutex for pfn_list */
>
> Seems like we could reduce overhead for the existing use cases by just
> adding a pointer here and making these last 3 entries part of the
> structure that gets pointed to.  Existence of the pointer would replace
> @mediated_device.
>

Ok. (A rough sketch of such a layout is included further below in this
reply.)

>>  };
>>
>>  struct vfio_dma {
>> @@ -79,10 +88,26 @@ struct vfio_dma {
>>
>>  struct vfio_group {
>>      struct iommu_group  *iommu_group;
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>
> Where does CONFIG_MDEV_MODULE come from?
>
> Plus, all the #ifdefs... <cringe>
>

The MDEV config option is tristate; when it is selected as a module,
CONFIG_MDEV_MODULE is set in include/generated/autoconf.h.

The symbols mdev_bus_type, mdev_get_device_by_group() and
mdev_put_device() are only available when the MDEV option is selected as
built-in or modular. If MDEV is not selected, the vfio_iommu_type1 module
should still work for direct device assignment. Without these #ifdefs,
vfio_iommu_type1 fails to load with undefined symbols when MDEV is not
selected.

>> +    struct mdev_device  *mdev;
>
> This gets set on attach_group where we use the iommu_group to lookup
> the mdev, so why can't we do that on the other paths that make use of
> this?  I think this is just holding a reference.
>

The mdev is retrieved in attach_group for two reasons:
1. To take a reference on the mdev, via mdev_get_device_by_group(), when
its iommu_group is attached. That reference is dropped, via
mdev_put_device(), when its iommu_group is detached. This makes sure the
mdev is not freed until its iommu_group is detached from the container.
2. To save a reference to the iommu_data that the vendor driver later
uses to call vfio_pin_pages() and vfio_unpin_pages(). More details below.
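To make the pointer suggestion above concrete, here is a minimal sketch of
what the reworked vfio_domain could look like. The wrapper struct name
(vfio_mediated_state) and the field name (mediated) are made up for
illustration only; the other vfio_domain members are the ones from the
quoted hunk and the existing type1 code.

#include <linux/iommu.h>
#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/mutex.h>
#include <linux/rbtree.h>

/* Hypothetical container for the mediated-only state */
struct vfio_mediated_state {
        struct mm_struct        *mm;            /* mm used for pinning */
        struct rb_root          pfn_list;       /* pinned host pfn list */
        struct mutex            pfn_list_lock;  /* mutex for pfn_list */
};

struct vfio_domain {
        struct iommu_domain     *domain;
        struct list_head        next;
        struct list_head        group_list;
        int                     prot;   /* IOMMU_CACHE */
        bool                    fgsp;   /* Fine-grained super pages */

        /*
         * Non-NULL only for the mediated domain; testing this pointer
         * replaces the @mediated_device flag, and the mediated-only
         * fields no longer enlarge every direct-assignment domain.
         */
        struct vfio_mediated_state      *mediated;
};

Code paths would then test domain->mediated (or iommu->mediated_domain)
instead of the boolean flag.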
>> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>> +                         int prot, unsigned long *pfn)
>>  {
>>      struct page *page[1];
>>      struct vm_area_struct *vma;
>> +    struct mm_struct *local_mm = mm;
>>      int ret = -EFAULT;
>>
>> -    if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
>> +    if (!local_mm && !current->mm)
>> +            return -ENODEV;
>> +
>> +    if (!local_mm)
>> +            local_mm = current->mm;
>
> The above would be much more concise if we just initialized local_mm
> as: mm ? mm : current->mm
>
>> +
>> +    down_read(&local_mm->mmap_sem);
>> +    if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
>> +                              !!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {
>
> Um, the comment for get_user_pages_remote says:
>
> "See also get_user_pages_fast, for performance critical applications."
>
> So what penalty are we imposing on the existing behavior of type1
> here?  Previously we only needed to acquire mmap_sem if
> get_user_pages_fast() didn't work, so the existing use case seems to be
> compromised.
>

Yes. get_user_pages_fast() pins pages from current->mm, but for a
mediated device the mm could be different from current->mm. The penalty
to the existing behavior could be avoided by:

 if (!mm && current->mm)
        get_user_pages_fast();          // take fast path
 else
        get_user_pages_remote();        // take slow path

(A fuller sketch of this split is included at the end of this section.)

>> +long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
>> +                      int prot)
>> +{
>> +    struct vfio_iommu *iommu = iommu_data;
>> +    struct vfio_domain *domain = NULL;
>> +    long unlocked = 0;
>> +    int i;
>> +
>> +    if (!iommu || !pfn)
>> +            return -EINVAL;
>> +
>> +    if (!iommu->mediated_domain)
>> +            return -EINVAL;
>> +
>> +    domain = iommu->mediated_domain;
>
> Again, domain is already validated here.
>
>> +
>> +    for (i = 0; i < npage; i++) {
>> +            struct vfio_pfn *p;
>> +
>> +            /* verify if pfn exist in pfn_list */
>> +            p = vfio_find_pfn(domain, *(pfn + i));
>
> Why are we using array indexing above and array math here?  Were these
> functions written by different people?
>

No, the input argument to vfio_unpin_pages() has always been an array of
pfns to be unpinned.

>> +            if (!p)
>> +                    continue;
>
> Hmm, this seems like more of a bad thing than a continue.
>

The callers of vfio_unpin_pages() are other modules. I feel it's better
to do a sanity check here than to crash.

>>
>>  static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  {
>> @@ -341,6 +580,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>
>>      if (!dma->size)
>>              return;
>> +
>> +    if (list_empty(&iommu->domain_list))
>> +            return;
>
> Huh?  This would be a serious consistency error if this happened for
> the existing use case.
>

This will not happen for the existing use case, i.e. direct device
assignment. This case is only hit when only mediated devices are assigned
and there are no directly assigned devices.
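For clarity, here is a minimal sketch of the fast-path/slow-path split
described above. It is a hedged illustration rather than the final patch:
it reuses the same 4.7-era get_user_pages_fast()/get_user_pages_remote()
calls shown in the quoted hunk, and it omits the PFNMAP/vma fallback that
the real vaddr_get_pfn() also handles.

#include <linux/iommu.h>
#include <linux/mm.h>
#include <linux/sched.h>

static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
                         int prot, unsigned long *pfn)
{
        struct page *page[1];
        int ret;

        if (!mm && !current->mm)
                return -ENODEV;

        if (!mm) {
                /* Existing use case (current->mm): keep the lockless fast path */
                ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
                                          page);
        } else {
                /* Mediated use case: pin from an external mm */
                down_read(&mm->mmap_sem);
                ret = get_user_pages_remote(NULL, mm, vaddr, 1,
                                            !!(prot & IOMMU_WRITE), 0,
                                            page, NULL);
                up_read(&mm->mmap_sem);
        }

        if (ret != 1)
                return -EFAULT;

        *pfn = page_to_pfn(page[0]);
        return 0;
}

This way the existing direct-assignment path never takes mmap_sem unless
the fast path fails, while the mediated path still pins against the saved
mm.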
>> @@ -569,6 +819,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>      uint64_t mask;
>>      struct vfio_dma *dma;
>>      unsigned long pfn;
>> +    struct vfio_domain *domain = NULL;
>>
>>      /* Verify that none of our __u64 fields overflow */
>>      if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>> @@ -611,10 +862,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>      /* Insert zero-sized and grow as we map chunks of it */
>>      vfio_link_dma(iommu, dma);
>>
>> +    /*
>> +     * Skip pin and map if and domain list is empty
>> +     */
>> +    if (list_empty(&iommu->domain_list)) {
>> +            dma->size = size;
>> +            goto map_done;
>> +    }
>
> Again, this would be a serious consistency error for the existing use
> case.  Let's use indicators that are explicit.
>

Why? For the existing use case (i.e. direct device assignment)
domain_list will not be empty; domain_list will only be empty when a
mediated device is assigned and no directly assigned device is present.

>>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>>                                           struct iommu_group *iommu_group)
>>  {
>>      struct vfio_iommu *iommu = iommu_data;
>> -    struct vfio_group *group, *g;
>> +    struct vfio_group *group;
>>      struct vfio_domain *domain, *d;
>>      struct bus_type *bus = NULL;
>>      int ret;
>> @@ -746,14 +1030,21 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>      mutex_lock(&iommu->lock);
>>
>>      list_for_each_entry(d, &iommu->domain_list, next) {
>> -            list_for_each_entry(g, &d->group_list, next) {
>> -                    if (g->iommu_group != iommu_group)
>> -                            continue;
>> +            if (is_iommu_group_present(d, iommu_group)) {
>> +                    mutex_unlock(&iommu->lock);
>> +                    return -EINVAL;
>> +            }
>> +    }
>>
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +    if (iommu->mediated_domain) {
>> +            if (is_iommu_group_present(iommu->mediated_domain,
>> +                                       iommu_group)) {
>>                      mutex_unlock(&iommu->lock);
>>                      return -EINVAL;
>>              }
>>      }
>> +#endif
>>
>>      group = kzalloc(sizeof(*group), GFP_KERNEL);
>>      domain = kzalloc(sizeof(*domain), GFP_KERNEL);
>> @@ -769,6 +1060,36 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>      if (ret)
>>              goto out_free;
>>
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +    if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
>> +            struct mdev_device *mdev = NULL;
>
> Unnecessary initialization.
>
>> +
>> +            mdev = mdev_get_device_by_group(iommu_group);
>> +            if (!mdev)
>> +                    goto out_free;
>> +
>> +            mdev->iommu_data = iommu;
>
> This looks rather sketchy to me, we don't have a mediated driver in
> this series, but presumably the driver blindly calls vfio_pin_pages
> passing mdev->iommu_data and hoping that it's either NULL to generate
> an error or relevant to this iommu backend.  How would we add a second
> mediated driver iommu backend?  We're currently assuming the user
> configured this backend.

If I understand correctly, your question is about two different mediated
devices being assigned to the same container. In that case, the two
mediated devices will have different iommu_groups and both will be added
to the mediated_domain's group_list (iommu->mediated_domain->group_list).

> Should vfio_pin_pages instead have a struct
> device* parameter from which we would lookup the iommu_group and get to
> the vfio_domain?  That's a bit heavy weight, but we need something
> along those lines.
>

There could be multiple mdev devices from the same mediated vendor driver
in one container. In that case, that vendor driver needs a reference to
the container, or to container->iommu_data, to pin and unpin pages.

Similarly, there could be multiple mdev devices from different mediated
vendor drivers in one container; in that case both vendor drivers need a
reference to the container, or to container->iommu_data, to pin and unpin
pages in their drivers.
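As a rough illustration of that flow, here is a hedged sketch of how a
vendor driver might use the container reference. The vfio_unpin_pages()
prototype is the one from the quoted patch; struct my_mdev_state,
my_mdev_unpin_all(), the stored pfn array and the prot value are made up
for this example, and iommu_data is assumed to have been copied from
mdev->iommu_data after the group was attached.

#include <linux/iommu.h>        /* IOMMU_READ, IOMMU_WRITE */
#include <linux/types.h>        /* dma_addr_t */

/* Prototype as posted in the quoted patch (vfio_iommu_type1.c) */
extern long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
                             int prot);

/* Hypothetical per-device bookkeeping kept by a vendor driver */
struct my_mdev_state {
        void            *iommu_data;    /* copied from mdev->iommu_data */
        dma_addr_t      *pinned_pfns;   /* pfns this driver pinned earlier */
        long            npinned;
};

/* Release everything this vendor driver pinned for one mdev */
static void my_mdev_unpin_all(struct my_mdev_state *state)
{
        /* iommu_data is NULL if the mdev's group was never attached */
        if (!state->iommu_data || !state->npinned)
                return;

        vfio_unpin_pages(state->iommu_data, state->pinned_pfns,
                         state->npinned, IOMMU_READ | IOMMU_WRITE);
        state->npinned = 0;
}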
>> +            mdev->iommu_data = iommu;

With the above line, a reference to the container's iommu_data is kept in
the mdev structure when the iommu_group is attached to a container, so
that vendor drivers have the reference they need to pin and unpin pages.

If struct device* is passed as an argument to vfio_pin_pages, then to
find the container for a given struct device *dev we would have to find
the vfio_device/vfio_group from dev, which means traversing
vfio.group_list for every pin and unpin call. This list would be long
when there are many mdev devices in the system. Is there a better way to
find the reference to the container from a struct device*?

>>
>> @@ -930,8 +1289,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>      struct vfio_domain *domain, *domain_tmp;
>>      struct vfio_group *group, *group_tmp;
>>
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +    if (iommu->mediated_domain) {
>> +            domain = iommu->mediated_domain;
>> +            list_for_each_entry_safe(group, group_tmp,
>> +                                     &domain->group_list, next) {
>> +                    if (group->mdev) {
>> +                            group->mdev->iommu_data = NULL;
>> +                            mdev_put_device(group->mdev);
>> +                    }
>> +                    list_del(&group->next);
>> +                    kfree(group);
>> +            }
>> +            vfio_iommu_unpin_api_domain(domain);
>> +            kfree(domain);
>> +            iommu->mediated_domain = NULL;
>> +    }
>> +#endif
>
> I'm not really seeing how this is all that much more maintainable than
> what was proposed previously, has this aspect been worked on since last
> I reviewed this patch?
>

There aren't many changes from v4 to v5 of this patch. Can you be more
specific about your concerns regarding maintainability? I'll definitely
address them.

Thanks,
Kirti