On Fri, Oct 18, 2019 at 08:01:44PM +0800, zhangfei.gao@xxxxxxxxxxx wrote:
> > More generally, it would be nice to use the DMA API when SVA isn't
> > supported, instead of manually allocating and mapping memory with
> > iommu_map(). Do we only handcraft these functions in order to have VA ==
> > IOVA? On its own it doesn't seem like a strong enough reason to avoid the
> > DMA API.
> Here we use an unmanaged domain to prevent the VA conflicting with an IOVA.
> The target is still to build shared virtual addressing, though SVA is not
> supported.

If SVA isn't supported, having VA == IOVA looks nice but isn't
particularly useful. We could instead require that, if SVA isn't
supported, userspace handles VA and IOVA separately for any DMA region.
Enforcing VA == IOVA adds some unnecessary complexity to this module. In
addition to the special case for software MSIs that is already there
(uacce_iommu_has_sw_msi), it's also not guaranteed that the whole VA
space is representable with IOVAs; you might need to poke holes in the
IOVA space for reserved regions (see iommu.*resv). For example VFIO
checks that the IOVA requested by userspace doesn't fall into a reserved
range (see iova_list in vfio_iommu_type1.c). It also exports to
userspace a list of possible IOVAs through VFIO_IOMMU_GET_INFO. Letting
the DMA API allocate addresses would be simpler, since it already deals
with resv regions and software MSI.

> The IOVA from the DMA API can be the same as a VA, and the device cannot
> distinguish them.
> So here we borrow the VA from user space and iommu_map() it to the device,
> and the VA becomes the IOVA.
> Since this IOVA comes from user space, there is no conflict.
> So the DMA API cannot be used in this case.
>
> drivers/vfio/vfio_iommu_type1.c also uses iommu_domain_alloc().

VFIO needs to let userspace pick its IOVA, because the IOVA space is
generally managed by a guest OS. In my opinion this is baggage that
uacce doesn't need. If we only supported the DMA API and not unmanaged
IOMMU domains, userspace would need to do a little more work by
differentiating between VA and DMA addresses, but that could be
abstracted into the uacce library and it would make the kernel module a
lot simpler.

[...]
> > I wish the SVA and !SVA paths were less interleaved. Both models are
> > fundamentally different:
> >
> > * Without SVA you cannot share the device between multiple processes. All
> >   DMA mappings are in the "main", non-PASID address space of the device.
> >
> >   Note that process isolation without SVA could be achieved with the
> >   auxiliary domains IOMMU API (introduced primarily for vfio-mdev) but
> >   this is not the model chosen here.
> Does PASID have to be supported for this case?

Yes, you do need PASID support for auxiliary domains, but not PRI/Stall.

[...]
> > > +	/* allocate memory */
> > > +	if (flags & UACCE_QFRF_DMA) {
> >
> > At the moment UACCE_QFRF_DMA is never set, so there is a lot of unused and
> > possibly untested code in this file. I think it would be simpler to choose
> > between either DMA API or unmanaged IOMMU domains and stick with it. As
> > said before, I'd prefer DMA API.
> UACCE_QFRF_DMA uses the DMA API; we used it as a quick method, though it
> cannot prevent VA conflicts.
> We use an ioctl to get the IOVA of the DMA buffer.
> Since the interface is not standard, we kept it and verified it
> internally.

As above, it's probably worth exploring this method further for !SVA.
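To make that concrete, here is a rough, untested sketch of what the !SVA
path could look like with the DMA API. uacce_qfr_alloc_dma() and
UACCE_CMD_GET_DMA_ADDR are names I made up for the example; the point is
only that the region keeps a dma_addr_t which userspace fetches
explicitly, instead of assuming VA == IOVA:

	/* Sketch: let the DMA API pick the IOVA. It already avoids
	 * reserved regions and the software MSI window. */
	static int uacce_qfr_alloc_dma(struct uacce_device *uacce,
				       struct uacce_qfile_region *qfr)
	{
		qfr->kaddr = dma_alloc_coherent(uacce->pdev,
						qfr->nr_pages << PAGE_SHIFT,
						&qfr->dma, GFP_KERNEL);
		return qfr->kaddr ? 0 : -ENOMEM;
	}

	/* In the ioctl handler: return the DMA address, which userspace
	 * then passes to the device instead of the VA. */
	case UACCE_CMD_GET_DMA_ADDR:	/* made-up name */
		return put_user((u64)qfr->dma, (u64 __user *)arg);

The uacce library could wrap the mmap + ioctl pair so that applications
never see the difference.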
> > > +		qfr->kaddr = dma_alloc_coherent(uacce->pdev,
> > > +						qfr->nr_pages << PAGE_SHIFT,
> > > +						&qfr->dma, GFP_KERNEL);
> > > +		if (!qfr->kaddr) {
> > > +			ret = -ENOMEM;
> > > +			goto err_with_qfr;
> > > +		}
> > > +	} else {
> > > +		ret = uacce_qfr_alloc_pages(qfr);
> > > +		if (ret)
> > > +			goto err_with_qfr;
> > > +	}
> > > +
> > > +	/* map to device */
> > > +	ret = uacce_queue_map_qfr(q, qfr);
> >
> > Worth moving into the else above.
> The idea here is: a, map to device; b, map to user space.

Yes, but dma_alloc_coherent() creates the IOMMU mapping, and
uacce_queue_map_qfr()'s only task is to create the IOMMU mapping when
the DMA API isn't in use, so you could move this call up, right after
uacce_qfr_alloc_pages().

[...]
> > > +	q->state = UACCE_Q_ZOMBIE;
> >
> > Since the PUT_Q ioctl makes the queue unrecoverable, why should userspace
> > invoke it instead of immediately calling close()?
> We found that close() does not release resources immediately, which may
> cause an issue when re-opening while all queues are in use.

I think the only way to fix that problem is to avoid reallocating the
resources until they are released, because we can't count on userspace
to always call the PUT_Q ioctl. Sometimes the program will crash before
that.

> > > +static int uacce_fops_mmap(struct file *filep, struct vm_area_struct *vma)
> > > +{
> > > +	struct uacce_queue *q = filep->private_data;
> > > +	struct uacce_device *uacce = q->uacce;
> > > +	struct uacce_qfile_region *qfr;
> > > +	enum uacce_qfrt type = 0;
> > > +	unsigned int flags = 0;
> > > +	int ret;
> > > +
> > > +	if (vma->vm_pgoff < UACCE_QFRT_MAX)
> > > +		type = vma->vm_pgoff;
> > > +
> > > +	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
> > > +
> > > +	mutex_lock(&uacce_mutex);

By the way, lockdep detects a possible unsafe locking scenario here,
because we're taking the uacce_mutex even though mmap called us with the
mmap_sem held for writing. Conversely uacce_fops_release() takes the
mmap_sem for writing while holding the uacce_mutex. I think it can be
fixed easily if we simply remove the use of mmap_sem in
uacce_fops_release(), since it's only taken to do some accounting which
doesn't look right.

However, a similar but more complex locking issue comes from the current
use of iommu_sva_bind/unbind_device():

  uacce_fops_open:
	iommu_sva_bind_device()
	  iommu_sva_bind_group()	[iommu_group->mutex]
	    mmu_notifier_get()		[mmap_sem]

  uacce_fops_mmap:
	[mmap_sem]
	  [uacce_mutex]

  uacce_fops_release:
	[uacce_mutex]
	  iommu_sva_unbind_device()	[iommu_group->mutex]

This circular dependency can be broken by calling
iommu_sva_unbind_device() outside of uacce_mutex, but I think it's worth
reworking the queue locking scheme a little and using fine-grained
locking for the queue state.

Something else I noticed is that uacce_idr isn't currently protected.
The IDR API expects the caller to use its own locking scheme. You could
replace it with an xarray, which I think is preferred to IDR now and
provides its own xa_lock.
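For example (rough sketch, assuming the ID that the IDR currently
allocates is stored in something like uacce->dev_id):

	static DEFINE_XARRAY_ALLOC(uacce_xa);

	/* in uacce_register(): xa_alloc() takes the xa_lock internally */
	ret = xa_alloc(&uacce_xa, &uacce->dev_id, uacce, xa_limit_32b,
		       GFP_KERNEL);
	if (ret)
		return ret;

	/* in uacce_unregister() */
	xa_erase(&uacce_xa, uacce->dev_id);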
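And coming back to the circular dependency above, the minimal fix could
look something like this in uacce_fops_release() (sketch only, with
made-up helper and field names; fine-grained queue locking would still
be the better rework):

	static int uacce_fops_release(struct inode *inode, struct file *filep)
	{
		struct uacce_queue *q = filep->private_data;

		mutex_lock(&uacce_mutex);
		/* tear down queue state and unmap regions under uacce_mutex */
		uacce_put_queue(q);		/* made-up helper */
		mutex_unlock(&uacce_mutex);

		/*
		 * Only take iommu_group->mutex after uacce_mutex has been
		 * released, breaking the cycle with uacce_fops_open().
		 */
		iommu_sva_unbind_device(q->handle);
		kfree(q);
		return 0;
	}

Thanks,
Jean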