Re: [PATCH v6 2/3] uacce: add uacce driver

On 2019/10/23 3:42 PM, Jean-Philippe Brucker wrote:
On Fri, Oct 18, 2019 at 08:01:44PM +0800, zhangfei.gao@xxxxxxxxxxx wrote:
More generally, it would be nice to use the DMA API when SVA isn't
supported, instead of manually allocating and mapping memory with
iommu_map(). Do we only handcraft these functions in order to have VA ==
IOVA?  On its own it doesn't seem like a strong enough reason to avoid the
DMA API.
Here we use an unmanaged domain to prevent VA conflicts with IOVA.
The goal is still to build a shared virtual address space even though SVA is
not supported.
If SVA isn't supported, having VA == IOVA looks nice but isn't
particularly useful. We could instead require that, if SVA isn't
supported, userspace handles VA and IOVA separately for any DMA region.

Enforcing VA == IOVA adds some unnecessary complexity to this module. In
addition to the special case for software MSIs that is already there
(uacce_iommu_has_sw_msi), it's also not guaranteed that the whole VA space
is representable with IOVAs, you might need to poke holes in the IOVA
space for reserved regions (See iommu.*resv). For example VFIO checks that
the IOVA requested by userspace doesn't fall into a reserved range (see
iova_list in vfio_iommu_type1.c). It also exports to userspace a list of
possible IOVAs through VFIO_IOMMU_GET_INFO.

Letting the DMA API allocate addresses would be simpler, since it already
deals with resv regions and software MSI.
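
For illustration, here is roughly what uacce would need if it kept handing
out IOVAs itself: check every candidate range against the group's reserved
regions. iommu_get_group_resv_regions() is the real API; the helper around
it is hypothetical.

static bool uacce_iova_range_reserved(struct iommu_group *group,
                                      unsigned long iova, size_t size)
{
        struct iommu_resv_region *r, *next;
        LIST_HEAD(resv_regions);
        bool reserved = false;

        /* be conservative if the region list can't be retrieved */
        if (iommu_get_group_resv_regions(group, &resv_regions))
                return true;

        list_for_each_entry(r, &resv_regions, list) {
                /* overlap of [iova, iova+size) with [start, start+length) */
                if (iova < r->start + r->length && r->start < iova + size) {
                        reserved = true;
                        break;
                }
        }

        /* the IOMMU core allocates the entries; the caller frees them */
        list_for_each_entry_safe(r, next, &resv_regions, list)
                kfree(r);

        return reserved;
}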

An IOVA allocated by the DMA API can be the same as a VA, and the device
cannot distinguish them.
So here we borrow the VA from user space and iommu_map() it to the device,
so the VA becomes the IOVA.
Since this IOVA comes from user space, there is no conflict.
That is why the DMA API cannot be used in this case.
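
As a sketch of the scheme being described (uacce_map_user_va() is a
hypothetical helper, not code from the patch): pin the user buffer and map
it into the unmanaged domain at an IOVA equal to its VA, so the device and
the process use the same address.

static int uacce_map_user_va(struct iommu_domain *domain,
                             unsigned long va, size_t size)
{
        int nr_pages = size >> PAGE_SHIFT;
        struct page **pages;
        int i, pinned, ret = 0;

        pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* pin the user pages so the mapping stays valid */
        pinned = get_user_pages_fast(va, nr_pages, FOLL_WRITE, pages);
        if (pinned != nr_pages) {
                ret = -EFAULT;
                goto err_put;
        }

        for (i = 0; i < nr_pages; i++) {
                /* IOVA == VA: map each page at the address the CPU uses */
                ret = iommu_map(domain, va + i * PAGE_SIZE,
                                page_to_phys(pages[i]), PAGE_SIZE,
                                IOMMU_READ | IOMMU_WRITE);
                if (ret) {
                        if (i)
                                iommu_unmap(domain, va, i * PAGE_SIZE);
                        goto err_put;
                }
        }

        /* a real implementation would keep 'pages' in the qfr so it can
         * unpin at unmap time; dropped here to keep the sketch short */
        kfree(pages);
        return 0;

err_put:
        while (pinned > 0)
                put_page(pages[--pinned]);
        kfree(pages);
        return ret;
}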

drivers/vfio/vfio_iommu_type1.c also uses iommu_domain_alloc().
VFIO needs to let userspace pick its IOVA, because the IOVA space is
generally managed by a guest OS. In my opinion this is a baggage that
uacce doesn't need.

If we only supported the DMA API and not unmanaged IOMMU domains,
userspace would need to do a little bit more work by differentiating
between VA and DMA addresses, but that could be abstracted into the uacce
library and it would make the kernel module a lot simpler.
Thanks, Jean.
We will only handle the SVA case in the first patch, then consider the two
directions further.
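
For what it's worth, the userspace side of the DMA-API direction could stay
small; a hypothetical sketch of what the uacce library might track per
region (names are illustrative, not part of this series):

#include <stddef.h>
#include <stdint.h>

struct uacce_dma_buf {
        void     *va;   /* CPU virtual address, from mmap() */
        uint64_t  dma;  /* device address, as returned by the kernel */
        size_t    size;
};

Requests submitted to the device would carry buf->dma, while the process
keeps using buf->va for CPU access.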

[...]
I wish the SVA and !SVA paths were less interleaved. Both models are
fundamentally different:

* Without SVA you cannot share the device between multiple processes. All
    DMA mappings are in the "main", non-PASID address space of the device.

    Note that process isolation without SVA could be achieved with the
    auxiliary domains IOMMU API (introduced primarily for vfio-mdev) but
    this is not the model chosen here.
Does PASID have to be supported in this case?
Yes, you do need PASID support for auxiliary domains, but not PRI/Stall.
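
For reference, a minimal sketch of the auxiliary-domain model (IOMMU API as
of v5.4; the surrounding helper is hypothetical and error unwinding is
trimmed). Each process gets its own unmanaged domain attached with a
dedicated PASID, so isolation works without PRI/Stall:

static int uacce_attach_aux_domain(struct device *dev)
{
        struct iommu_domain *domain;
        int ret;

        ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX);
        if (ret)
                return ret;

        domain = iommu_domain_alloc(dev->bus);
        if (!domain)
                return -ENOMEM;

        ret = iommu_aux_attach_device(domain, dev);
        if (ret)
                return ret;

        /* the PASID to program into the device for this process's queues */
        return iommu_aux_get_pasid(domain, dev);
}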

[...]
+	/* allocate memory */
+	if (flags & UACCE_QFRF_DMA) {
At the moment UACCE_QFRF_DMA is never set, so there is a lot of unused and
possibly untested code in this file. I think it would be simpler to choose
between either DMA API or unmanaged IOMMU domains and stick with it. As
said before, I'd prefer DMA API.
UACCE_QFRF_DMA uses the DMA API; we added it as a quick method, though it
cannot prevent VA conflicts.
We use an ioctl to get the IOVA of the DMA buffer.
Since the interface is not standard, we have kept it and verified it
internally.
As above, it's probably worth exploring this method further for !SVA.

+		qfr->kaddr = dma_alloc_coherent(uacce->pdev,
+						qfr->nr_pages << PAGE_SHIFT,
+						&qfr->dma, GFP_KERNEL);
+		if (!qfr->kaddr) {
+			ret = -ENOMEM;
+			goto err_with_qfr;
+		}
+	} else {
+		ret = uacce_qfr_alloc_pages(qfr);
+		if (ret)
+			goto err_with_qfr;
+	}
+
+	/* map to device */
+	ret = uacce_queue_map_qfr(q, qfr);
Worth moving into the else above.
The idea here is: a) map to the device, b) map to user space.
Yes but dma_alloc_coherent() creates the IOMMU mapping, and
uacce_queue_map_qfr()'s only task is to create the IOMMU mapping when the
DMA API isn't in use, so you could move this call up, right after
uacce_qfr_alloc_pages().
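
Concretely, the suggested restructuring might look like this (everything
follows the quoted hunk except the err_with_pages label, which is an
assumption):

        /* allocate memory and, if not using the DMA API, map it */
        if (flags & UACCE_QFRF_DMA) {
                /* dma_alloc_coherent() already creates the IOMMU mapping */
                qfr->kaddr = dma_alloc_coherent(uacce->pdev,
                                                qfr->nr_pages << PAGE_SHIFT,
                                                &qfr->dma, GFP_KERNEL);
                if (!qfr->kaddr) {
                        ret = -ENOMEM;
                        goto err_with_qfr;
                }
        } else {
                ret = uacce_qfr_alloc_pages(qfr);
                if (ret)
                        goto err_with_qfr;

                /* only the non-DMA-API path needs an explicit mapping */
                ret = uacce_queue_map_qfr(q, qfr);
                if (ret)
                        goto err_with_pages;
        }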

[...]
+	q->state = UACCE_Q_ZOMBIE;
Since the PUT_Q ioctl makes the queue unrecoverable, why should userspace
invoke it instead of immediately calling close()?
We found that close() does not release resources immediately, which may cause
issues when reopening while all queues are in use.
I think the only way to fix that problem is to avoid reallocating the
resources until they are released, because we can't count on userspace to
always call the PUT_Q ioctl. Sometimes the program will crash before that.
Here the PUT_Q ioctl is an optimization rather than a required solution.
In internal tests, we found that a direct close() incurs a long delay, at
worst 1 s. We simulated many processes doing zip then close, and found the
test could not reach its throughput upper limit.
The reason is the delay from closing the fd.
So we added the PUT_Q ioctl as a quick-release method to improve the
experience. The PUT_Q ioctl is optional; the user can still check
available_instances and wait for close().
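
For clarity, the quick-release path amounts to an ioctl handler fragment
like the sketch below (UACCE_CMD_PUT_Q matches the text above;
uacce_queue_drain() and any UACCE_Q_* state other than ZOMBIE are assumed
names):

        case UACCE_CMD_PUT_Q:
                mutex_lock(&uacce_mutex);
                if (q->state != UACCE_Q_ZOMBIE) {
                        /* unmap qfrs and stop the device queue now,
                         * rather than waiting for the final close() */
                        uacce_queue_drain(q);
                        q->state = UACCE_Q_ZOMBIE;
                }
                mutex_unlock(&uacce_mutex);
                break;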

+static int uacce_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct uacce_queue *q = filep->private_data;
+	struct uacce_device *uacce = q->uacce;
+	struct uacce_qfile_region *qfr;
+	enum uacce_qfrt type = 0;
+	unsigned int flags = 0;
+	int ret;
+
+	if (vma->vm_pgoff < UACCE_QFRT_MAX)
+		type = vma->vm_pgoff;
+
+	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
+
+	mutex_lock(&uacce_mutex);
By the way, lockdep detects a possible unsafe locking scenario here,
because we're taking the uacce_mutex even though mmap called us with the
mmap_sem held for writing. Conversely uacce_fops_release() takes the
mmap_sem for writing while holding the uacce_mutex. I think it can be
fixed easily, if we simply remove the use of mmap_sem in
uacce_fops_release(), since it's only taken to do some accounting which
doesn't look right.
The mmap_sem in uacce_fops_release() will be removed.

However, a similar but more complex locking issue comes from the current
use of iommu_sva_bind/unbind_device():

uacce_fops_open:
  iommu_sva_bind_device()
   iommu_sva_bind_group()	[iommu_group->mutex]
     mmu_notifier_get()		[mmap_sem]

uacce_fops_mmap:		[mmap_sem]
				[uacce_mutex]

uacce_fops_release:
				[uacce_mutex]
   iommu_sva_unbind_device()	[iommu_group->mutex]

This circular dependency can be broken by calling iommu_sva_unbind_device()
outside of uacce_mutex, but I think it's worth reworking the queue locking
scheme a little and use fine-grained locking for the queue state.
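
A sketch of that rework, breaking the cycle by unbinding outside any uacce
lock (the per-queue mutex, uacce_put_queue() and the q->handle field are
assumptions):

static int uacce_fops_release(struct inode *inode, struct file *filep)
{
        struct uacce_queue *q = filep->private_data;

        /* tear down queue state under a fine-grained, per-queue lock */
        mutex_lock(&q->mutex);
        uacce_put_queue(q);
        mutex_unlock(&q->mutex);

        /* unbind with no uacce lock held: iommu_group->mutex and
         * mmap_sem are never taken inside uacce locks anymore */
        iommu_sva_unbind_device(q->handle);
        kfree(q);
        return 0;
}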

Something else I noticed is that uacce_idr isn't currently protected. The IDR
API expects the caller to use its own locking scheme. You could replace
it with an xarray, which I think is preferred to IDR now and provides
xa_lock.
Currently, idr_alloc() and idr_remove() are simply protected by uacce_mutex.
We will check xarray; it looks more complicated than IDR.
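
A minimal sketch of the xarray replacement Jean suggests (uacce_xa and the
dev_id field are assumed names; xa_alloc() takes xa_lock internally, so no
extra mutex is needed around ID allocation):

static DEFINE_XARRAY_ALLOC(uacce_xa);

static int uacce_alloc_id(struct uacce_device *uacce)
{
        /* stores uacce in the array and writes the new ID to dev_id */
        return xa_alloc(&uacce_xa, &uacce->dev_id, uacce,
                        xa_limit_32b, GFP_KERNEL);
}

static void uacce_free_id(struct uacce_device *uacce)
{
        xa_erase(&uacce_xa, uacce->dev_id);
}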

Thanks


