Re: [PATCH v7 4/5] RDMA/mlx5: Support dma-buf based userspace memory region

Jason Gunthorpe <jgg@xxxxxxxx> · Wed, 4 Nov 2020 20:07:21 -0400

On Wed, Nov 04, 2020 at 02:06:34PM -0800, Jianxin Xiong wrote:
> +	umem = ib_umem_dmabuf_get(&dev->ib_dev, offset, length, fd, access_flags,
> +				  &mlx5_ib_dmabuf_attach_ops);
> +	if (IS_ERR(umem)) {
> +		mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(umem));
> +		return ERR_PTR(PTR_ERR(umem));
> +	}
> +
> +	mr = alloc_mr_from_cache(pd, umem, virt_addr, access_flags);

It is very subtle, but this calls mlx5_umem_find_best_pgsz() which
calls ib_umem_find_best_pgsz() which goes over the SGL to determine
the page size to use.

As part of this it does validation of the IOVA vs first page offset vs
first page dma address. These little details come into play if the
IOVA and offset are not PAGE_SIZE aligned, which is very possible if
the dma buf exporter or system PAGE_SIZE is over 4k.

In other words, the dma_address of the first SGE must be the page
aligned starting point of the MR. Because the 'skip' is only applied
later, when the SGL is broken into blocks, ib_umem_find_best_pgsz()
sees an invalid page size.
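
To make the alignment constraint concrete, here is a small userspace
model. It is a simplification and not the kernel's
ib_umem_find_best_pgsz(); the function name best_pgsz_model() and its
arguments are made up for illustration. It only captures the rule that
the IOVA and the first DMA address must share the same offset within a
page of the chosen size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model (an assumption, not the kernel implementation):
 * return the largest power-of-two page size, no larger than max_pgsz,
 * for which iova and dma_addr have the same offset within the page.
 */
static uint64_t best_pgsz_model(uint64_t iova, uint64_t dma_addr,
				uint64_t max_pgsz)
{
	uint64_t diff = iova ^ dma_addr;	/* bits where the two differ */
	uint64_t pgsz = max_pgsz;

	/* max_pgsz must be a non-zero power of two */
	assert(max_pgsz && (max_pgsz & (max_pgsz - 1)) == 0);

	/* Shrink the page size until all in-page offset bits agree */
	while (pgsz > 1 && (diff & (pgsz - 1)))
		pgsz >>= 1;
	return pgsz;
}
```

In this model an IOVA of 0x10000 paired with a first DMA address of
0x7800 (an SGE that was not trimmed to the page aligned start) yields
2048 when 4096 is requested, so no 4k page size fits. With the first
DMA address at 0x7000 the model returns 4096.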

Slicing it has to be done in a way that gives a properly formed
SGL. 

My suggestion is to just change the SGL in place. Iterate to the
starting SGE in the SGL and assign it to the sg table, modify it to
have an offset dma_address and reduced length.

Count the number of SGEs needed to span the remaining range and use
that as the new nmap.

Edit the last SGE to have a reduced length.

Upon unmap undo the edits so the exporter doesn't see the mangled SGL.

It would be saner if the exporter could do this, but oh well.

Approximately like this:

	struct ib_umem *umem = &umem_p->umem;
	unsigned long start = ALIGN_DOWN(umem->address, PAGE_SIZE);
	unsigned long end = ALIGN(umem->address + umem->length, PAGE_SIZE);
	unsigned long cur = 0;
	struct scatterlist *sg;
	int i;

	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
		/* First SGE that reaches into the page aligned range */
		if (cur <= start && cur + sg_dma_len(sg) > start) {
			unsigned long offset = start - cur;

			umem_p->first_sg = sg;
			umem_p->first_dma_address = sg->dma_address;
			umem_p->first_dma_length = sg_dma_len(sg);
			umem_p->first_length = sg->length;
			sg->dma_address += offset;
			sg_dma_len(sg) -= offset;
			sg->length -= offset;
			/* Keep cur + sg_dma_len(sg) at the original end of this SGE */
			cur += offset;
		}
		/* SGE that reaches the end of the page aligned range */
		if (end <= cur + sg_dma_len(sg)) {
			unsigned long trim = cur + sg_dma_len(sg) - end;

			umem_p->last_sg = sg;
			umem_p->last_dma_length = sg_dma_len(sg);
			umem_p->last_length = sg->length;
			sg_dma_len(sg) -= trim;
			sg->length -= trim;
			return npages;
		}
		cur += sg_dma_len(sg);
	}
	/* The trimmed SGL must exactly match the page aligned length of
	   the umem; getting here means the SGL was too short */
	return -EINVAL;
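
The trim and the undo at unmap can be tried out in userspace with a
model like the one below. This is only a sketch under stated
assumptions: struct sge_model, struct trim_state, trim_sgl() and
restore_sgl() are hypothetical names standing in for the scatterlist
and the saved first/last fields, and start/end are assumed to be
already page aligned byte offsets into the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the DMA fields of struct scatterlist */
struct sge_model {
	uint64_t dma_address;
	uint64_t dma_len;
};

/* Hypothetical record of the edits, so they can be undone at unmap */
struct trim_state {
	struct sge_model *first, *last;
	uint64_t first_addr, first_len, last_len;
	size_t nmap;		/* entries spanning [start, end) */
};

/* Undo the edits; last before first so a shared entry is fully restored */
static void restore_sgl(struct trim_state *st)
{
	if (st->last)
		st->last->dma_len = st->last_len;
	if (st->first) {
		st->first->dma_address = st->first_addr;
		st->first->dma_len = st->first_len;
	}
	st->first = st->last = NULL;
}

/*
 * Trim sgl[0..n) in place so that the entries from st->first to
 * st->last cover exactly [start, end). Returns 0 on success, or -1
 * (with the list left unmodified) if the list is too short.
 */
static int trim_sgl(struct sge_model *sgl, size_t n, uint64_t start,
		    uint64_t end, struct trim_state *st)
{
	uint64_t cur = 0;
	size_t i;

	assert(start < end);
	st->first = st->last = NULL;
	st->nmap = 0;

	for (i = 0; i < n; i++) {
		struct sge_model *sg = &sgl[i];
		uint64_t sg_end = cur + sg->dma_len;	/* original end */

		if (!st->first && sg_end > start) {
			uint64_t off = start - cur;

			st->first = sg;
			st->first_addr = sg->dma_address;
			st->first_len = sg->dma_len;
			sg->dma_address += off;
			sg->dma_len -= off;
		}
		if (st->first && sg_end >= end) {
			st->last = sg;
			st->last_len = sg->dma_len;
			sg->dma_len -= sg_end - end;
			st->nmap = (size_t)(sg - st->first) + 1;
			return 0;
		}
		cur = sg_end;
	}
	restore_sgl(st);	/* too short: undo any edit to the first entry */
	return -1;
}
```

A caller would run trim_sgl() after mapping, hand st->first and
st->nmap to the page size code, and call restore_sgl() before
unmapping so the exporter gets back the list it produced. Tracking the
original end of each entry separately (sg_end) avoids mixing the
trimmed and untrimmed lengths when one entry is both first and last.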

Further, this really only works if the umem->page_size is locked to 4k
because this doesn't have code to resize the MKEY, or change the
underlying page size when the SGL changes.

So, I'd say put something like the above in the core code to validate
and properly form the umem->sgl.

Then modify the alloc_mr_from_cache to use only PAGE_SIZE:

	if (umem->is_dma_buf)
		page_size = ib_umem_find_best_pgsz(umem, PAGE_SIZE, iova);
	else
		page_size = mlx5_umem_find_best_pgsz(umem, mkc, log_page_size, 0,
						     iova);

Jason