Re: [PATCH RFC 3/4] RDMA/umem: Add API to return optimal HW DMA addresses from SG list

Jason Gunthorpe <jgg@xxxxxxxx> · Mon, 22 Oct 2018 15:40:20 -0600

On Fri, Oct 19, 2018 at 06:34:08PM -0500, Shiraz Saleem wrote:
> This helper iterates the SG list and returns suitable
> HW aligned DMA addresses within a driver supported
> page size. The implementation is intended to work for
> HW that support single page sizes or mixed page sizes
> in an MR. This avoids the need for having driver specific
> algorithms to achieve the same thing and redundant walks
> of the SG list.
> 
> Suggested-by: Jason Gunthorpe <jgg@xxxxxxxx>
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@xxxxxxxxx>
> Signed-off-by: Shiraz Saleem <shiraz.saleem@xxxxxxxxx>
>  drivers/infiniband/core/umem.c | 68 ++++++++++++++++++++++++++++++++++++++++++
>  include/rdma/ib_umem.h         | 19 ++++++++++++
>  2 files changed, 87 insertions(+)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 04071b5..cba79ab 100644
> +++ b/drivers/infiniband/core/umem.c
> @@ -160,6 +160,74 @@ unsigned long ib_umem_find_single_pg_size(struct ib_umem *umem,
>  }
>  EXPORT_SYMBOL(ib_umem_find_single_pg_size);
>  
> +void ib_umem_start_phys_iter(struct ib_umem *umem,
> +			     struct sg_phys_iter *sg_phys_iter)
> +{
> +	memset(sg_phys_iter, 0, sizeof(struct sg_phys_iter));
> +	sg_phys_iter->sg = umem->sg_head.sgl;
> +}
> +EXPORT_SYMBOL(ib_umem_start_phys_iter);

I think the supported_pgsz should be an input here, it doesn't make
sense to change the sizes during iteration, does it? Would it even
work right?

> +
> +/**
> + * ib_umem_next_phys_iter - SG list iterator that returns aligned HW address
> + * @umem: umem struct
> + * @sg_phys_iter: SG HW address iterator
> + * @supported_pgsz: bitmask of HW supported page sizes
> + *
> + * This helper iterates over the SG list and returns the HW
> + * address aligned to a supported HW page size.
> + *
> + * The algorithm differs slightly between HW that supports single
> + * page sizes vs mixed page sizes in an MR. For example, if an
> + * MR of size 4M-4K, starts at an offset PAGE_SIZE (ex: 4K) into
> + * a 2M page; HW that supports multiple page sizes (ex: 4K, 2M)
> + * would get 511 4K pages and one 2M page. Single page support
> + * HW would get back two 2M pages or 1023 4K pages.

That doesn't seem quite right, the first and last pages should always
be the largest needed to get to alignment, as HW always has an
offset/length sort of scheme to make this efficient.

So I would expect 4M-4k to always return two 2M pages if the HW
supports 2M, even if supports smaller sizes.

For this to work the algorithm needs to know start/end...

Maybe we should not support multiple page sizes yet, do we have HW
that can even do that?

> + */
> +bool ib_umem_next_phys_iter(struct ib_umem *umem,
> +			    struct sg_phys_iter *sg_phys_iter,
> +			    unsigned long supported_pgsz)
> +{
> +	unsigned long pg_mask, offset;
> +	int pg_bit;

unsigned

> +
> +	if (!sg_phys_iter->sg || !supported_pgsz)
> +		return false;

> +	if (sg_phys_iter->remaining) {
> +		sg_phys_iter->phyaddr += sg_phys_iter->len;
> +	} else {
> +		sg_phys_iter->phyaddr = sg_dma_address(sg_phys_iter->sg);
> +		sg_phys_iter->remaining = sg_dma_len(sg_phys_iter->sg);

Why not put the
		sg_phys_iter->sg = sg_next(sg_phys_iter->sg);

Here?

> +	}
> +
> +	/* Single page support in MR */
> +	if (hweight_long(supported_pgsz) == 1) {
> +		pg_bit = fls64(supported_pgsz) - 1;

Then this could be precomputed once

> +	} else {
> +		/* Mixed page support in MR*/
> +		pg_bit = ib_umem_find_pg_bit(sg_phys_iter->phyaddr,
> +					     supported_pgsz);
> +	}
> +
> +	/* page bit cannot be less than the lowest supported HW size */
> +	if (WARN_ON(pg_bit < __ffs(supported_pgsz)))
> +		return false;

and this

> +
> +	pg_mask = ~((1 << pg_bit) - 1);
> +
> +	offset = sg_phys_iter->phyaddr & ~pg_mask;
> +	sg_phys_iter->phyaddr = sg_phys_iter->phyaddr & pg_mask;
> +	sg_phys_iter->len = min_t(int, sg_phys_iter->remaining,
> +				  (1 << (pg_bit)) - offset);
> +	sg_phys_iter->remaining -= sg_phys_iter->len;
> +	if (!sg_phys_iter->remaining)
> +		sg_phys_iter->sg = sg_next(sg_phys_iter->sg);
> +
> +	return true;
> +}
> +EXPORT_SYMBOL(ib_umem_next_phys_iter);