On Mon, Oct 22, 2018 at 03:40:20PM -0600, Jason Gunthorpe wrote:
> On Fri, Oct 19, 2018 at 06:34:08PM -0500, Shiraz Saleem wrote:
> > This helper iterates the SG list and returns suitable
> > HW aligned DMA addresses within a driver supported
> > page size. The implementation is intended to work for
> > HW that support single page sizes or mixed page sizes
> > in an MR. This avoids the need for having driver specific
> > algorithms to achieve the same thing and redundant walks
> > of the SG list.
>
> [..]
>
> > +void ib_umem_start_phys_iter(struct ib_umem *umem,
> > +			     struct sg_phys_iter *sg_phys_iter)
> > +{
> > +	memset(sg_phys_iter, 0, sizeof(struct sg_phys_iter));
> > +	sg_phys_iter->sg = umem->sg_head.sgl;
> > +}
> > +EXPORT_SYMBOL(ib_umem_start_phys_iter);
>
> I think the supported_pgsz should be an input here, it doesn't make
> sense to change the sizes during iteration, does it? Would it even
> work right?

Yes. The sizes shouldn't change during iteration. I'll move it here.

> > +
> > +/**
> > + * ib_umem_next_phys_iter - SG list iterator that returns aligned HW address
> > + * @umem: umem struct
> > + * @sg_phys_iter: SG HW address iterator
> > + * @supported_pgsz: bitmask of HW supported page sizes
> > + *
> > + * This helper iterates over the SG list and returns the HW
> > + * address aligned to a supported HW page size.
> > + *
> > + * The algorithm differs slightly between HW that supports single
> > + * page sizes vs mixed page sizes in an MR. For example, if an
> > + * MR of size 4M-4K, starts at an offset PAGE_SIZE (ex: 4K) into
> > + * a 2M page; HW that supports multiple page sizes (ex: 4K, 2M)
> > + * would get 511 4K pages and one 2M page. Single page support
> > + * HW would get back two 2M pages or 1023 4K pages.
>
> That doesn't seem quite right, the first and last pages should always
> be the largest needed to get to alignment, as HW always has an
> offset/length sort of scheme to make this efficient.
>
> So I would expect 4M-4k to always return two 2M pages if the HW
> supports 2M, even if supports smaller sizes.
>
> For this to work the algorithm needs to know start/end...
>
> Maybe we should not support multiple page sizes yet, do we have HW
> that can even do that?

Sorry, I think my comment is confusing. There is a typo in my
description: it's not multiple page sizes but mixed page sizes. In
that case, I think it should be 511 4K pages and one 2M page for my
example of 4M-4K. This is how I understood it based on your
description here:
https://patchwork.kernel.org/patch/10499753/#22186847

For HW like i40iw, which can do only a single page size in an MR, we
would call ib_umem_find_single_pg_size first and it should return two
2M pages, except if we have to reduce the page size to 4K due to a
start offset misalignment between the user-space and physical
buffers. This algorithm would just use the best page size passed in.

> > +	unsigned long pg_mask, offset;
> > +	int pg_bit;
>
> unsigned

OK.

> > +
> > +	if (!sg_phys_iter->sg || !supported_pgsz)
> > +		return false;
>
> > +	if (sg_phys_iter->remaining) {
> > +		sg_phys_iter->phyaddr += sg_phys_iter->len;
> > +	} else {
> > +		sg_phys_iter->phyaddr = sg_dma_address(sg_phys_iter->sg);
> > +		sg_phys_iter->remaining = sg_dma_len(sg_phys_iter->sg);
>
> Why not put the
>
>    sg_phys_iter->sg = sg_next(sg_phys_iter->sg);
>
> Here?

I think we might have a problem moving it here. For example, if we
had an MR composed of 3 contiguous 4K pages starting at a 4K
boundary, that would be represented by a single SGE.

> +	sg_phys_iter->len = min_t(int, sg_phys_iter->remaining,
> +				  (1 << (pg_bit)) - offset);
> +	sg_phys_iter->remaining -= sg_phys_iter->len;

When we execute this code, pg_bit = 12, sg_phys_iter->len = 4K, and
sg_phys_iter->remaining = 8K. Since we moved the bumping of
sg_phys_iter->sg earlier, it would be NULL on the next iteration and
we would exit without processing the remaining 8K from the SGE.
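To make the failure concrete, here is a stand-alone user-space sketch
of the walk. It is illustrative only: the names mirror the patch but
this is not the kernel code, and offset handling is omitted since the
SGE in the example starts on a 4K boundary.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

struct sge {
	uint64_t addr;
	uint64_t len;
	struct sge *next;
};

struct phys_iter {
	struct sge *sg;
	uint64_t phyaddr;
	uint64_t len;
	uint64_t remaining;
};

static bool next_phys_iter(struct phys_iter *it, unsigned int pg_bit)
{
	uint64_t pg_size = 1ULL << pg_bit;

	if (!it->sg)			/* mirrors the patch's exit check */
		return false;

	if (it->remaining) {
		it->phyaddr += it->len;		/* keep walking this SGE */
	} else {
		it->phyaddr = it->sg->addr;	/* load a fresh SGE */
		it->remaining = it->sg->len;
		/* bumping it->sg = it->sg->next here is the problem:
		 * with one 12K SGE, the exit check above would see
		 * NULL on the second call with 8K still unconsumed */
	}

	it->len = it->remaining < pg_size ? it->remaining : pg_size;
	it->remaining -= it->len;

	if (!it->remaining)		/* SGE fully consumed: advance now */
		it->sg = it->sg->next;

	return true;
}

int main(void)
{
	/* one SGE: 12K (three contiguous 4K pages) at a 4K boundary */
	struct sge sge = { .addr = 0x1000, .len = 3 * 4096, .next = NULL };
	struct phys_iter it = { .sg = &sge };

	while (next_phys_iter(&it, 12))
		printf("chunk 0x%jx, len %ju\n",
		       (uintmax_t)it.phyaddr, (uintmax_t)it.len);
	return 0;
}

With the advance kept at the point where remaining hits zero, this
prints three 4K chunks at 0x1000, 0x2000 and 0x3000. With the advance
moved up into the else branch, the exit check sees sg == NULL on the
second call and the last 8K is silently dropped.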
> > +	}
> > +
> > +	/* Single page support in MR */
> > +	if (hweight_long(supported_pgsz) == 1) {
> > +		pg_bit = fls64(supported_pgsz) - 1;
>
> Then this could be precomputed once
>
> > +	} else {
> > +		/* Mixed page support in MR*/
> > +		pg_bit = ib_umem_find_pg_bit(sg_phys_iter->phyaddr,
> > +					     supported_pgsz);
> > +	}
> > +
> > +	/* page bit cannot be less than the lowest supported HW size */
> > +	if (WARN_ON(pg_bit < __ffs(supported_pgsz)))
> > +		return false;
>
> and this

I think there is an opportunity to pre-compute, especially for the
single page size support case.

Shiraz
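P.S. A rough user-space sketch of the pre-computation I have in mind,
assuming the page bit gets cached in the iterator when
ib_umem_start_phys_iter() runs; the struct layout and names below are
illustrative, not what the final patch would look like:

#include <stdio.h>
#include <stdint.h>

struct phys_iter {
	unsigned long supported_pgsz;	/* bitmask of HW page sizes */
	unsigned int pg_bit;		/* cached; valid for single-size HW */
	int mixed_pg_support;		/* more than one size bit set */
};

static void start_phys_iter(struct phys_iter *it,
			    unsigned long supported_pgsz)
{
	it->supported_pgsz = supported_pgsz;
	/* __builtin_popcountl() stands in for hweight_long() here */
	it->mixed_pg_support = __builtin_popcountl(supported_pgsz) > 1;
	if (supported_pgsz && !it->mixed_pg_support)
		/* single supported size: the fls64(pgsz) - 1 result is
		 * computed once at start instead of on every next() call */
		it->pg_bit = 63 - __builtin_clzll(supported_pgsz);
}

int main(void)
{
	struct phys_iter it = { 0 };

	start_phys_iter(&it, 1UL << 21);	/* HW supports only 2M */
	printf("mixed=%d pg_bit=%u\n", it.mixed_pg_support, it.pg_bit);
	return 0;
}

ib_umem_next_phys_iter() would then call ib_umem_find_pg_bit() only
on the mixed page size path.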