Re: [PATCH rdma-next v4 10/12] RDMA/efa: Add EFA verbs implementation

On Thu, Mar 28, 2019 at 07:20:33PM +0200, Gal Pressman wrote:
> On 28-Mar-19 15:08, Jason Gunthorpe wrote:
> > On Thu, Mar 28, 2019 at 02:39:30PM +0200, Gal Pressman wrote:
> > 
> >> +static int umem_to_page_list(struct efa_dev *dev,
> >> +			     struct ib_umem *umem,
> >> +			     u64 *page_list,
> >> +			     u32 hp_cnt,
> >> +			     u8 hp_shift)
> >> +{
> >> +	u32 pages_in_hp = BIT(hp_shift - PAGE_SHIFT);
> >> +	unsigned int page_idx = 0;
> >> +	unsigned int hp_idx = 0;
> >> +	struct scatterlist *sg;
> >> +	unsigned int entry;
> >> +
> >> +	if (umem->page_shift != PAGE_SHIFT) {
> >> +		ibdev_dbg(&dev->ibdev,
> >> +			  "umem invalid page shift %d\n", umem->page_shift);
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	ibdev_dbg(&dev->ibdev, "hp_cnt[%u], pages_in_hp[%u]\n",
> >> +		  hp_cnt, pages_in_hp);
> >> +
> >> +	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
> >> +		if (sg_dma_len(sg) != PAGE_SIZE) {
> >> +			ibdev_dbg(&dev->ibdev,
> >> +				  "sg_dma_len[%u] != PAGE_SIZE[%lu]\n",
> >> +				  sg_dma_len(sg), PAGE_SIZE);
> >> +			return -EINVAL;
> >> +		}
> > 
> > Again, this needs to work with Shiraz's work, drivers can't assume the
> > umem SGL has PAGE_SIZE chunks. Use for_each_sg_dma_page
> 
> Thanks, I have a couple of questions:
> 
> I'm not entirely up to date with Shiraz's work, wanna make sure I understand:
> Currently the umem sgl is built out of PAGE_SIZE elements. The reason we can't
> assume each element is PAGE_SIZE long is because the DMA mapping (iommu) might
> combine multiple elements or because Shiraz's work is going to combine the
> elements manually?

That is the existing reason, but this patch:

https://patchwork.kernel.org/patch/10875419/

just directly builds larger SGL entries.

> Reading the documentation, sg_page_iter kinda looks like it covers
> sg_dma_page_iter functionality? Don't they both retrieve the DMA
> address page by page? Looks like the only difference is that in
> sg_page_iter I can get the struct page as well?

No.

An SGL contains two lists of potentially different sizes: a 'DMA mapped'
list and a 'CPU page list'. The act of DMA mapping creates the first
list.

These lists cannot be intermixed: if you iterate over the CPU page list
you can't get DMA addresses, and vice versa.
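For reference, a rough (not compile-tested) sketch of what the suggested
change could look like in umem_to_page_list, walking the DMA-mapped list
in PAGE_SIZE steps via for_each_sg_dma_page() instead of assuming each
sg entry is exactly PAGE_SIZE. The loop body here is an assumption,
recording only the DMA address of the first small page of each huge
page:

```
	/* sketch only: walk the DMA side regardless of how the IOMMU
	 * coalesced the mapping; works even when sg_dma_len(sg) > PAGE_SIZE
	 */
	struct sg_dma_page_iter sg_iter;
	u32 pages_in_hp = BIT(hp_shift - PAGE_SHIFT);
	unsigned int hp_idx = 0, page_idx = 0;

	for_each_sg_dma_page(umem->sg_head.sgl, &sg_iter, umem->nmap, 0) {
		/* record the base DMA address of each huge page chunk */
		if (page_idx % pages_in_hp == 0)
			page_list[hp_idx++] = sg_page_iter_dma_address(&sg_iter);
		page_idx++;
	}
```

This also makes the "sg_dma_len(sg) != PAGE_SIZE" rejection above
unnecessary, since the iterator handles arbitrarily sized DMA segments.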


> >> +			   unsigned long max_page_shift,
> >> +			   int *count, u8 *shift, u32 *ncont)
> >> +{
> >> +	unsigned long page_shift = umem->page_shift;
> >> +	struct scatterlist *sg;
> >> +	u64 base = ~0, p = 0;
> >> +	unsigned long tmp;
> >> +	unsigned long m;
> >> +	u64 len, pfn;
> >> +	int i = 0;
> >> +	int entry;
> >> +
> >> +	addr = addr >> page_shift;
> >> +	tmp = (unsigned long)addr;
> >> +	m = find_first_bit(&tmp, BITS_PER_LONG);
> >> +	if (max_page_shift)
> >> +		m = min_t(unsigned long, max_page_shift - page_shift, m);
> >> +
> >> +	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
> >> +		len = sg_dma_len(sg) >> page_shift;
> > 
> > Shouldn't assume the offset is zero here.
> 
> Because elements are not necessarily PAGE_SIZE?
> If page combining is already done by ib_core, this whole function can probably
> be deleted.

If you test and ack the patch above then page combining will be done
by the ib_core tomorrow :)

> > This whole function should really better match what Shiraz is
> > doing. It looks like it is just ib_umem_find_single_pg_size() ??
> 
> Yea, I think ib_umem_find_single_pg_size() could probably replace this whole
> function, this work isn't merged yet though right?

Not yet, but if you test it, confirm it works, and you need it, then it
can be accelerated.

Jason


