On Mon, Oct 22, 2018 at 03:06:42PM -0600, Jason Gunthorpe wrote:
> On Fri, Oct 19, 2018 at 06:34:06PM -0500, Shiraz Saleem wrote:
> > Squash contiguous regions of PAGE_SIZE pages
> > into a single SG entry as opposed to one
> > SG entry per page. This reduces the SG table
> > size and is friendliest to the IOMMU.
> >
> > Suggested-by: Jason Gunthorpe <jgg@xxxxxxxx>
> > Reviewed-by: Michael J. Ruhl <michael.j.ruhl@xxxxxxxxx>
> > Signed-off-by: Shiraz Saleem <shiraz.saleem@xxxxxxxxx>
> >  drivers/infiniband/core/umem.c | 66 ++++++++++++++++++++----------------------
> >  1 file changed, 31 insertions(+), 35 deletions(-)

[..]

> > +	page_list = kmalloc_array(npages, sizeof(*page_list),
> > +				  GFP_KERNEL);
>
> Hurm.. The reason it was __get_free_page and a loop was to avoid a
> high order allocation triggered by userspace.
>
> ie if I want to make an MR over a 2G region this requires a 4M
> allocation of page pointers, which is too high to do via
> kmalloc_array().
>
> So I think this can't work, the page list has to be restricted in size
> and we need to iterate...

Yes. I had this same concern, and it is one of the reasons I posted this
patch set as an RFC. I was thinking of building a temporary linked list of
page_list array objects (after pinning), each of which could hold a limited
number of pages. We would then iterate this list to identify contiguous
pages and squash them into SGL entries.

[..]

> > +	ret = sg_alloc_table_from_pages(&umem->sg_head,
> > +					page_list,
> > +					umem->npages,
> > +					0,
> > +					umem->npages << PAGE_SHIFT,
> > +					GFP_KERNEL);
>
> Yep, this is nice.. But I wonder if this needs to copy what i915 is
> doing here in i915_sg_segment_size() and
> __i915_gem_userptr_alloc_pages() - ie restrict the size if the swiotlb
> is enabled. Confusing as to why though..
>
> Anyhow, I think for this to be really optimal, some new APIs in
> scatterlist.h will be needed.. Something like
> a 'sg_append_table_from_pages' helper.
>
> Then the flow would be:
>
>  sg_alloc_table(&table, min(npages, SG_MAX_SINGLE_ALLOC))
>  while (npages) {
>       get_user_pages(page_list, ...)
>       sg_append_table_from_pages(&table, page_list, ...,
>                                  min(npages, SG_MAX_SINGLE_ALLOC));
>       npages -= page_list_len;
>  }
>
> So we at most over-allocate by 1 page, and if that became critical we
> could copy the last chain in the sg_table.
>
> If sg_append_table_from_pages() needs more room in the table then it
> would convert the last entry to a chain and allocate another table.

Since sg_alloc_table_from_pages() can't really be used here, as the
page_list size has to be restricted, I would prefer to defer solving the
problem of over-allocating the SG table to a follow-on series, and to do
the combining of contiguous regions into SGL entries in this one.

> This bounds the excess memory required and doesn't need a high order
> allocation to invoke get_user_pages.
>
> Also, this series is out of order, right?
>
> If the umem core starts producing large SGL entries won't all the
> drivers break?

Hmmm.. why? The code to break down each SGL entry into dma_addr +
PAGE_SIZE * i (for i = 0 to the number of PAGE_SIZE pages in the SGE)
already exists, no? For instance,
https://elixir.bootlin.com/linux/v4.8/source/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c#L918

> The drivers need to be able to break them down before this is
> introduced.
>
> Jason
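
For what it's worth, here is roughly the shape of the bounded pinning loop
I have in mind. This is an untested sketch: error handling, the gup flags
and the unpinning on failure are elided, and cur_base/npages are assumed
to come from the surrounding umem code as they do today.

  struct page **page_list;
  unsigned long chunk = PAGE_SIZE / sizeof(struct page *);
  long pinned;

  /* One page worth of page pointers (512 on x86-64), so userspace
   * can never trigger a high-order allocation here. */
  page_list = (struct page **)__get_free_page(GFP_KERNEL);
  if (!page_list)
          return -ENOMEM;

  while (npages) {
          pinned = get_user_pages(cur_base,
                                  min_t(unsigned long, npages, chunk),
                                  gup_flags, page_list, NULL);
          if (pinned <= 0)
                  break;  /* real code would unwind the pins and fail */

          /* squash this chunk into the SG table, see next sketch */

          cur_base += pinned * PAGE_SIZE;
          npages -= pinned;
  }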
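
The squashing over each pinned chunk could then look something like the
following. Again untested; umem_sg_squash() is a made-up name, the sgl is
assumed to be a single unchained array (so the pointer arithmetic at the
end is valid), and a real version would have to remember the last pfn
across chunks so that runs spanning two pinning iterations still merge.

  #include <linux/mm.h>
  #include <linux/scatterlist.h>

  /* Collapse a chunk of pinned pages into as few SG entries as
   * possible: grow the current entry while the pages remain
   * physically contiguous, open a new one otherwise. Returns the
   * number of entries consumed. */
  static int umem_sg_squash(struct scatterlist *sgl,
                            struct page **page_list,
                            unsigned long npages)
  {
          struct scatterlist *sg = sgl;
          unsigned long i;

          if (!npages)
                  return 0;

          sg_set_page(sg, page_list[0], PAGE_SIZE, 0);
          for (i = 1; i < npages; i++) {
                  if (page_to_pfn(page_list[i]) ==
                      page_to_pfn(page_list[i - 1]) + 1) {
                          /* contiguous: extend the current entry */
                          sg->length += PAGE_SIZE;
                  } else {
                          /* discontiguity: start a new entry */
                          sg++;
                          sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
                  }
          }
          return sg - sgl + 1;
  }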
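
As for the drivers, the breakdown of a large SGE is just the usual
for_each_sg() walk plus an inner loop over the dma length, as in the
ocrdma code linked above. A minimal illustration, where build_pbe() is a
stand-in for whatever per-page work a given driver does:

  /* Emit one PAGE_SIZE-aligned dma address for every page covered
   * by each (possibly multi-page) SG entry. */
  static void build_page_list(struct scatterlist *sgl, int nents)
  {
          struct scatterlist *sg;
          unsigned int off;
          int i;

          for_each_sg(sgl, sg, nents, i)
                  for (off = 0; off < sg_dma_len(sg); off += PAGE_SIZE)
                          build_pbe(sg_dma_address(sg) + off);
  }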