Re: [PATCH RFC 1/4] RDMA/umem: Minimize SG table entries

On Tue, Oct 23, 2018 at 06:55:45PM -0500, Shiraz Saleem wrote:
> On Mon, Oct 22, 2018 at 03:06:42PM -0600, Jason Gunthorpe wrote:
> > On Fri, Oct 19, 2018 at 06:34:06PM -0500, Shiraz Saleem wrote:
> > > Squash contiguous regions of PAGE_SIZE pages
> > > into a single SG entry as opposed to one
> > > SG entry per page. This reduces the SG table
> > > size and is friendliest to the IOMMU.
> > > 
> > > Suggested-by: Jason Gunthorpe <jgg@xxxxxxxx>
> > > Reviewed-by: Michael J. Ruhl <michael.j.ruhl@xxxxxxxxx>
> > > Signed-off-by: Shiraz Saleem <shiraz.saleem@xxxxxxxxx>
> > >  drivers/infiniband/core/umem.c | 66 ++++++++++++++++++++----------------------
> > >  1 file changed, 31 insertions(+), 35 deletions(-)
> 
> [..] 
> 
> > > +	page_list = kmalloc_array(npages, sizeof(*page_list), GFP_KERNEL);
> > 
> > Hurm.. The reason it was __get_free_page and a loop was to avoid a
> > high order allocation triggered by userspace.
> > 
> > i.e. if I want to make an MR over a 2G region, this requires a 4M
> > allocation of page pointers, which is too high an order to do via
> > kmalloc_array().
> > 
> > So I think this can't work; the page list has to be restricted in size
> > and we need to iterate...
> 
> Yes. I had the same concern, and it is one of the reasons I posted this patch set as an RFC.
> I was thinking of building a temporary linked list of page_list array objects (after pinning),
> each of which could hold a limited set of pages.
> We would then iterate this list to identify contiguous pages and squash them into SGL entries.

There is no reason to hold all the pages in memory; this can be done
via iteration..
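
Something along these lines, e.g. (a rough sketch only, not the posted
patch; pin_and_squash() and the pre-allocated worst-case table are made
up for illustration):

#include <linux/mm.h>
#include <linux/scatterlist.h>

/* One page worth of struct page pointers per pinning round. */
#define PIN_CHUNK_NR	(PAGE_SIZE / sizeof(struct page *))

/*
 * Pin the user range in fixed-size chunks and fold physically
 * contiguous runs into a single scatterlist entry.  'sgl' must point
 * at a table pre-allocated for the worst case (one entry per page);
 * unused tail entries are simply never written.  Error unwind of
 * already-pinned pages is omitted for brevity.  Caller holds mmap_sem.
 */
static int pin_and_squash(unsigned long uaddr, unsigned long npages,
			  struct scatterlist *sgl, unsigned int *nents)
{
	struct scatterlist *sg = sgl;
	struct page **chunk, *prev = NULL;
	unsigned long done = 0;
	long ret;
	int i;

	chunk = (struct page **)__get_free_page(GFP_KERNEL);
	if (!chunk)
		return -ENOMEM;

	*nents = 0;
	while (done < npages) {
		unsigned long nr = min_t(unsigned long, npages - done,
					 PIN_CHUNK_NR);

		ret = get_user_pages(uaddr + done * PAGE_SIZE, nr,
				     FOLL_WRITE, chunk, NULL);
		if (ret <= 0) {
			ret = ret ? ret : -EFAULT;
			goto out;
		}

		for (i = 0; i < ret; i++) {
			if (prev && page_to_pfn(chunk[i]) ==
				    page_to_pfn(prev) + 1) {
				/* Contiguous with the last page: grow the entry. */
				sg->length += PAGE_SIZE;
			} else {
				/* Start a new scatterlist entry. */
				if (prev)
					sg = sg_next(sg);
				sg_set_page(sg, chunk[i], PAGE_SIZE, 0);
				(*nents)++;
			}
			prev = chunk[i];
		}
		done += ret;
	}
	ret = 0;
out:
	free_page((unsigned long)chunk);
	return ret;
}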

> > So we at most over allocate by 1 page, and if that became critical we
> > could copy the last chain in the sg_table.
> > 
> > If sg_append_table_from_pages() needs more room in the table then it
> > would convert the last entry to a chain and allocate another table.
> >
> 
> Since sg_alloc_table_from_pages() can't really be used here, as the page_list size has
> to be restricted, I would prefer to defer solving the problem of over-allocating
> the SG table to a follow-on series,
> and do the combining of contiguous regions into SGL entries in this one.

Sure, I don't mind the overallocation; it is already the way things
are today.

Maybe we could trivially add an API to trim the sg_table?
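
Something this small might be enough for the unchained case
(hypothetical sketch; no such helper exists in scatterlist.h today):

#include <linux/scatterlist.h>

/*
 * Hypothetical trim helper.  Assumes a single-chunk (unchained) table;
 * a chained table would also need its trailing chunks freed.  The
 * unused tail stays allocated until sg_free_table(), which frees
 * based on orig_nents.
 */
static void sg_table_trim(struct sg_table *sgt, unsigned int nents_used)
{
	if (WARN_ON(!nents_used || nents_used > sgt->orig_nents))
		return;

	sg_mark_end(&sgt->sgl[nents_used - 1]);
	sgt->nents = nents_used;
}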

> > If the umem core starts producing large SGL entries won't all the
> > drivers break?
> 
> Hmmm..why?

I'm concerned something iterates over the SGL without using the 'for
each page' variant of the iterator.. It is worth doing a survey at
least.
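
i.e. a driver loop of this shape, which is exactly what the survey
would be hunting for (the function is made up, but the pattern is
common):

#include <linux/scatterlist.h>
#include <rdma/ib_umem.h>

/*
 * Correct only while every SG entry maps exactly one PAGE_SIZE page;
 * once umem coalesces contiguous pages this still walks nmap entries
 * but covers only a fraction of the MR.
 */
static void fill_page_list(struct ib_umem *umem, u64 *page_list)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
		page_list[i] = sg_dma_address(sg);
}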

> The code to break down each SGL entry into dma_addr + PAGE_SIZE*i (i
> = 0 to # of PAGE_SIZE pages in SGE) already exists, no? For instance,
> https://elixir.bootlin.com/linux/v4.8/source/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c#L918

Right, this is a good example; what is the point of this code?

umem->page_shift is always PAGE_SHIFT for umems that are not ODP, so
why isn't this just written with for_each_sg_page()?
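
i.e. the open-coded nested loop and the page iterator produce the same
PBL, so the former could just be the latter (helper names are made up,
not ocrdma's):

#include <linux/scatterlist.h>
#include <rdma/ib_umem.h>

/* Open-coded, as in the ocrdma example: one PBE per PAGE_SIZE chunk
 * of each DMA-mapped SG entry. */
static void build_pbl_nested(struct ib_umem *umem, u64 *pbl)
{
	struct scatterlist *sg;
	unsigned int i, pg;

	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
		unsigned int pages = sg_dma_len(sg) >> PAGE_SHIFT;

		for (pg = 0; pg < pages; pg++)
			*pbl++ = sg_dma_address(sg) + ((u64)pg << PAGE_SHIFT);
	}
}

/* Same result with the page iterator, with no per-driver arithmetic. */
static void build_pbl_iter(struct ib_umem *umem, u64 *pbl)
{
	struct sg_page_iter sg_iter;

	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->nmap, 0)
		*pbl++ = sg_page_iter_dma_address(&sg_iter);
}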

Jason


