On Fri, Oct 19, 2018 at 06:34:06PM -0500, Shiraz Saleem wrote:
> Squash contiguous regions of PAGE_SIZE pages
> into a single SG entry as opposed to one
> SG entry per page. This reduces the SG table
> size and is friendliest to the IOMMU.
>
> Suggested-by: Jason Gunthorpe <jgg@xxxxxxxx>
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@xxxxxxxxx>
> Signed-off-by: Shiraz Saleem <shiraz.saleem@xxxxxxxxx>
> drivers/infiniband/core/umem.c | 66 ++++++++++++++++++++----------------------
> 1 file changed, 31 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index c6144df..486d6d7 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -39,6 +39,7 @@
>  #include <linux/export.h>
>  #include <linux/hugetlb.h>
>  #include <linux/slab.h>
> +#include <linux/pagemap.h>
>  #include <rdma/ib_umem_odp.h>
>
>  #include "uverbs.h"
> @@ -46,18 +47,16 @@
>
>  static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
>  {
> -        struct scatterlist *sg;
> +        struct sg_page_iter sg_iter;
>          struct page *page;
> -        int i;
>
>          if (umem->nmap > 0)
>                  ib_dma_unmap_sg(dev, umem->sg_head.sgl,
> -                                umem->npages,
> +                                umem->sg_head.orig_nents,
>                                  DMA_BIDIRECTIONAL);
>
> -        for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
> -
> -                page = sg_page(sg);
> +        for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_head.orig_nents, 0) {
> +                page = sg_page_iter_page(&sg_iter);
>                  if (!PageDirty(page) && umem->writable && dirty)
>                          set_page_dirty_lock(page);
>                  put_page(page);
> @@ -92,7 +91,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>          int ret;
>          int i;
>          unsigned long dma_attrs = 0;
> -        struct scatterlist *sg, *sg_list_start;
>          unsigned int gup_flags = FOLL_WRITE;
>
>          if (dmasync)
> @@ -138,7 +136,13 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>          /* We assume the memory is from hugetlb until proved otherwise */
>          umem->hugetlb = 1;
>
> -        page_list = (struct page **) __get_free_page(GFP_KERNEL);
> +        npages = ib_umem_num_pages(umem);
> +        if (npages == 0 || npages > UINT_MAX) {
> +                ret = -EINVAL;
> +                goto umem_kfree;
> +        }
> +
> +        page_list = kmalloc_array(npages, sizeof(*page_list),
> +                                  GFP_KERNEL);

Hurm.. The reason it was __get_free_page and a loop was to avoid a
high-order allocation triggered by userspace.

i.e. if I want to make an MR over a 2G region, this requires a 4M
allocation of page pointers, which is too high to do via
kmalloc_array().

So I think this can't work: the page list has to be restricted in size
and we need to iterate...
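To put rough numbers on that (assuming 4 KiB pages and 8-byte
pointers, so the exact figures vary by arch):

  2 GiB / 4 KiB                    = 524,288 pages
  524,288 * sizeof(struct page *)  = 524,288 * 8 B = 4 MiB

4 MiB of physically contiguous memory is an order-10 allocation, while
the old scheme capped the array at a single page, i.e.
PAGE_SIZE / sizeof(struct page *) = 512 page pointers per
get_user_pages_longterm() call.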
>          while (npages) {
>                  down_read(&mm->mmap_sem);
>                  ret = get_user_pages_longterm(cur_base,
>                                       min_t(unsigned long, npages,
>                                             PAGE_SIZE / sizeof (struct page *)),

Why keep the min? This was just to match the size of the array..

> -                                     gup_flags, page_list, vma_list);
> +                                     gup_flags, page_list + umem->npages, vma_list);
>                  if (ret < 0) {
>                          up_read(&mm->mmap_sem);
> -                        goto umem_release;
> +                        release_pages(page_list, umem->npages);
> +                        goto vma;
>                  }
>
>                  umem->npages += ret;
>                  cur_base += ret * PAGE_SIZE;
>                  npages -= ret;
>
> -                /* Continue to hold the mmap_sem as vma_list access
> -                 * needs to be protected.
> -                 */
> -                for_each_sg(sg_list_start, sg, ret, i) {
> +                for(i = 0; i < ret && umem->hugetlb; i++) {
>                          if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
>                                  umem->hugetlb = 0;
> -
> -                        sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
>                  }
>                  up_read(&mm->mmap_sem);
> +        }
>
> -                /* preparing for next loop */
> -                sg_list_start = sg;
> +        ret = sg_alloc_table_from_pages(&umem->sg_head,
> +                                        page_list,
> +                                        umem->npages,
> +                                        0,
> +                                        umem->npages << PAGE_SHIFT,
> +                                        GFP_KERNEL);

Yep, this is nice..

But I wonder if this needs to copy what i915 is doing here in
i915_sg_segment_size() and __i915_gem_userptr_alloc_pages() - i.e.
restrict the size if the swiotlb is enabled. Confusing as to why
though..

Anyhow, I think for this to be really optimal, some new APIs in
scatterlist.h will be needed..

Something like a 'sg_append_table_from_pages' helper. Then the flow
would be:

  sg_alloc_table(&table, min(npages, SG_MAX_SINGLE_ALLOC))
  while (npages) {
     get_user_pages(page_list, ...)
     sg_append_table_from_pages(&table, page_list, ...,
                                min(npages, SG_MAX_SINGLE_ALLOC));
     npages -= page_list_len;
  }

So we over-allocate by at most one page, and if that became critical
we could copy the last chain in the sg_table.

If sg_append_table_from_pages() needs more room in the table, then it
would convert the last entry to a chain and allocate another table.

This bounds the excess memory required and doesn't need a high-order
allocation to invoke get_user_pages.

Also, this series is out of order, right? If the umem core starts
producing large SGL entries, won't all the drivers break? The drivers
need to be able to break them down before this is introduced.

Jason
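P.S. a rough sketch of the kind of break-down each driver would need
before this lands. The names here are made up - set_hw_page_entry() is
a stand-in for whatever the driver actually does with each 4k page
today (write a PBL entry, etc.) - so treat this as illustrative only,
not code from any in-tree driver:

  static void break_down_sges(struct ib_device *dev, struct ib_umem *umem)
  {
          struct scatterlist *sg;
          int i;

          for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
                  dma_addr_t addr = sg_dma_address(sg);
                  unsigned int len = sg_dma_len(sg);

                  /* An SGE may now cover many contiguous pages, so walk
                   * it in PAGE_SIZE steps instead of assuming one page
                   * per entry; umem SGEs are whole pages, so len is a
                   * multiple of PAGE_SIZE.
                   */
                  while (len) {
                          set_hw_page_entry(dev, addr);
                          addr += PAGE_SIZE;
                          len -= PAGE_SIZE;
                  }
          }
  }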