On Sun, Jul 01, 2018 at 03:48:08PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@xxxxxxxxxxxx>
>
> DMA mapping is a time-consuming operation and doesn't need to be
> performed while the mmap_sem semaphore is held. Limit the scope of
> that semaphore to the accounting part only.
>
> Signed-off-by: Huy Nguyen <huyn@xxxxxxxxxxxx>
> Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> ---
>  drivers/infiniband/core/umem.c | 22 +++++++++++++---------
>  1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 498f59bb4989..5c7ba60848d5 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -84,7 +84,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  	struct ib_umem *umem;
>  	struct page **page_list;
>  	struct vm_area_struct **vma_list;
> -	unsigned long locked;
>  	unsigned long lock_limit;
>  	unsigned long cur_base;
>  	unsigned long npages;
> @@ -149,15 +148,16 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>
>  	npages = ib_umem_num_pages(umem);
>
> -	down_write(&current->mm->mmap_sem);
> -
> -	locked = npages + current->mm->pinned_vm;
>  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>
> -	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> +	down_write(&current->mm->mmap_sem);
> +	current->mm->pinned_vm += npages;
> +	if ((current->mm->pinned_vm > lock_limit) && !capable(CAP_IPC_LOCK)) {
> +		up_write(&current->mm->mmap_sem);
>  		ret = -ENOMEM;
>  		goto out;
>  	}
> +	up_write(&current->mm->mmap_sem);
>
>  	cur_base = addr & PAGE_MASK;
>
> @@ -177,11 +177,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  	sg_list_start = umem->sg_head.sgl;
>
>  	while (npages) {
> +		down_read(&current->mm->mmap_sem);
>  		ret = get_user_pages_longterm(cur_base,
>  				     min_t(unsigned long, npages,
>  					   PAGE_SIZE / sizeof (struct page *)),
>  				     gup_flags, page_list, vma_list);

This is a bit wrong these days..
Now that this is get_user_pages_longterm() a NULL vma_list causes it
to allocate, so the code before isn't right:

	/*
	 * if we can't alloc the vma_list, it's not so bad;
	 * just assume the memory is not hugetlb memory
	 */
	vma_list = (struct vm_area_struct **)
		__get_free_page(GFP_KERNEL);
	if (!vma_list)

As this now should be a hard failure.

Arguably this should probably just split the page_list allocation in
half and use half for vmas and half for pages..

Not related to this patch, but since you are tidying this area..

> -
> +		up_read(&current->mm->mmap_sem);
>  		if (ret < 0)
>  			goto out;
>
> @@ -189,12 +190,14 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  		cur_base += ret * PAGE_SIZE;
>  		npages -= ret;
>
> +		down_read(&current->mm->mmap_sem);
>  		for_each_sg(sg_list_start, sg, ret, i) {

This seems placed wrong, the up_read was only two lines higher.

Surely the entire 'while(npages)' should be holding the mmap_sem? Or
at least each single iteration.. Maybe the loop needs to be moved into
a function.

>  			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
>  				umem->hugetlb = 0;

But, what does this even get used for? Only bnxt_re and i40iw_verbs
even read this output, and I'm rather skeptical that those drivers are
using it in a proper way.

The code in drivers/infiniband/hw/bnxt_re/ib_verbs.c around this sure
looks wrong. And I have no idea what the code in
i40iw_set_hugetlb_values() thinks it is doing calling find_vma()
without locking, or using that result to make assumptions about the
whole page array. That is clearly wrong too.

Both drivers need to use the page array physical address based
algorithm like the other drivers use. Arguably we should be doing a
better job of this in core code, I think.

Hopefully the two driver authors (cc'd) will fix their drivers and we
can delete the hugetlb value..
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html