Re: [PATCH rdma-next v1 1/2] RDMA/umem: Don't hold mmap_sem for too long

On Tue, Jul 03, 2018 at 04:14:58PM -0600, Jason Gunthorpe wrote:
> On Sun, Jul 01, 2018 at 03:48:08PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> >
> > DMA mapping is a time-consuming operation and doesn't need to be
> > performed while the mmap_sem semaphore is held. Limit the scope of
> > that semaphore to the accounting part only.
> >
> > Signed-off-by: Huy Nguyen <huyn@xxxxxxxxxxxx>
> > Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > ---
> >  drivers/infiniband/core/umem.c | 22 +++++++++++++---------
> >  1 file changed, 13 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > index 498f59bb4989..5c7ba60848d5 100644
> > --- a/drivers/infiniband/core/umem.c
> > +++ b/drivers/infiniband/core/umem.c
> > @@ -84,7 +84,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >  	struct ib_umem *umem;
> >  	struct page **page_list;
> >  	struct vm_area_struct **vma_list;
> > -	unsigned long locked;
> >  	unsigned long lock_limit;
> >  	unsigned long cur_base;
> >  	unsigned long npages;
> > @@ -149,15 +148,16 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >
> >  	npages = ib_umem_num_pages(umem);
> >
> > -	down_write(&current->mm->mmap_sem);
> > -
> > -	locked     = npages + current->mm->pinned_vm;
> >  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >
> > -	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> > +	down_write(&current->mm->mmap_sem);
> > +	current->mm->pinned_vm += npages;
> > +	if ((current->mm->pinned_vm > lock_limit) && !capable(CAP_IPC_LOCK)) {
> > +		up_write(&current->mm->mmap_sem);
> >  		ret = -ENOMEM;
> >  		goto out;
> >  	}
> > +	up_write(&current->mm->mmap_sem);
> >
> >  	cur_base = addr & PAGE_MASK;
> >
> > @@ -177,11 +177,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >  	sg_list_start = umem->sg_head.sgl;
> >
> >  	while (npages) {
> > +		down_read(&current->mm->mmap_sem);
> >  		ret = get_user_pages_longterm(cur_base,
> >  				     min_t(unsigned long, npages,
> >  					   PAGE_SIZE / sizeof (struct page *)),
> >  				     gup_flags, page_list, vma_list);
>
> This is a bit wrong these days.. Now that this is
> get_user_pages_longterm() a NULL vma_list causes it to allocate, so
> the code before isn't right:
>
> 	/*
> 	 * if we can't alloc the vma_list, it's not so bad;
> 	 * just assume the memory is not hugetlb memory
> 	 */
> 	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
> 	if (!vma_list)
>
> As this now should be a hard failure.
>
> Arguably this should probably just split the page_list allocation in
> half and use half for vmas and half for pages..
>
> Not related to this patch, but since you are tidying this area..

get_user_pages_longterm() doesn't try to allocate a whole page, but only
the exact amount of memory needed, up to PAGE_SIZE, which according to
the code comment above can be less than what __get_free_page() returns.
However, I failed to find proof for that comment, and it seems that
__get_free_page() always returns a full page of PAGE_SIZE bytes.

I would recommend doing the proposed change in a separate patch, and
only after the i40iw and bnxt_re drivers are fixed to no longer rely
on hugetlb.
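
If/when that happens, a minimal sketch of the split you describe could
look like this (hypothetical and untested, just to illustrate the idea):

	page_list = (struct page **) __get_free_page(GFP_KERNEL);
	if (!page_list) {
		/* now a hard failure instead of the hugetlb fallback;
		 * unwind as for the existing early failures
		 */
		kfree(umem);
		return ERR_PTR(-ENOMEM);
	}
	/* second half of the same page holds the vma pointers */
	vma_list = (struct vm_area_struct **)
		   (page_list + PAGE_SIZE / (2 * sizeof(struct page *)));

with each get_user_pages_longterm() call then capped at
PAGE_SIZE / (2 * sizeof(struct page *)) entries per iteration.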

>
> > -
> > +		up_read(&current->mm->mmap_sem);
> >  		if (ret < 0)
> >  			goto out;
> >
> > @@ -189,12 +190,14 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >  		cur_base += ret * PAGE_SIZE;
> >  		npages   -= ret;
> >
> > +		down_read(&current->mm->mmap_sem);
> >  		for_each_sg(sg_list_start, sg, ret, i) {
>
> This seems placed wrong, the up_read was only two lines higher. Surely
> the entire 'while(npages)' should be holding the mmap_sem? Or at least
> each single iteration.. Maybe the loop needs to be moved into a
> function.

Not really, the down_read() protects access to the
is_vm_hugetlb_page(vma_list[i]) check. We don't need to protect npages,
because it can't change concurrently, as opposed to the VMAs and pages.
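
To spell out what each lock covers, here is a simplified sketch of the
loop as it looks with this patch applied (comments mine):

	while (npages) {
		/* mmap_sem is needed across the GUP call ... */
		down_read(&current->mm->mmap_sem);
		ret = get_user_pages_longterm(cur_base,
				min_t(unsigned long, npages,
				      PAGE_SIZE / sizeof(struct page *)),
				gup_flags, page_list, vma_list);
		up_read(&current->mm->mmap_sem);
		if (ret < 0)
			goto out;

		/* local bookkeeping, nothing here is shared */
		umem->npages += ret;
		cur_base += ret * PAGE_SIZE;
		npages -= ret;

		/* ... and across the walk dereferencing vma_list[i] */
		down_read(&current->mm->mmap_sem);
		for_each_sg(sg_list_start, sg, ret, i) {
			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
				umem->hugetlb = 0;
			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
		}
		up_read(&current->mm->mmap_sem);

		/* preparing for next loop */
		sg_list_start = sg;
	}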

>
> >  			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
> >  				umem->hugetlb = 0;
>
> But, what does this even get used for? Only bnxt_re and i40iw_verbs
> even read this output, and I'm rather skeptical that those drivers are
> using it in a proper way.
>
> The code in drivers/infiniband/hw/bnxt_re/ib_verbs.c around this sure
> looks wrong.
>
> And I have no idea what the code in i40iw_set_hugetlb_values() thinks
> it is doing calling find_vma() without locking, or using that result
> to make assumptions about the whole page array. That is clearly wrong
> too.
>
> Both drivers need to use the page array physical address based
> algorithm like the other drivers use.
>
> Arguably we should be doing a better job of this in core code, I
> think. Hopefuly the two driver authors (cc'd) will fix their drivers
> and we can delete the hugetlb value..

+1
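
For reference, the physical-address based detection needs nothing from
the VMAs, so no mmap_sem either. A rough sketch (hypothetical helper,
capped at HPAGE_SHIFT for simplicity; not actual core or driver code):

/*
 * Derive the best page shift purely from the pinned page array: a page
 * size of 2^shift is usable iff VA and PA agree in all bits below the
 * shift and every physical discontinuity falls on a boundary of that
 * size.
 */
static unsigned int umem_best_page_shift(struct ib_umem *umem, u64 va)
{
	struct scatterlist *sg;
	u64 mask = 0, expect = 0;
	int i;

	for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
		u64 pa = sg_phys(sg);

		/* collect every bit where VA and PA disagree */
		mask |= pa ^ va;
		/* a break in physical contiguity limits the page size
		 * to the alignment of the address where it lands
		 */
		if (i && pa != expect)
			mask |= va;
		expect = pa + PAGE_SIZE;
		va += PAGE_SIZE;
	}
	/* the lowest set bit bounds the usable page size */
	return mask ? min_t(unsigned int, __ffs64(mask), HPAGE_SHIFT)
		    : HPAGE_SHIFT;
}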

>
> Jason
