On Sun, Jul 01, 2018 at 03:48:08PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@xxxxxxxxxxxx>
>
> DMA mapping is a time-consuming operation and doesn't need to be
> performed while the mmap_sem semaphore is held. Limit the scope of
> that semaphore to the accounting part only.
>
> Signed-off-by: Huy Nguyen <huyn@xxxxxxxxxxxx>
> Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> ---
>  drivers/infiniband/core/umem.c | 22 +++++++++++++---------
>  1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 498f59bb4989..5c7ba60848d5 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -84,7 +84,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  	struct ib_umem *umem;
>  	struct page **page_list;
>  	struct vm_area_struct **vma_list;
> -	unsigned long locked;
>  	unsigned long lock_limit;
>  	unsigned long cur_base;
>  	unsigned long npages;
> @@ -149,15 +148,16 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>
>  	npages = ib_umem_num_pages(umem);
>
> -	down_write(&current->mm->mmap_sem);
> -
> -	locked = npages + current->mm->pinned_vm;
>  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>
> -	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> +	down_write(&current->mm->mmap_sem);
> +	current->mm->pinned_vm += npages;
> +	if ((current->mm->pinned_vm > lock_limit) && !capable(CAP_IPC_LOCK)) {
> +		up_write(&current->mm->mmap_sem);
>  		ret = -ENOMEM;
>  		goto out;
>  	}
> +	up_write(&current->mm->mmap_sem);
>
>  	cur_base = addr & PAGE_MASK;
>
> @@ -177,11 +177,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  	sg_list_start = umem->sg_head.sgl;
>
>  	while (npages) {
> +		down_read(&current->mm->mmap_sem);
>  		ret = get_user_pages_longterm(cur_base,
>  				     min_t(unsigned long, npages,
>  					   PAGE_SIZE / sizeof (struct page *)),
>  				     gup_flags, page_list, vma_list);

This is a bit wrong these days..
Now that this is get_user_pages_longterm() a NULL vma_list causes it
to allocate, so the code before isn't right:

	/*
	 * if we can't alloc the vma_list, it's not so bad;
	 * just assume the memory is not hugetlb memory
	 */
	vma_list = (struct vm_area_struct **)
		__get_free_page(GFP_KERNEL);
	if (!vma_list)

As this now should be a hard failure.

Arguably this should probably just split the page_list allocation in
half and use half for vmas and half for pages..

Not related to this patch, but since you are tidying this area..

> -
> +		up_read(&current->mm->mmap_sem);
>  		if (ret < 0)
>  			goto out;
>
> @@ -189,12 +190,14 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  		cur_base += ret * PAGE_SIZE;
>  		npages -= ret;
>
> +		down_read(&current->mm->mmap_sem);
>  		for_each_sg(sg_list_start, sg, ret, i) {

This seems placed wrong, the up_read was only two lines higher.

Surely the entire 'while(npages)' should be holding the mmap_sem? Or
at least each single iteration.. Maybe the loop needs to be moved into
a function.

>  			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
>  				umem->hugetlb = 0;

But, what does this even get used for? Only bnxt_re and i40iw_verbs
even read this output, and I'm rather skeptical that those drivers are
using it in a proper way.

The code in drivers/infiniband/hw/bnxt_re/ib_verbs.c around this sure
looks wrong. And I have no idea what the code in
i40iw_set_hugetlb_values() thinks it is doing calling find_vma()
without locking, or using that result to make assumptions about the
whole page array. That is clearly wrong too.

Both drivers need to use the page array physical address based
algorithm like the other drivers use. Arguably we should be doing a
better job of this in core code, I think.

Hopefully the two driver authors (cc'd) will fix their drivers and we
can delete the hugetlb value..
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html