Re: [PATCH rdma-next] RDMA/i40w: Hold read semaphore while looking after VMA

Jason Gunthorpe <jgg@xxxxxxxxxxxx> · Tue, 31 Jul 2018 08:31:10 -0600

On Tue, Jul 31, 2018 at 09:19:31AM -0500, Shiraz Saleem wrote:
> On Thu, Jul 26, 2018 at 10:46:09AM -0600, Jason Gunthorpe wrote:
> > On Thu, Jul 26, 2018 at 11:34:42AM -0500, Shiraz Saleem wrote:
> > > On Mon, Jul 23, 2018 at 04:29:04PM -0600, Jason Gunthorpe wrote:
> > > > On Mon, Jul 23, 2018 at 05:19:41PM -0500, Shiraz Saleem wrote:
> > > > > On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> > > > > > On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > > > > > > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > > > > > > From: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > > > > > > > 
> > > > > > > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > > > > > > 
> > > > > > > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > > > > > > Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > > > > > > >
> > > > > > > Thanks for fixing this bug.
> > > > > > > 
> > > > > > > Acked-by: Shiraz Saleem <shiraz.saleem@xxxxxxxxx> 
> > > > > > 
> > > > > > This code in the driver all looks wrong to me, it should not rely on
> > > > > > the huge pages flag and it shouldn't do this VM walking.
> > > > > >
> > > > > Hi Jason - Why is that we cant use umem->hugetlb to determine if
> > > > > MR is composed of huge pages? Isnt that why the flag was defined for
> > > > > in the first place. All the VMAs w.r.t the user-space buffer are
> > > > > scanned to make this determination in ib_umem_get.
> > > > > Isnt it better to use it than implement it in individual driver.
> > > > 
> > > > It was a mistake to have this flag and all drivers except i40iw and
> > > > bnxt don't use it. I wish to remove it.
> > > I see. What was the drawback to the approach?
> > 
> > It is just wongly designed.. page merging is about combining physical
> > addresses within a HW supported size list, not about matching the CPU
> > huge page size(s) to HW.
> 
> Well ok. That is a different way of looking at it.
> Our assumption was tied to the fact that the OS makes
> available only a few page sizes (depending on the CPU arch) and
> huge pages fall in that bracket. So it made sense for it be
> supported by core. But your approach seems to give more
> flexibility and gets rid of the locking. 

There are various patches that produce contiguous physical allocations
in various ways beyond the huge page size restriction. Drivers must
look at the physical pages to take advantage of that.

> > We should never even care about 'huge pages' in the VM sense, and
> > certainly never scan a VM, and it is buggy/racy to do so, I
> > think. Leon already noticed missing locking, but I think there is
> > worse here since we have desync'd with the lock that get_user_pages
> > was run under.
> 
> OK. At this point, we are looking to fix in i40iw with an algo that
> looks at physical page address.

Ideally I'd like this in the core code.. So if you are going to write
and test something new, lets do that.

> Not sure I follow the desync argument as the pages are pinned and given
> to us by the core. But if this is really a concern, there are other places
> where the VMA scan is done too.
>
>https://elixir.bootlin.com/linux/v4.18-rc7/source/drivers/infiniband/hw/mlx4/mr.c#L389

This is a bit tricky, but becausae it is done before get_user_pages,
and only influences the access flags, it is not racy in a way that
really matters. At worst a race would cause ib_umem_get to return fail
when maybe it shouldn't have.

> https://elixir.bootlin.com/linux/v4.18-rc6/source/drivers/infiniband/core/umem_odp.c#L341

ODP is supposed to not be racy with changes to the mm as it doesn't
use get_user_pages in the same way..

But not sure what this is doing, or what IB_ACCESS_HUGETLB is supposed
to be..

Also kind of looks wrong, wouldn't recommend copying it :)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html