On Tue, Jul 31, 2018 at 08:31:10AM -0600, Jason Gunthorpe wrote:
> On Tue, Jul 31, 2018 at 09:19:31AM -0500, Shiraz Saleem wrote:
> > On Thu, Jul 26, 2018 at 10:46:09AM -0600, Jason Gunthorpe wrote:
> > > On Thu, Jul 26, 2018 at 11:34:42AM -0500, Shiraz Saleem wrote:
> > > > On Mon, Jul 23, 2018 at 04:29:04PM -0600, Jason Gunthorpe wrote:
> > > > > On Mon, Jul 23, 2018 at 05:19:41PM -0500, Shiraz Saleem wrote:
> > > > > > On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> > > > > > > On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > > > > > > > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > > > > > > > From: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > > > > > > > >
> > > > > > > > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > > > > > > >
> > > > > > > > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > > > > > > > Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > > > > > > >
> > > > > > > > Thanks for fixing this bug.
> > > > > > > >
> > > > > > > > Acked-by: Shiraz Saleem <shiraz.saleem@xxxxxxxxx>
> > > > > > >
> > > > > > > This code in the driver all looks wrong to me; it should not rely on
> > > > > > > the huge pages flag and it shouldn't do this VM walking.
> > > > > >
> > > > > > Hi Jason - Why is it that we can't use umem->hugetlb to determine if an
> > > > > > MR is composed of huge pages? Isn't that why the flag was defined in
> > > > > > the first place? All the VMAs w.r.t. the user-space buffer are
> > > > > > scanned to make this determination in ib_umem_get.
> > > > > > Isn't it better to use that than to implement it in individual drivers?
> > > > >
> > > > > It was a mistake to have this flag, and all drivers except i40iw and
> > > > > bnxt don't use it. I wish to remove it.
> > > >
> > > > I see. What was the drawback to the approach?
> > >
> > > It is just wrongly designed.. page merging is about combining physical
> > > addresses within a HW supported size list, not about matching the CPU
> > > huge page size(s) to HW.
> >
> > Well ok. That is a different way of looking at it.
> > Our assumption was tied to the fact that the OS makes
> > available only a few page sizes (depending on the CPU arch) and
> > huge pages fall in that bracket. So it made sense for it to be
> > supported by the core. But your approach seems to give more
> > flexibility and gets rid of the locking.
>
> There are various patches that produce contiguous physical allocations
> in various ways beyond the huge page size restriction. Drivers must
> look at the physical pages to take advantage of that.
>
> > > We should never even care about 'huge pages' in the VM sense, and
> > > certainly never scan a VM, and it is buggy/racy to do so, I
> > > think. Leon already noticed missing locking, but I think there is
> > > worse here since we have desync'd with the lock that get_user_pages
> > > was run under.
> >
> > OK. At this point, we are looking to fix this in i40iw with an algo that
> > looks at the physical page addresses.
>
> Ideally I'd like this in the core code.. So if you are going to write
> and test something new, let's do that.

Ok. As you suggested in an earlier thread, we could have drivers pass in a
bitmap of supported HW page sizes to ib_umem_get, and it would return a
selected size along with whether the MR is contiguous or not.

With respect to the algo in ib_umem_get, I am considering 2 options --

1. Scan the entire page list returned after pinning and check whether it
satisfies a HW page size (starting with the default PAGE_SIZE and moving
higher). The page list and the selected HW size are then used to size the
SG table and build the minimal SG list, depending on the HW size. Similar
to sg_alloc_table_from_pages().

2. Leave the current pinning and SG list creation the same.
https://elixir.bootlin.com/linux/v4.18/source/drivers/infiniband/core/umem.c#L192

Scan the SG entries to get the page list and check whether it satisfies a
HW page size. Rebuild a new SG list (or collapse the current list) to have
minimal entries if a HW size (> PAGE_SIZE) is selected.

Thoughts?

Additionally, a question I have is what HW size we select for a contiguous
MR. For example, if a driver says it supports 4K, 64K and 1G, and we get
an MR of 128K that is contiguous, what does the core select for the HW
size, since all 3 can be done? Maybe the largest supported HW size that is
<= 1/2 of the MR len? So in this case, it would be 64K.

Shiraz

> > Not sure I follow the desync argument, as the pages are pinned and given
> > to us by the core. But if this is really a concern, there are other places
> > where the VMA scan is done too.
> >
> > https://elixir.bootlin.com/linux/v4.18-rc7/source/drivers/infiniband/hw/mlx4/mr.c#L389
>
> This is a bit tricky, but because it is done before get_user_pages,
> and only influences the access flags, it is not racy in a way that
> really matters. At worst a race would cause ib_umem_get to return fail
> when maybe it shouldn't have.
>
> > https://elixir.bootlin.com/linux/v4.18-rc6/source/drivers/infiniband/core/umem_odp.c#L341
>
> ODP is supposed to not be racy with changes to the mm as it doesn't
> use get_user_pages in the same way..
>
> But not sure what this is doing, or what IB_ACCESS_HUGETLB is supposed
> to be..
>
> Also kind of looks wrong, wouldn't recommend copying it :)
>
> Jason