On Tue, Jul 31, 2018 at 08:31:10AM -0600, Jason Gunthorpe wrote:
> On Tue, Jul 31, 2018 at 09:19:31AM -0500, Shiraz Saleem wrote:
> > On Thu, Jul 26, 2018 at 10:46:09AM -0600, Jason Gunthorpe wrote:
> > > On Thu, Jul 26, 2018 at 11:34:42AM -0500, Shiraz Saleem wrote:
> > > > On Mon, Jul 23, 2018 at 04:29:04PM -0600, Jason Gunthorpe wrote:
> > > > > On Mon, Jul 23, 2018 at 05:19:41PM -0500, Shiraz Saleem wrote:
> > > > > > On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> > > > > > > On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > > > > > > > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > > > > > > > From: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > > > > > > > >
> > > > > > > > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > > > > > > >
> > > > > > > > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > > > > > > > Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > > > > > > >
> > > > > > > > Thanks for fixing this bug.
> > > > > > > >
> > > > > > > > Acked-by: Shiraz Saleem <shiraz.saleem@xxxxxxxxx>
> > > > > > >
> > > > > > > This code in the driver all looks wrong to me; it should not rely on
> > > > > > > the huge pages flag and it shouldn't do this VM walking.
> > > > > >
> > > > > > Hi Jason - Why is it that we can't use umem->hugetlb to determine if an
> > > > > > MR is composed of huge pages? Isn't that why the flag was defined in
> > > > > > the first place? All the VMAs w.r.t. the user-space buffer are
> > > > > > scanned to make this determination in ib_umem_get.
> > > > > > Isn't it better to use that than to implement it in individual drivers?
> > > > >
> > > > > It was a mistake to have this flag, and all drivers except i40iw and
> > > > > bnxt don't use it. I wish to remove it.
> > > >
> > > > I see. What was the drawback to the approach?
> > >
> > > It is just wrongly designed.. page merging is about combining physical
> > > addresses within a HW supported size list, not about matching the CPU
> > > huge page size(s) to HW.
> >
> > Well ok. That is a different way of looking at it.
> > Our assumption was tied to the fact that the OS makes
> > available only a few page sizes (depending on the CPU arch) and
> > huge pages fall in that bracket. So it made sense for it to be
> > supported by the core. But your approach seems to give more
> > flexibility and gets rid of the locking.
>
> There are various patches that produce contiguous physical allocations
> in various ways beyond the huge page size restriction. Drivers must
> look at the physical pages to take advantage of that.
>
> > > We should never even care about 'huge pages' in the VM sense, and
> > > certainly never scan a VM, and it is buggy/racy to do so, I
> > > think. Leon already noticed missing locking, but I think there is
> > > worse here since we have desync'd with the lock that get_user_pages
> > > was run under.
> >
> > OK. At this point, we are looking to fix this in i40iw with an algo that
> > looks at the physical page addresses.
>
> Ideally I'd like this in the core code.. So if you are going to write
> and test something new, let's do that.

Ok. As you suggested in an earlier thread, we could have drivers pass in a
bitmap of supported HW page sizes to ib_umem_get, and it would return a
selected size along with whether the MR is contiguous or not.

With respect to the algo in ib_umem_get, I am considering 2 options --

1. Scan the entire page list returned after pinning and check whether it
satisfies a HW page size (starting with the default PAGE_SIZE and moving
higher). The page list and the selected HW size are then used to size the
SG table and build the minimal SG list, depending on the HW size. Similar
to sg_alloc_table_from_pages().

2. Leave the current pinning and SG list creation the same.
https://elixir.bootlin.com/linux/v4.18/source/drivers/infiniband/core/umem.c#L192

Scan the SG entries to get the page list and check whether it satisfies a
HW page size. Rebuild a new SG list (or collapse the current list) to have
minimal entries if a HW size (> PAGE_SIZE) is selected.

Thoughts?

Additionally, a question I have is what HW size we select for a contiguous
MR. For example, if a driver says it supports 4K, 64K and 1G, and we get
an MR of 128K that is contiguous, what does the core select for the HW
size, since all 3 can be done? Maybe the largest supported HW size that is
<= 1/2 of the MR len? So in this case, it would be 64K.

Shiraz

> > Not sure I follow the desync argument, as the pages are pinned and given
> > to us by the core. But if this is really a concern, there are other places
> > where the VMA scan is done too.
> >
> > https://elixir.bootlin.com/linux/v4.18-rc7/source/drivers/infiniband/hw/mlx4/mr.c#L389
>
> This is a bit tricky, but because it is done before get_user_pages,
> and only influences the access flags, it is not racy in a way that
> really matters. At worst a race would cause ib_umem_get to return fail
> when maybe it shouldn't have.
>
> > https://elixir.bootlin.com/linux/v4.18-rc6/source/drivers/infiniband/core/umem_odp.c#L341
>
> ODP is supposed to not be racy with changes to the mm as it doesn't
> use get_user_pages in the same way..
>
> But not sure what this is doing, or what IB_ACCESS_HUGETLB is supposed
> to be..
>
> Also kind of looks wrong, wouldn't recommend copying it :)
>
> Jason