Re: [PATCH rdma-next] RDMA/i40iw: Hold read semaphore while looking after VMA

On Tue, Aug 14, 2018 at 09:00:17AM -0500, Shiraz Saleem wrote:
> On Tue, Jul 31, 2018 at 08:31:10AM -0600, Jason Gunthorpe wrote:
> > On Tue, Jul 31, 2018 at 09:19:31AM -0500, Shiraz Saleem wrote:
> > > On Thu, Jul 26, 2018 at 10:46:09AM -0600, Jason Gunthorpe wrote:
> > > > On Thu, Jul 26, 2018 at 11:34:42AM -0500, Shiraz Saleem wrote:
> > > > > On Mon, Jul 23, 2018 at 04:29:04PM -0600, Jason Gunthorpe wrote:
> > > > > > On Mon, Jul 23, 2018 at 05:19:41PM -0500, Shiraz Saleem wrote:
> > > > > > > On Mon, Jul 23, 2018 at 02:39:56PM -0600, Jason Gunthorpe wrote:
> > > > > > > > On Sat, Jul 21, 2018 at 08:31:30AM -0500, Shiraz Saleem wrote:
> > > > > > > > > On Sun, Jul 01, 2018 at 10:36:24AM -0600, Leon Romanovsky wrote:
> > > > > > > > > > From: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > > > > > > > > > 
> > > > > > > > > > VMA lookup is supposed to be performed while mmap_sem is held.
> > > > > > > > > > 
> > > > > > > > > > Fixes: f26c7c83395b ("i40iw: Add 2MB page support")
> > > > > > > > > > Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > > > > > > > > >
> > > > > > > > > Thanks for fixing this bug.
> > > > > > > > > 
> > > > > > > > > Acked-by: Shiraz Saleem <shiraz.saleem@xxxxxxxxx> 
> > > > > > > > 
> > > > > > > > This code in the driver all looks wrong to me, it should not rely on
> > > > > > > > the huge pages flag and it shouldn't do this VM walking.
> > > > > > > >
> > > > > > > Hi Jason - Why is it that we can't use umem->hugetlb to determine if
> > > > > > > an MR is composed of huge pages? Isn't that why the flag was defined
> > > > > > > in the first place? All the VMAs w.r.t. the user-space buffer are
> > > > > > > scanned to make this determination in ib_umem_get.
> > > > > > > Isn't it better to use it than to implement it in each individual driver?
> > > > > > 
> > > > > > It was a mistake to have this flag and all drivers except i40iw and
> > > > > > bnxt don't use it. I wish to remove it.
> > > > > I see. What was the drawback to the approach?
> > > > 
> > > > It is just wrongly designed.. page merging is about combining physical
> > > > addresses within a HW supported size list, not about matching the CPU
> > > > huge page size(s) to HW.
> > > 
> > > Well ok. That is a different way of looking at it.
> > > Our assumption was tied to the fact that the OS makes
> > > available only a few page sizes (depending on the CPU arch) and
> > > huge pages fall in that bracket. So it made sense for it to be
> > > supported by the core. But your approach seems to give more
> > > flexibility and gets rid of the locking. 
> > 
> > There are various patches that produce contiguous physical allocations
> > in various ways beyond the huge page size restriction. Drivers must
> > look at the physical pages to take advantage of that.
> > 
> > > > We should never even care about 'huge pages' in the VM sense, and
> > > > certainly never scan a VM, and it is buggy/racy to do so, I
> > > > think. Leon already noticed missing locking, but I think there is
> > > > worse here since we have desync'd with the lock that get_user_pages
> > > > was run under.
> > > 
> > > OK. At this point, we are looking to fix this in i40iw with an algo that
> > > looks at the physical page addresses.
> > 
> > Ideally I'd like this in the core code.. So if you are going to write
> > and test something new, let's do that.
> 
> 
> Ok. As you suggested in the earlier thread, we could have drivers pass in a bitmap of supported
> HW pg sizes to ib_umem_get, and it would return a selected size along with whether the
> MR is contiguous or not.
> 
> With respect to the algo in ib_umem_get, I am considering 2 options --
> 
> 1. Scan the entire page list returned after pinning to check if it satisfies a
> HW pg size (starting with the default PAGE_SIZE and going higher). The page_list and the
> selected HW size are then used to size the SG table and build the minimal SG list
> depending on the HW size. Similar to sg_alloc_table_from_pages().

It is unclear if saving memory like this is really needed or not, but
yes, that is possible.
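
To make the bitmap idea from above concrete, the interface could end up
looking roughly like this (just a sketch to illustrate; the extra
supported_pagesz argument is made up and does not exist today):

struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
			    size_t size, int access, int dmasync,
			    u64 supported_pagesz);

where supported_pagesz is the driver's bit mask of HW page sizes and the
core records the size it actually selected (eg in the existing
umem->page_shift) so the driver can build its HW page table from it.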

> 2. Leave the current pinning and SG list creation same.
> https://elixir.bootlin.com/linux/v4.18/source/drivers/infiniband/core/umem.c#L192

Optimizing the SG table should optimize the dma_mapping overhead,
particularly with IOMMU, so I'd prefer to start by combining sg
entries.
 
> Additionally a question I have is what HW size do we select for a contiguous MR.
> For example, if a driver says it supports 4K, 64K and 1G, and we get an MR of 128K
> that is contiguous, what does the core select for HW size since all 3 can be done?
> Maybe the largest supported HW size that is <= 1/2 of MR len?
> So in this case, it would be 64K. 

It should break it into the largest driver supported page size - the
goal of the algorithm is to give the driver the smallest list of
pages.
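
For a fully contiguous MR the selection is basically just alignment
math, something like this untested helper (the name is made up, it only
assumes the usual kernel bitops helpers):

/* Pick the largest HW supported page size that both the physical start
 * and the length of a contiguous region are aligned to. For a 128K
 * region on a 4K/64K/1G capable device this picks 64K, ie two pages. */
static u64 best_hw_page_size(u64 supported_pagesz, u64 phys, u64 length)
{
	/* Largest power of two that divides both the start and the length */
	unsigned int align_bits = __ffs64(phys | length);
	u64 sizes = supported_pagesz & GENMASK_ULL(align_bits, 0);

	if (!sizes)
		return 0;	/* nothing supported fits, caller must fall back */
	/* fls64 is 1 based, subtract one to get back to a shift */
	return 1ULL << (fls64(sizes) - 1);
}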

The tricky bit is that when we create the sg list the DMA mapping will
preserve the sizes, but not necessarily the alignment. So we might get
a 2M page starting at some random unaligned 4k offset. HW can't deal
with this, so another pass is needed post DMA mapping, working on the
physical addresses, to break up any unaligned entries.

AFAIK this is an obscure case that does not happen in the common high
performance path we expect.

The HW driver would have to iterate over the SG list with a helper
something like

for (ib_umem_start_phys_iter(umem, &state); state->len;
     ib_umem_next_phys_iter(umem, &state)) {
     .. add to hw table..->phys = state->physaddr;
     .. add to hw table..->len = state->len;
}

Where the helpers guarantee that state->physaddr is properly aligned
for state->len (they only make things smaller, not bigger)
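
I haven't written those helpers, but very roughly the state and the
splitting could look like this (untested, all the names are made up,
it ignores the umem->nmap bookkeeping, and it assumes the driver's mask
was stashed in the umem at create time and always includes PAGE_SIZE):

struct ib_umem_phys_iter {
	struct scatterlist *sg;	/* current DMA mapped entry */
	u64 sg_offset;		/* bytes already consumed from it */
	u64 physaddr;		/* chunk handed to HW, aligned to len */
	u64 len;		/* power of two from the supported mask */
	u64 supported_pagesz;	/* driver bit mask of HW page sizes */
};

/* Cut the next naturally aligned chunk out of the current sg entry,
 * shrinking the size until physaddr is aligned for it. */
static void ib_umem_phys_iter_chunk(struct ib_umem_phys_iter *state)
{
	u64 phys = sg_dma_address(state->sg) + state->sg_offset;
	u64 left = sg_dma_len(state->sg) - state->sg_offset;
	unsigned int bits = min_t(unsigned int, __ffs64(phys),
				  fls64(left) - 1);
	/* Assumes at least the PAGE_SIZE bit is set in the mask, a real
	 * version would have to enforce that. */
	u64 sizes = state->supported_pagesz & GENMASK_ULL(bits, 0);

	state->physaddr = phys;
	state->len = 1ULL << (fls64(sizes) - 1);
	state->sg_offset += state->len;
}

void ib_umem_start_phys_iter(struct ib_umem *umem,
			     struct ib_umem_phys_iter *state)
{
	state->sg = umem->sg_head.sgl;
	state->sg_offset = 0;
	/* made up field, the mask would be saved when the umem is created */
	state->supported_pagesz = umem->supported_pagesz;
	ib_umem_phys_iter_chunk(state);
}

void ib_umem_next_phys_iter(struct ib_umem *umem,
			    struct ib_umem_phys_iter *state)
{
	if (state->sg_offset >= sg_dma_len(state->sg)) {
		state->sg = sg_next(state->sg);
		state->sg_offset = 0;
		if (!state->sg) {
			state->len = 0;	/* terminates the loop above */
			return;
		}
	}
	ib_umem_phys_iter_chunk(state);
}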

Something that looks kind of like this off-the-cuff untested thing, at
least for the sg pass:

struct scatterlist *ib_umem_add_page_sg(struct scatterlist *sg_end,
					struct page *page, unsigned long npages,
					u64 supported_pagesz)
{
	while (npages) {
		unsigned long pfn = page_to_pfn(page);
		/* How many trailing zero bits are in the page's physical
		 * address? */
		unsigned int num_zeros = __ffs(pfn) + PAGE_SHIFT;
		unsigned long len_pages;
		unsigned int page_bit;

		/* Figure out the highest set bit in the supported sizes that
		 * is below the alignment of the page. ie if we have 12 zeros
		 * then we want to return bit 12 is set (ie the value 13).
		 * This algorithm guarantees that all pages are at least
		 * aligned on their size in the physical address space.
		 */
		page_bit = fls64(supported_pagesz & GENMASK_ULL(num_zeros, 0));
		if (WARN_ON(page_bit <= PAGE_SHIFT))
			return NULL;

		len_pages = min_t(unsigned long, npages,
				  1UL << (page_bit - 1 - PAGE_SHIFT));
		sg_set_page(sg_end, page, len_pages * PAGE_SIZE, 0);
		sg_end = sg_next(sg_end);
		page = pfn_to_page(pfn + len_pages);
		npages -= len_pages;
	}
	return sg_end;
}

/*
 * page_list is an array of npages struct page pointers, where each page is
 * PAGE_SIZE. supported_pagesz is a bit mask of driver supported page sizes,
 * where each bit indicates a supported page size, ie (1<<12) indicates that
 * 4K pages are supported.
 */
struct scatterlist *ib_umem_add_page_list(struct scatterlist *sg_end,
					  struct page **page_list,
					  unsigned long npages,
					  u64 supported_pagesz)
{
	unsigned long i = 0;

	while (i != npages) {
		struct page *first_page = page_list[i];
		unsigned long first_pfn = page_to_pfn(first_page);
		unsigned long len;

		/* Compute the number of contiguous pages we have starting at
		 * i */
		for (len = 0; i != npages &&
			      first_pfn + len == page_to_pfn(page_list[i]);
		     i++, len++)
			;

		/* Place the N contiguous pages into the scatter table */
		sg_end = ib_umem_add_page_sg(sg_end, first_page, len,
					     supported_pagesz);
		if (unlikely(!sg_end))
			return NULL;
	}
	return sg_end;
}
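
And the caller in ib_umem_get would then do roughly this with it (sketch
only; page_list, npages and supported_pagesz come from the surrounding
ib_umem_get context, sg table chaining and most error unwinding are
ignored for brevity):

	struct scatterlist *sg_end;
	int nents;

	if (sg_alloc_table(&umem->sg_head, npages, GFP_KERNEL))
		return ERR_PTR(-ENOMEM);

	sg_end = ib_umem_add_page_list(umem->sg_head.sgl, page_list, npages,
				       supported_pagesz);
	if (!sg_end)
		return ERR_PTR(-EINVAL);

	/* Usually far fewer than npages entries get used; this pointer math
	 * is only valid while the table is a single allocation. */
	nents = sg_end - umem->sg_head.sgl;
	sg_mark_end(&umem->sg_head.sgl[nents - 1]);
	umem->sg_head.nents = nents;

	umem->nmap = ib_dma_map_sg(context->device, umem->sg_head.sgl, nents,
				   DMA_BIDIRECTIONAL);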


