On Wed, Aug 22, 2018 at 03:43:56PM -0600, Jason Gunthorpe wrote:
> On Wed, Aug 22, 2018 at 04:23:08PM -0500, Shiraz Saleem wrote:
> >
> > > Ideally I'd like this in the core code.. So if you are going to
> > > write it, it should break it into the largest driver supported page
> > > size - the goal of the algorithm is to give the driver the smallest
> > > list of pages.
> > >
> > > The tricky bit is that when we create the sg list the DMA mapping
> > > will preserve the sizes, but not necessarily the alignment. So we
> > > might get a 2M page starting at some random unaligned 4k offset. HW
> > > can't deal with this, so another pass is needed post DMA mapping
> > > working on the physical pages that would break up any unalignment.
> > >
> > > AFAIK this is an obscure case that does not happen in the common
> > > high performance path we expect.
> > >
> > > The HW driver would have to iterate over the SG list with a helper
> > > something like
> > >
> > > for (ib_umem_start_phys_iter(umem, &state); state->len;
> > >      ib_umem_next_phys_iter(umem, &state)) {
> > >         .. add to hw table..->phys = state->physaddr;
> > >         .. add to hw table..->len = state->len;
> > > }
> > >
> > > Where the helpers guarantee that state->physaddr is properly aligned
> > > for state->len (they only make things smaller, not bigger)
> >
> > Not sure I am clear with this part "they only make things smaller,
> > not bigger". If the core chose 2M HW PG size and the start of the
> > MR is at an offset into the 2M page, then in driver we supply a 2M
> > aligned address from the SG entry into HW table, but HW can still do
> > 2M. Is this the case you are referring to?
>
> Sort of. If there is no DMA mapping then the alignment and sizing here
> should match the alignment and sizing created below, so nothing should
> change.
>
> With the IOMMU we might get a situation where a 2M aligned region is
> IOMMU mapped to a 2M unaligned region. In this case we have to break
> the 2M pages back up into 4k pages (or whatever).
>
> Or, another case, would be HW that uses a 64K native page size (eg
> PPC), and a driver that only supports 4K.
>
> In this case the 64K pages would be reduced to aligned 4k ranges for
> the driver.
>
> > > Something that looks kind of like this off the cuff untested thing,
> > > at least for the sg pass:
> > >
> > > struct scatterlist *ib_umem_add_page_sg(struct scatterlist *sg_end,
> > >                                         struct page *page, unsigned long npages,
> > >                                         unsigned long supported_pagesz)
> > > {
> > >         while (npages) {
> > >                 unsigned long pfn = page_to_pfn(page);
> > >                 /* How many trailing zero bits are in the page's kernel
> > >                  * address? */
> > >                 unsigned int num_zeros = ffs(pfn) + PAGE_SHIFT - 1;
> > >                 unsigned int len_pages;
> > >                 unsigned int page_bit;
> > >
> > >                 /* Figure out the highest set bit in the supported sizes that
> > >                  * is below the alignment of the page. ie if we have 12 zeros
> > >                  * then we want to return bit 12 is set (ie the value 13).
> > >                  * This algorithm guarantees that all pages are at least
> > >                  * aligned on their size in the kernel address.
> > >                  */
> >
> > Comment doesn't align with code.
> > page_bit = fls(supported_pagesz & ((1 << (num_zeros + 1)) - 1)); ?
>
> I didn't test it, comment is describing what to do :)
>
> > >                 page_bit = fls(supported_pagesz & ((num_zeros << 1) - 1));
> > >                 if (WARN_ON(page_bit < PAGE_SHIFT || page_bit == 0))
> > >                         return NULL;
> > >
> > >                 len_pages =
> > >                         min_t(int, npages, (1 << (page_bit - 1 - PAGE_SHIFT)));
> >
> > For a 2M huge page where the start addr is offset into a 2M page, it
> > looks like we are creating 4K SG entries up to the naturally aligned
> > boundary followed by single SG 2M entries. Is that the intent?
>
> Yes. Or more specifically, this is a greedy algorithm: it always
> selects the largest aligned block it can, and consumes those
> pages.
>
> For instance, if we have a 4M region composed of 2M huge pages, the
> driver can support 2M, 1M and 4K, and the start of the MR is offset 4k
> into the region, then we should get 255 4K pages, 1 1M page and 1 2M
> page.
>
> > >                 sg_set_page(sg_end, page, len_pages * PAGE_SIZE, 0);
> > >                 sg_end = sg_next(sg_end);
> > >                 page = pfn_to_page(pfn + len_pages);
> >
> > npages -= len_pages; ?
>
> Yep
>
> > >         return sg_end;
> > > }
> > >
> > > /*
> > >  * page_list is an array of npage struct page pointers, where each
> > >  * page is PAGE_SIZE. supported_pagesz is a bit mask of driver
> > >  * supported page sizes where each bit indicates a supported page
> > >  * size, ie (1<<12) indicates that 4K pages are supported.
> > >  */
> >
> > Does the page_list array contain all the npages of the MR when this
> > function is called? Or is this function called after each pinning of
> > min(npages, 512) pages?
> > https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/umem.c#L180
> > I am guessing it's the former, in which case don't we have the same
> > problem of saving off memory as Option #1 I proposed?
>
> It has to be the page list after pinning, yes.
>
> And yes, this doesn't attempt to solve the problem with overallocating
> the sg table. We can solve it in a couple of different ways, but that
> can be a followup, IMHO.
>
> After thinking about this some more I might suggest a slight change..
>
> Have the sg table contain the largest contiguous regions possible, and
> ignore alignment or driver capability. This minimizes the number of sg
> table entries. Ie in the example above it would simply be an SG entry
> of length 4M-4k.

OK.
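Just to check my understanding, the sg construction pass could then be
as simple as merging physically contiguous pages, something like the
below? Completely untested, ib_umem_add_contig_sg is a made-up name,
and it ignores the unsigned int length limit of a single sg entry for
the sake of the sketch.

/* Fold the pinned page_list into the fewest possible sg entries by
 * merging physically contiguous pages, ignoring alignment and driver
 * capability entirely. Returns the next unused sg entry.
 */
static struct scatterlist *ib_umem_add_contig_sg(struct scatterlist *sg,
                                                 struct page **page_list,
                                                 unsigned long npages)
{
        unsigned long first = 0;
        unsigned long i;

        for (i = 1; i <= npages; i++) {
                /* Close the run at the end of the list, or when the next
                 * page is not physically contiguous with the current one.
                 */
                if (i == npages ||
                    page_to_pfn(page_list[i]) !=
                            page_to_pfn(page_list[i - 1]) + 1) {
                        sg_set_page(sg, page_list[first],
                                    (i - first) * PAGE_SIZE, 0);
                        sg = sg_next(sg);
                        first = i;
                }
        }
        return sg;
}

If that matches what you had in mind, all of the alignment handling
then lives in the driver-side helper.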
> This is friendliest to the IOMMU, as now the IOMMU could use IOMMU huge
> pages (if supported) and the IOMMU huge page size may differ from the
> driver supported size.
>
> Next, when processing in the driver, break up the giant sg entries
> with a helper, as described. The helper is a bit more complicated
> because the realignment work is pushed into it, but not seriously so.
>
> Second, I think we have some HW that can handle only one page size? ie
> if they want to use 2M pages then all pages must be 2M?

Yes.

> Another helper would be needed to sweep the sg list to get the best
> page size, like this:
>
> best_size_bitmap = ib_umem_find_single_page_size(umem, supported_sizes_bitmap);
> for (ib_umem_start_phys_iter(umem, &state, best_size_bitmap); state->len;
>      ib_umem_next_phys_iter(umem, &state)) {
>         .. add to hw table..->phys = state->physaddr;
>         .. add to hw table..->len = state->len;
> }
>
> HW that can support mixed pages would just use best_size ==
> supported_sizes bitmaps.

OK. Let me spend some time now on the implementation aspects.
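For the driver-side pass, here is roughly what I picture for the
iterator. Completely untested; struct ib_phys_iter and all the field
names are provisional, and it assumes bit PAGE_SHIFT is always set in
the supplied bitmap and that sg entries are PAGE_SIZE aligned and
sized, so a valid page size always exists:

struct ib_phys_iter {
        struct scatterlist *sg; /* current sg entry, NULL before start */
        u64 physaddr;           /* start of the range being returned */
        u64 len;                /* length of the range being returned */
        u64 remaining;          /* bytes left in the current sg entry */
        u64 supported;          /* bitmap of allowed HW page sizes */
};

static void ib_umem_next_phys_iter(struct ib_umem *umem,
                                   struct ib_phys_iter *state)
{
        u64 align, limit;

        /* Move to the next sg entry once the current one is consumed */
        if (!state->remaining) {
                state->sg = state->sg ? sg_next(state->sg) :
                                        umem->sg_head.sgl;
                if (!state->sg) {
                        state->len = 0; /* terminates the caller's loop */
                        return;
                }
                state->physaddr = sg_dma_address(state->sg);
                state->remaining = sg_dma_len(state->sg);
        } else {
                state->physaddr += state->len;
        }

        /* Largest supported size that is <= both the natural alignment
         * of physaddr and the bytes remaining; the result only ever
         * shrinks relative to the sg entry, never grows.
         */
        align = state->physaddr ? (state->physaddr & -state->physaddr) :
                                  U64_MAX;
        limit = min_t(u64, align, rounddown_pow_of_two(state->remaining));
        state->len = 1ULL <<
                (fls64(state->supported & (limit | (limit - 1))) - 1);
        state->remaining -= state->len;
}

static void ib_umem_start_phys_iter(struct ib_umem *umem,
                                    struct ib_phys_iter *state,
                                    u64 supported_sizes)
{
        memset(state, 0, sizeof(*state));
        state->supported = supported_sizes;
        ib_umem_next_phys_iter(umem, state);
}

Since this works on the DMA addresses, the IOMMU-unaligned case falls
out naturally: an unaligned 2M chunk just degrades into whatever
smaller supported sizes its alignment allows.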
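And for the single-page-size HW, something along these lines. Again
untested and just my reading of your description: it assumes the HW
can apply a first byte offset, so the first entry only constrains where
it ends and the last one only where it begins, and a real version would
also cap the result by the total MR length.

/* Sweep the sg list and pick the largest page size usable if the HW
 * must use one page size for the whole MR. Interior entries constrain
 * both their start and their end. Returns a one-bit bitmap suitable
 * for ib_umem_start_phys_iter(); assumes bit PAGE_SHIFT is set in
 * supported_sizes.
 */
static u64 ib_umem_find_single_page_size(struct ib_umem *umem,
                                         u64 supported_sizes)
{
        struct scatterlist *sg;
        u64 mask = 0;
        int i;

        for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
                if (i != 0)                     /* start must be aligned */
                        mask |= sg_dma_address(sg);
                if (i != umem->nmap - 1)        /* end must be aligned */
                        mask |= sg_dma_address(sg) + sg_dma_len(sg);
        }

        /* The lowest set bit of mask bounds the page size; keep only
         * the supported sizes at or below it and return the largest.
         */
        if (mask)
                supported_sizes &= (mask & -mask) | ((mask & -mask) - 1);
        return 1ULL << (fls64(supported_sizes) - 1);
}

Mixed-page HW would skip this and hand its full supported_sizes bitmap
straight to ib_umem_start_phys_iter(), as you describe.

Shiraz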