On Wed, Aug 22, 2018 at 04:23:08PM -0500, Shiraz Saleem wrote:
> > Ideally I'd like this in the core code.. So if you are going to write
> > it, it should break it into the largest driver supported page size -
> > the goal of the algorithm is to give the driver the smallest list of
> > pages.
> >
> > The tricky bit is that when we create the sg list the DMA mapping will
> > preserve the sizes, but not necessarily the alignment. So we might get
> > a 2M page starting at some random unaligned 4k offset. HW can't deal
> > with this, so another pass is needed post DMA mapping working on the
> > physical pages that would break up any unalignment.
> >
> > AFAIK this is an obscure case that does not happen in the common high
> > performance path we expect.
> >
> > The HW driver would have to iterate over the SG list with a helper
> > something like
> >
> > for (ib_umem_start_phys_iter(umem, &state); state->len;
> >      ib_umem_next_phys_iter(umem, &state)) {
> >         .. add to hw table..->phys = state->physaddr;
> >         .. add to hw table..->len = state->len;
> > }
> >
> > Where the helpers guarantee that state->physaddr is properly aligned
> > for state->len (they only make things smaller, not bigger)
>
> Not sure I am clear with this part "they only make things smaller,
> not bigger". If the core chose 2M HW PG size and the start of the
> MR is at an offset into the 2M page, then in driver we supply a 2M
> aligned address from the SG entry into HW table, but HW can still do
> 2M. Is this the case you are referring to?

Sort of, if there is no DMA mapping then the alignment and sizing here
should match the alignment and sizing created below, so nothing should
change.

With the IOMMU we might get a situation where a 2M aligned region is
IOMMU mapped to a 2M unaligned region. In this case we have to break
the 2M pages back up into 4k pages (or whatever)

Or, another case, would be HW that uses a 64K native page size (eg
PPC), and a driver that only supports 4K.
In this case the 64K pages would be reduced to aligned 4k ranges for
the driver.

> > Something that looks kind of like this off the cuff untested thing,
> > at least for the sg pass:
> >
> > struct scatterlist *ib_umem_add_page_sg(struct scatterlist *sg_end,
> >                                         struct page *page,
> >                                         unsigned long npages,
> >                                         unsigned long supported_pagesz)
> > {
> >         while (npages) {
> >                 unsigned long pfn = page_to_pfn(page);
> >                 /* How many trailing zero bits are in the page's
> >                  * kernel address? */
> >                 unsigned int num_zeros = ffs(pfn) + PAGE_SHIFT - 1;
> >                 unsigned int len_pages;
> >                 unsigned int page_bit;
> >
> >                 /* Figure out the highest set bit in the supported
> >                  * sizes that is below the alignment of the page. ie
> >                  * if we have 12 zeros then we want to return bit 12
> >                  * is set (ie the value 13). This algorithm
> >                  * guarantees that all pages are at least aligned on
> >                  * their size in the kernel address.
> >                  */
>
> Comment doesn't align with code.
> page_bit = fls(supported_pagesz & ((1 << (num_zeros + 1)) - 1)); ?

I didn't test it, the comment is describing what to do :)

> >                 page_bit = fls(supported_pagesz & ((num_zeros << 1) - 1));
> >                 if (WARN_ON(page_bit < PAGE_SHIFT || page_bit == 0))
> >                         return NULL;
> >
> >                 len_pages =
> >                         min_t(int, npages, (1 << (page_bit - 1 - PAGE_SHIFT)));
>
> For a 2M huge page where the start addr is offset into a 2M page, it
> looks like we are creating 4K SG entries up to the naturally aligned
> boundary followed by single SG 2M entries. Is that the intent?

Yes. Or more specifically, this is a greedy algorithm, it always
selects the largest aligned block it can, and consumes those pages.

For instance if we have a 4M region composed of 2M huge pages, the
driver can support 2M, 1M and 4k, and the start of the MR is offset 4k
into the region, we should get 255 4K pages, 1 1M page, 1 2M page.

> >                 sg_set_page(sg_end, page, len_pages * PAGE_SIZE, 0);
> >                 sg_end = sg_next(sg_end);
> >                 page = pfn_to_page(pfn + len_pages);
>
> npages -= len_pages; ?
Yep

> >         }
> >         return sg_end;
> > }
> >
> > /*
> >  * page_list is an array of npage struct page pointers, where each
> >  * page is PAGE_SIZE. supported_pagesz is a bit mask of driver
> >  * supported page sizes where each bit indicates a supported page
> >  * size, ie (1<<12) indicates that 4K pages are supported.
> >  */
>
> Does the page_list array contain all the npages of the MR when this
> function is called? Or is this function called after each pinning of
> min(npages, 512) pages?
> https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/umem.c#L180
> I am guessing it's the former, in which case don't we have the same
> problem of saving off memory as Option #1 I proposed?

It has to be the page list after pinning, yes. And yes, this doesn't
attempt to solve the problem with overallocating the sg table. We can
solve that in a couple of different ways, but that can be a followup,
IMHO.

After thinking about this some more I might suggest a slight change..

Have the sg table contain the largest contiguous regions possible, and
ignore alignment or driver capability. This minimizes the sg table
entry size. Ie in the example above it would simply be a SG entry of
length 4M-4k.

This is friendliest to the IOMMU, as now the IOMMU could use IOMMU
huge pages (if supported), and the IOMMU huge page size may differ
from the driver supported size.

Next, when processing in the driver, break up the giant sg entries
with a helper, as described. The helper is a bit more complicated
because the realignment work is pushed into it, but not seriously so.

Second, I think we have some HW that can handle only one page size? ie
if they want to use 2M pages then all pages must be 2M? Another helper
would be needed to sweep the sg list to get the best page size, like
this:

best_size_bitmap = ib_umem_find_single_page_size(umem, supported_sizes_bitmap);
for (ib_umem_start_phys_iter(umem, &state, best_size_bitmap); state->len;
     ib_umem_next_phys_iter(umem, &state)) {
        .. add to hw table..->phys = state->physaddr;
        .. add to hw table..->len = state->len;
}

HW that can support mixed pages would just use best_size ==
supported_sizes bitmaps.

And we could support HW that can optimize with unaligned pages too, if
there is such a thing.

Jason