Re: [PATCH rdma-next] RDMA/i40iw: Hold read semaphore while looking after VMA

> > > Ideally I'd like this in the core code.. So if you are going to write
> > > and test something new, lets do that.

Hi Jason - Thanks for sharing your thoughts. I think I have a better handle on what needs to
be done, but I do have further questions. See inline.
 
> > 
> > Ok. As you suggested in an earlier thread, we could have drivers pass in a bitmap of
> > supported HW pg sizes to ib_umem_get, and it would return a selected size along with
> > whether the MR is contiguous or not.
> > 
> > With respect to the algo in ib_umem_get, I am considering 2 options --
> > 
> > 1. Scan the entire page list returned after pinning to check if it satisfies the
> > HW pg size (starting with the default PAGE_SIZE and going higher). The page_list and the
> > selected HW size are then used to size the SG table and build the minimal SG list
> > depending on the HW size. Similar to sg_alloc_table_from_pages().
> 
> It is unclear if saving memory like this is really needed or not, but
> yes, that is possible.
> 
> > 2. Leave the current pinning and SG list creation same.
> > https://elixir.bootlin.com/linux/v4.18/source/drivers/infiniband/core/umem.c#L192
> 
> Optimizing the SG table should optimize the dma_mapping overhead,
> particularly with IOMMU, so I'd prefer to start by combining sg
> entries.
>  
> > Additionally a question I have is what HW size do we select for a contiguous MR.
> > For example, if a driver says it supports 4K, 64K and 1G, and we get an MR of 128K
> > that is contiguous, what does the core select for HW size since all 3 can be done?
> > Maybe the largest supported HW size that is <= 1/2 of MR len?
> > So in this case, it would be 64K. 
> 
> It should break it into the largest driver supported page size - the
> goal of the algorithm is to give the driver the smallest list of
> pages.
> 
> The tricky bit is that when we create the sg list the DMA mapping will
> preserve the sizes, but not necessarily the alignment. So we might get
> a 2M page starting at some random unaligned 4k offset. HW can't deal
> with this, so another pass is needed post DMA mapping working on the
> physical pages that would break up any unalignment.
> 
> AFAIK this is an obscure case that does not happen in the common high
> performance path we expect.
> 
> The HW driver would have to iterate over the SG list with a helper
> something like
> 
> for (ib_umem_start_phys_iter(umem, &state); state->len;
>      ib_umem_next_phys_iter(umem, &state)) {
>      .. add to hw table..->phys = state->physaddr;
>      .. add to hw table..->len = state->len;
> }
> 
> Where the helpers guarantee that state->physaddr is properly aligned
> for state->len (they only make things smaller, not bigger)

Not sure I am clear on this part: "they only make things smaller, not bigger".
If the core chose a 2M HW PG size and the start of the MR is at an offset
into the 2M page, then in the driver we supply a 2M-aligned address from the
SG entry into the HW table, but HW can still do 2M. Is this the case you are referring to?

> Something that looks kind of like this off the cuff untested thing, at
> least for the sg pass:
> 
> struct scatterlist *ib_umem_add_page_sg(struct scatterlist *sg_end,
> 					struct page *page, unsigned long npages,
> 					unsigned long supported_pagesz)
> {
> 	while (npages) {
> 		unsigned long pfn = page_to_pfn(page);
> 		/* How many trailing zero bits are in the page's kernel
> 		 * address? */
> 		unsigned int num_zeros = ffs(pfn) + PAGE_SHIFT - 1;
> 		unsigned int len_pages;
> 		unsigned int page_bit;
> 
> 		/* Figure out the highest set bit in the supported sizes that
> 		 * is below the alignment of the page. ie if we have 12 zeros
> 		 * then we want to return bit 12 is set (ie the value 13).
> 		 * This algorithm guarantees that all pages are at least
> 		 * aligned on their size in the kernel address.
> 		 */

The comment doesn't align with the code. page_bit = fls(supported_pagesz & ((1 << (num_zeros + 1)) - 1)); ?

> 		page_bit = fls(supported_pagesz & ((num_zeros << 1) - 1));
> 		if (WARN_ON(page_bit < PAGE_SHIFT || page_bit == 0))
> 			return NULL;
> 
> 		len_pages =
> 			min_t(int, npages, (1 << (page_bit - 1 - PAGE_SHIFT)));

For a 2M huge page where the start addr is offset into a 2M page, it looks like
we are creating 4K SG entries up to the naturally aligned boundary, followed by
single 2M SG entries. Is that the intent?

> 		sg_set_page(sg_end, page, len_pages * PAGE_SIZE, 0);
> 		sg_end = sg_next(sg_end);
> 		page = pfn_to_page(pfn + len_pages);

npages -= len_pages;?

> 	}
> 	return sg_end;
> }
> 
> /*
>  * page_list is an array of npage struct page pointers, where each page is
>  * PAGE_SIZE. supported_pagesz is a bit mask of driver supported page sizes
>  * where each bit indicates a supported page size, ie (1<<12) indicates that
>  * 4K pages are supported.
>  */

Does the page_list array contain all the npages of the MR when this function is called?
Or is this function called after each pinning of min(npages, 512) pages?
https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/umem.c#L180
I am guessing it's the former, in which case don't we have the same problem of saving
memory as in Option #1 I proposed?

> struct scatterlist *ib_umem_add_page_list(struct scatterlist *sg_end,
> 					  struct page **page_list,
> 					  unsigned long npages,
> 					  u64 supported_pagesz)
> {
> 	unsigned long i = 0;
> 
> 	while (i != npages) {
> 		struct page *first_page = page_list[i];
> 		unsigned long first_pfn = page_to_pfn(first_page);
> 		unsigned long len;
> 
> 		/* Compute the number of contiguous pages we have starting at
> 		 * i */
> 		for (len = 0; i != npages &&
> 			      first_pfn + len == page_to_pfn(page_list[i]);
> 		     i++, len++)
> 			;
> 
> 		/* Place the N contiguous pages into the scatter table */
> 		sg_end = ib_umem_add_page_sg(sg_end, first_page, len,
> 					     supported_pagesz);
> 		if (unlikely(!sg_end))
> 			return NULL;
> 	}
> 	return sg_end;
> }


