On Wed, Aug 22, 2018 at 04:23:08PM -0500, Shiraz Saleem wrote:
> > Ideally I'd like this in the core code.. So if you are going to write
> > it, it should break it into the largest driver supported page size -
> > the goal of the algorithm is to give the driver the smallest list of
> > pages.
> >
> > The tricky bit is that when we create the sg list the DMA mapping will
> > preserve the sizes, but not necessarily the alignment. So we might get
> > a 2M page starting at some random unaligned 4k offset. HW can't deal
> > with this, so another pass is needed post DMA mapping working on the
> > physical pages that would break up any unalignment.
> >
> > AFAIK this is an obscure case that does not happen in the common high
> > performance path we expect.
> >
> > The HW driver would have to iterate over the SG list with a helper
> > something like
> >
> > for (ib_umem_start_phys_iter(umem, &state); state->len;
> >      ib_umem_next_phys_iter(umem, &state)) {
> >         .. add to hw table..->phys = state->physaddr;
> >         .. add to hw table..->len = state->len;
> > }
> >
> > Where the helpers guarantee that state->physaddr is properly aligned
> > for state->len (they only make things smaller, not bigger)
>
> Not sure I am clear with this part "they only make things smaller,
> not bigger". If the core chose 2M HW PG size and the start of the
> MR is at an offset into the 2M page, then in driver we supply a 2M
> aligned address from the SG entry into HW table, but HW can still do
> 2M. Is this the case you are referring to?

Sort of, if there is no DMA mapping then the alignment and sizing here
should match the alignment and sizing created below, so nothing should
change.

With the IOMMU we might get a situation where a 2M aligned region is
IOMMU mapped to a 2M unaligned region. In this case we have to break
the 2M pages back up into 4k pages (or whatever)

Or, another case, would be HW that uses a 64K native page size (eg
PPC), and a driver that only supports 4K.
In this case the 64K pages would be reduced to aligned 4k ranges for
the driver.

> > Something that looks kind of like this off the cuff untested thing,
> > at least for the sg pass:
> >
> > struct scatterlist *ib_umem_add_page_sg(struct scatterlist *sg_end,
> >                                         struct page *page,
> >                                         unsigned long npages,
> >                                         unsigned long supported_pagesz)
> > {
> >         while (npages) {
> >                 unsigned long pfn = page_to_pfn(page);
> >                 /* How many trailing zero bits are in the page's
> >                  * kernel address? */
> >                 unsigned int num_zeros = ffs(pfn) + PAGE_SHIFT - 1;
> >                 unsigned int len_pages;
> >                 unsigned int page_bit;
> >
> >                 /* Figure out the highest set bit in the supported
> >                  * sizes that is below the alignment of the page. ie
> >                  * if we have 12 zeros then we want to return bit 12
> >                  * is set (ie the value 13). This algorithm
> >                  * guarantees that all pages are at least aligned on
> >                  * their size in the kernel address.
> >                  */
>
> Comment doesn't align with code.
> page_bit = fls(supported_pagesz & ((1 << (num_zeros + 1)) - 1)); ?

I didn't test it, the comment is describing what to do :)

> >                 page_bit = fls(supported_pagesz & ((num_zeros << 1) - 1));
> >                 if (WARN_ON(page_bit < PAGE_SHIFT || page_bit == 0))
> >                         return NULL;
> >
> >                 len_pages =
> >                         min_t(int, npages, (1 << (page_bit - 1 - PAGE_SHIFT)));
>
> For a 2M huge page where the start addr is offset into a 2M page, it
> looks like we are creating 4K SG entries up to the naturally aligned
> boundary followed by single SG 2M entries. Is that the intent?

Yes. Or more specifically, this is a greedy algorithm, it always
selects the largest aligned block it can, and consumes those pages.

For instance if we have a 4M region composed of 2M huge pages, the
driver can support 2M, 1M and 4k, and the start of the MR is offset 4k
into the region, we should get 255 4K pages, 1 1M page, 1 2M page.

> >                 sg_set_page(sg_end, page, len_pages * PAGE_SIZE, 0);
> >                 sg_end = sg_next(sg_end);
> >                 page = pfn_to_page(pfn + len_pages);
>
> npages -= len_pages; ?
Yep

> >         }
> >         return sg_end;
> > }
> >
> > /*
> >  * page_list is an array of npage struct page pointers, where each
> >  * page is PAGE_SIZE. supported_pagesz is a bit mask of driver
> >  * supported page sizes where each bit indicates a supported page
> >  * size, ie (1<<12) indicates that 4K pages are supported.
> >  */
>
> Does the page_list array contain all the npages of the MR when this
> function is called? Or is this function called after each pinning of
> min(npages, 512) pages?
> https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/umem.c#L180
> I am guessing it's the former, in which case don't we have the same
> problem of saving off memory as Option #1 I proposed?

It has to be the page list after pinning, yes. And yes, this doesn't
attempt to solve the problem with overallocating the sg table. We can
solve that in a couple of different ways, but that can be a followup,
IMHO.

After thinking about this some more I might suggest a slight change..

Have the sg table contain the largest contiguous regions possible, and
ignore alignment or driver capability. This minimizes the sg table
entry size. Ie in the example above it would simply be a SG entry of
length 4M-4k.

This is friendliest to the IOMMU, as now the IOMMU could use IOMMU
huge pages (if supported), and the IOMMU huge page size may differ
from the driver supported size.

Next, when processing in the driver, break up the giant sg entries
with a helper, as described. The helper is a bit more complicated
because the realignment work is pushed into it, but not seriously so.

Second, I think we have some HW that can handle only one page size? ie
if they want to use 2M pages then all pages must be 2M? Another helper
would be needed to sweep the sg list to get the best page size, like
this:

best_size_bitmap = ib_umem_find_single_page_size(umem, supported_sizes_bitmap);
for (ib_umem_start_phys_iter(umem, &state, best_size_bitmap); state->len;
     ib_umem_next_phys_iter(umem, &state)) {
        .. add to hw table..->phys = state->physaddr;
        .. add to hw table..->len = state->len;
}

HW that can support mixed pages would just use best_size ==
supported_sizes bitmaps.

And we could support HW that can optimize with unaligned pages too, if
there is such a thing.

Jason