On Wed, Aug 22, 2018 at 03:43:56PM -0600, Jason Gunthorpe wrote: > On Wed, Aug 22, 2018 at 04:23:08PM -0500, Shiraz Saleem wrote: > > > > > Ideally I'd like this in the core code.. So if you are going to write > > > It should break it into the largest driver supported page size - the > > > goal of the algorithm is to give the driver the smallest list of > > > pages. > > > > > > The tricky bit is that when we create the sg list the DMA mapping will > > > preserve the sizes, but not necessarily the alignment. So we might get > > > a 2M page starting at some random unaligned 4k offset. HW can't deal > > > with this, so another pass is needed post DMA mapping working on the > > > physical pages that would break up any unalignment. > > > > > > AFAIK this is a obscure case that does not happen in the common high > > > performance path we expect. > > > > > > The HW driver would have to iterate ove the SG list with a helper > > > something like > > > > > > for (ib_umem_start_phys_iter(umem, &state); state->len; > > > ib_umem_next_phys_iter(umem, &state) { > > > .. add to hw table..->phys = state->physaddr; > > > .. add to hw table..->len = state->len; > > > } > > > > > > Where the helpers guarantee that state->physaddr is properly aligned > > > for state->len (they only make things smaller, not bigger) > > > > Not sure I am clear with this part "they only make things smaller, > > not bigger". If the core chose 2M HW PG size and the start of the > > MR is at an offset into the 2M page, then in driver we supply a 2M > > aligned address from the SG entry into HW table, but HW can still do > > 2M. Is this the case you are referring to? > > Sort of, if there is no DMA mapping then the alignment and sizing here > should match the alignment and sizing created below, so nothing should > change. > > With the IOMMU we might get a situation where a 2M aligned region is > IOMMU mapped to a 2M unaligned region. In this case we have to break > the 2M pages back up into 4k pages (or whatever) > > Or, another case, would be HW that uses a 64K native page size (eg > PPC), and a driver that only supports 4K. > > In this case the 64K pages would be reduced to aligned 4k ranges for > the driver. > > > > Something that looks kind of like this off the cuff untested thing, at > > > least for the sg pass: > > > > > > struct scatterlist *ib_umem_add_page_sg(struct scatterlist *sg_end, > > > struct page *page, unsigned long npages, > > > unsigned long supported_pagesz) > > > { > > > while (npages) { > > > unsigned long pfn = page_to_pfn(page); > > > /* How many trailing zero bits are in the page's kernel > > > * address? */ > > > unsigned int num_zeros = ffs(pfn) + PAGE_SHIFT - 1; > > > unsigned int len_pages; > > > unsigned int page_bit; > > > > > > /* Figure out the highest set bit in the supported sizes that > > > * is below the alignment of the page. ie if we have 12 zeros > > > * then we want to return bit 12 is set (ie the value 13). > > > * This algorithm guarantees that all pages are at least > > > * aligned on their size in the kernel address. > > > */ > > > > Comment doesnt align with code. page_bit = fls(supported_pagesz & ((1 << (num_zeros + 1)) - 1)); ? > > I didn't test it, comment is describing what to do :) > > > > page_bit = fls(supported_pagesz & ((num_zeros << 1) - 1)); > > > if (WARN_ON(page_bit < PAGE_SHIFT || page_bit == 0)) > > > return NULL; > > > > > > len_pages = > > > min_t(int, npages, (1 << (page_bit - 1 - PAGE_SHIFT))); > > > > For a 2M huge page where the start addr is offset into a 2M page, it looks like > > we are creating 4K SG entries upto the naturally aligned boundary followed by > > single SG 2M entries. Is that the intent? > > Yes. Or more specifically, this is a greedy algorithim, it always > selects the largest aligned block it can, and consumes those > pages. > > For instance if we have a 4M region composed of 2M huge pages > And the driver can support 2M,1M and 4k > and the start of the MR is offset 4k into the region > > We should get 255 4K pages, 1 1M page, 1 2M page. > > > > sg_set_page(sg_end, page, len_pages * PAGE_SIZE, 0); > > > sg_end = sg_next(sg_end); > > > page = pfn_to_page(pfn + len_pages); > > > > npages -= len_pages;? > > Yep > > > > return sg_end; > > > } > > > > > > /* > > > * page_list is an array of npage struct page pointers, where each page is > > > * PAGE_SIZE supported_pagesz is a bit mask of driver supported page sizes > > > * where each bit indicates a supported page size, ie (1<<12) indicates that > > > * 4K pages are supported. > > > */ > > > > Does page_list array contains all the npages of the MR when this function is called? > > Or is this function called after each pinning of min (npages, 512) pages? > > https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/umem.c#L180 > > I am guessing its the former in which case, dont we have the same problem of saving > > off memory as Option #1 I proposed? > > It has to be the page list after pinning, yes. > > And yes, this doesn't attempt to solve the problem with overallocating > the sg table. We can solve in a couple of different ways, but that can > be a followup, IMHO. > > After thinking about this some more I might suggest a slight changes.. > > Have the sg table contain the largest contiguous regions possible, and > ignore alignment or driver capability. This minimizes the sg table > entry size. Ie in the example above it would simply be a SG entry of > length 4M-4k OK. > This is friendliest to the IOMMU, as now the IOMMU could use IOMMU huge > pages (if supported) and the IOMMU huge page size may differ from the > driver supported size. > > Next, when processing in the driver, break up the giant sg entries > with a helper, as described. The helper is a bit more complicated > because the realignment work is pushed into it, but not seriously so. > > Second, I think we have some HW that can handle only one page size? ie > if they want to use 2M pages then all pages must be 2M? Yes. > Another helper would be needed to sweep the sg list to get the best > page size, like this: > > best_size_bitmap = ib_umem_find_single_page_size(umem, supported_sizes_bitmap); > for (ib_umem_start_phys_iter(umem, &statem, best_size_bitmap); state->len; > ib_umem_next_phys_iter(umem, &state) { > .. add to hw table..->phys = state->physaddr; > .. add to hw table..->len = state->len; > } > > HW that can support mixed pages would just use best_size == > supported_sizes bitmaps > OK. Let me spend some time now on the implementation aspects. Shiraz