On Wed, Oct 24, 2018 at 07:05:11PM -0500, Shiraz Saleem wrote: > On Mon, Oct 22, 2018 at 03:32:23PM -0600, Jason Gunthorpe wrote: > > On Fri, Oct 19, 2018 at 06:34:07PM -0500, Shiraz Saleem wrote: > > > This helper iterates through the SG list to find the best page size to use > > > from a bitmap of HW supported page sizes. Drivers that support multiple > > > page sizes, but not mixed pages in an MR can call this API. > > > > > > Suggested-by: Jason Gunthorpe <jgg@xxxxxxxx> > > > Reviewed-by: Michael J. Ruhl <michael.j.ruhl@xxxxxxxxx> > > > Signed-off-by: Shiraz Saleem <shiraz.saleem@xxxxxxxxx> > > > drivers/infiniband/core/umem.c | 95 ++++++++++++++++++++++++++++++++++++++++++ > > > include/rdma/ib_umem.h | 7 ++++ > > > 2 files changed, 102 insertions(+) > > > > > > +static int ib_umem_find_pg_bit(unsigned long phyaddr, > > > + unsigned long supported_pgsz) > > > > This returns an unsigned value > > > > > +{ > > > + unsigned long num_zeroes; > > > + int pg_bit; > > > > unsigned > > Right. > > [..] > > > + > > > + for_each_sg(umem->sg_head.sgl, sg, umem->sg_head.orig_nents, i) { > > > + dma_addr_start = sg_dma_address(sg); > > > + dma_addr_end = sg_dma_address(sg) + sg_dma_len(sg); > > > + pg_bit_start = ib_umem_find_pg_bit(dma_addr_start, supported_pgsz); > > > + pg_bit_end = ib_umem_find_pg_bit(dma_addr_end, supported_pgsz); > > > > This seems convoluted.. > > > > Shouldn't this just be another ffs on sg_dma_len() to compute the > > maximum possible page size? > > > > pg_bit_start = ib_umem_find_pg_bit(dma_addr_start, > > supported_pgsz & (BIT_ULL(__ffs(sg_dma_len(sg) + 1)) - 1)); > > > > Or so? then bit_end isn't needed and nor is the while loop below > > Looks simpler. But, would it work correct for last SGE? No, last SGE checks alignment of the start only First SGE checks alignment of the end only (presumably? depends on HW I guess) > > > + if (!i) { > > > + pg_bit_sg_chunk = max_t(int, pg_bit_start, pg_bit_end); > > > + bitmap = supported_pgsz; > > > + /* The start offset of the MR into a first _large_ page > > > + * should line up exactly for the user-space virtual buf > > > + * and physical buffer, in order to upgrade the page bit > > > + */ > > > + while (pg_bit_sg_chunk > PAGE_SHIFT) { > > > > Hurm.. This doesn't look quite right. > > > > This is a HW restriction where we must have RDMA_ADDR & PAGE_SIZE-1 == > > 0, for all the interior pages, right? ie the HW is computing the > > offset into the DMA list naively? > > Yes. For VA-based, least significant bits (12 bits for 4K page size, > 21 bits for 2M page size etc.) of the VA indicate offset. I wonder if all HW works this way or if we need a flag. > > First of all, umem->address is the wrong thing to use, the RDMA > > virtual MR base address could be zero (ie IB_ZERO_BASED was used) > > there should be an accessor helper for this. > > Yes. Zero based VA was not considered. > > > Second this should only be checked once, not every loop. If we know > > the RDMA MR base address, the length of the first sgl, and the page > > size, we can do a single computation to reduce the page size to make > > it sufficiently aligned to the MR base. > > I am not on this page yet (no pun intended :) ). But will spend some > time figuring how this can be done. basically compute the 'end' of the first page in virtual space and that is initial minimum alignement requirement. Ah.. but I suppose this has to be checked again any time the page size is reduced, since the page forcing the reduction may not satisfy the requirement. Still this restriction should be doable without the inner while loop. Jason