On Mon, Oct 22, 2018 at 03:32:23PM -0600, Jason Gunthorpe wrote:
> On Fri, Oct 19, 2018 at 06:34:07PM -0500, Shiraz Saleem wrote:
> > This helper iterates through the SG list to find the best page size to use
> > from a bitmap of HW supported page sizes. Drivers that support multiple
> > page sizes, but not mixed pages in an MR, can call this API.
> >
> > Suggested-by: Jason Gunthorpe <jgg@xxxxxxxx>
> > Reviewed-by: Michael J. Ruhl <michael.j.ruhl@xxxxxxxxx>
> > Signed-off-by: Shiraz Saleem <shiraz.saleem@xxxxxxxxx>
> > ---
> >  drivers/infiniband/core/umem.c | 95 ++++++++++++++++++++++++++++++++++++++++++
> >  include/rdma/ib_umem.h         |  7 ++++
> >  2 files changed, 102 insertions(+)
> >
> > +static int ib_umem_find_pg_bit(unsigned long phyaddr,
> > +			       unsigned long supported_pgsz)
>
> This returns an unsigned value
>
> > +{
> > +	unsigned long num_zeroes;
> > +	int pg_bit;
>
> unsigned

Right.

[..]

> > +
> > +	for_each_sg(umem->sg_head.sgl, sg, umem->sg_head.orig_nents, i) {
> > +		dma_addr_start = sg_dma_address(sg);
> > +		dma_addr_end = sg_dma_address(sg) + sg_dma_len(sg);
> > +		pg_bit_start = ib_umem_find_pg_bit(dma_addr_start, supported_pgsz);
> > +		pg_bit_end = ib_umem_find_pg_bit(dma_addr_end, supported_pgsz);
>
> This seems convoluted..
>
> Shouldn't this just be another ffs on sg_dma_len() to compute the
> maximum possible page size?
>
>    pg_bit_start = ib_umem_find_pg_bit(dma_addr_start,
>         supported_pgsz & (BIT_ULL(__ffs(sg_dma_len(sg)) + 1) - 1));
>
> Or so? then bit_end isn't needed and nor is the while loop below

Looks simpler. But would it work correctly for the last SGE? For HW that
supports 2M and 4K, for example, if the last SGE starts at a 2M-aligned
address and has a length of 200 4K pages, the equation above would compute
a pg_bit of 12 when it should be 21. The same applies to the first SGE,
I think.

> > +		if (!i) {
> > +			pg_bit_sg_chunk = max_t(int, pg_bit_start, pg_bit_end);
> > +			bitmap = supported_pgsz;
> > +			/* The start offset of the MR into a first _large_ page
> > +			 * should line up exactly for the user-space virtual buf
> > +			 * and physical buffer, in order to upgrade the page bit
> > +			 */
> > +			while (pg_bit_sg_chunk > PAGE_SHIFT) {
>
> Hurm.. This doesn't look quite right.
>
> This is a HW restriction where we must have RDMA_ADDR & PAGE_SIZE-1 ==
> 0, for all the interior pages, right? ie the HW is computing the
> offset into the DMA list naively?

Yes. For VA-based MRs, the least significant bits of the VA (12 bits for
a 4K page size, 21 bits for a 2M page size, etc.) indicate the offset.

> First of all, umem->address is the wrong thing to use, the RDMA
> virtual MR base address could be zero (ie IB_ZERO_BASED was used)
> there should be an accessor helper for this.

Yes. Zero-based VA was not considered.

> Second this should only be checked once, not every loop. If we know
> the RDMA MR base address, the length of the first sgl, and the page
> size, we can do a single computation to reduce the page size to make
> it sufficiently aligned to the MR base.

I am not on this page yet (no pun intended :)). But I will spend some
time figuring out how this can be done.

Shiraz
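
P.S. Here is a first, untested stab at what I think you are describing,
just to make it concrete while I work through it. The helper name and the
explicit 'virt' parameter are mine, not anything in the patch. The idea
is to walk the SGL once, OR into a mask every address bit that would break
the naive HW offset calculation, and then take the largest supported page
size at or below the lowest set bit of that mask. Skipping the
end-alignment check on the last SGE covers the 200 * 4K example, and
taking the MR start as a parameter would let a zero-based MR pass its
iova instead of umem->address:

/*
 * Rough, untested sketch -- not part of the patch.  'virt' is the
 * RDMA-visible start of the MR (the user VA, or the iova for an
 * IB_ZERO_BASED MR); umem->address is only used for the offset of
 * the user buffer into its first page.
 */
static unsigned int ib_umem_best_pg_bit(struct ib_umem *umem,
					unsigned long supported_pgsz,
					unsigned long virt)
{
	unsigned long pgoff = umem->address & ~PAGE_MASK;
	struct scatterlist *sg;
	u64 mask = 0, va = virt;
	int i;

	for_each_sg(umem->sg_head.sgl, sg, umem->sg_head.orig_nents, i) {
		/*
		 * HW takes the offset from the low bits of the VA, so
		 * the VA and the DMA address must agree in every bit
		 * below the page bit; any bit where they differ caps
		 * the usable page size.
		 */
		mask |= (sg_dma_address(sg) + pgoff) ^ va;
		va += sg_dma_len(sg) - pgoff;
		/*
		 * Interior chunk boundaries must be page aligned, but
		 * the last SGE may simply end short of its last large
		 * page, so it gets no end constraint.
		 */
		if (i != umem->sg_head.orig_nents - 1)
			mask |= va;
		pgoff = 0;
	}

	/* Largest supported size whose low bits are all clear in mask */
	if (mask)
		supported_pgsz &= GENMASK_ULL(__ffs64(mask), 0);
	if (!supported_pgsz)
		return 0;	/* nothing supported fits, caller must bail */
	return fls64(supported_pgsz) - 1;
}

If that holds up, the alignment of the MR base against the DMA addresses
is folded into the same mask as the per-SGE checks, so nothing is
re-evaluated on every iteration and the while loop goes away, as you
suggest.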