On Fri, Apr 19, 2019 at 08:43:49AM -0500, Shiraz Saleem wrote:
> This helper iterates through the SG list to find the best page size to use
> from a bitmap of HW supported page sizes. Drivers that support multiple
> page sizes, but not mixed sizes in an MR, can use this API.
>
> Suggested-by: Jason Gunthorpe <jgg@xxxxxxxx>
> Cc: Gal Pressman <galpress@xxxxxxxxxx>
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@xxxxxxxxx>
> Signed-off-by: Shiraz Saleem <shiraz.saleem@xxxxxxxxx>
>  drivers/infiniband/core/umem.c | 57 ++++++++++++++++++++++++++++++++++++++++++
>  include/rdma/ib_umem.h         | 19 ++++++++++++++
>  include/rdma/ib_verbs.h        | 24 ++++++++++++++++++
>  3 files changed, 100 insertions(+)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 7e912a9..8624ba1 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -127,6 +127,63 @@ static struct scatterlist *ib_umem_add_sg_table(struct scatterlist *sg,
>  }
>
>  /**
> + * ib_umem_find_best_pgsz - Find best HW page size to use for this MR
> + *
> + * @umem: umem struct
> + * @pgsz_bitmap: bitmap of HW supported page sizes
> + * @flags: see enum ib_umem_find_best_pgsz_flags
> + * @uvirt_addr: user-space virtual MR base address (provided if
> + * IB_UMEM_VA_BASED_OFFSET flag is set)
> + *
> + * This helper is intended for HW that support multiple page
> + * sizes but can do only a single page size in an MR.
> + *
> + * Returns 0 if the umem requires page sizes not supported by
> + * the driver to be mapped. Drivers always supporting PAGE_SIZE
> + * or smaller will never see a 0 result.
> + */
> +unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> +				     unsigned long pgsz_bitmap,
> +				     unsigned int flags,
> +				     unsigned long uvirt_addr)
> +{
> +	struct scatterlist *sg;
> +	dma_addr_t mask = 0;
> +	unsigned int best_pg_bit;
> +	int i;
> +
> +	if (WARN_ON(!(pgsz_bitmap & GENMASK(PAGE_SHIFT, 0))))
> +		return 0;
> +
> +	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
> +		dma_addr_t dma_addr_start, dma_addr_end;
> +
> +		dma_addr_start = sg_dma_address(sg);
> +		dma_addr_end = sg_dma_address(sg) + sg_dma_len(sg);
> +		if (!i)
> +			/* first SGE: can start on partial page size.
> +			 * Ignore alignment of start addr.
> +			 * For HW that uses VA_BASED_OFFSET, minimal alignment
> +			 * restricted by end of the first page in virtual space.
> +			 */
> +			mask |= (flags & IB_UMEM_VA_BASED_OFFSET ?
> +				((uvirt_addr + sg_dma_len(sg)) | dma_addr_end) :
> +				dma_addr_end);

This doesn't look quite right to me..

Let's say we have a 2M page with physical address 0x2600000 mapped into
virtual address 0x6400000.

If we make an MR with VA 0x6400FFF and length 2M-4095, I expect the NIC
to use the 2M page size.

We will have dma_addr_start = 0x2600000 and sg_dma_len = 0x200000, as
the SGL is always rounded to pages.

So the above mask math will compute:

   (0x6400FFF + 0x200000) | 0x2800000 = 0x6e00fff

which leaves the low 0xfff bits set and wrongly rules out the 2M page
size.

I have a feeling the uvirt_addr term should be computed more like:

   if (flags & IB_UMEM_VA_BASED_OFFSET)
           mask |= uvirt_addr ^ umem->addr;

Since umem->addr & hw_page_mask implicitly includes the offset into the
page, if uvirt_addr has a different offset bit then we can't map 1:1
and thus have to reduce the page size.
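As a rough sketch of the whole computation with that change (an
illustration only, not the patch as merged; it assumes the struct
ib_umem field is named 'address' -- written as umem->addr in the prose
here -- and uses the standard __ffs/GENMASK/rounddown_pow_of_two
helpers):

/* Sketch: accumulate into 'mask' every bit that breaks alignment;
 * the lowest set bit of the result bounds the usable page size.
 */
unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
				     unsigned long pgsz_bitmap,
				     unsigned int flags,
				     unsigned long uvirt_addr)
{
	struct scatterlist *sg;
	dma_addr_t mask = 0;
	int i;

	if (WARN_ON(!(pgsz_bitmap & GENMASK(PAGE_SHIFT, 0))))
		return 0;

	/* For VA-based HW, any low bit where the requested HW VA and
	 * the umem VA differ makes a 1:1 mapping at that granularity
	 * impossible, so it must reduce the page size.
	 */
	if (flags & IB_UMEM_VA_BASED_OFFSET)
		mask |= uvirt_addr ^ umem->address;

	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
		dma_addr_t start = sg_dma_address(sg);
		dma_addr_t end = start + sg_dma_len(sg);

		if (i == 0)
			mask |= end;		/* first SGE: end must align */
		else if (i == umem->nmap - 1)
			mask |= start;		/* last SGE: start must align */
		else
			mask |= start | end;	/* middle SGEs: both ends */
	}

	/* The lowest set bit of mask is the largest page size that can
	 * still tile the mapping; take the biggest supported size at
	 * or below it.
	 */
	if (mask)
		pgsz_bitmap &= GENMASK(__ffs(mask), 0);
	return pgsz_bitmap ? rounddown_pow_of_two(pgsz_bitmap) : 0;
}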
For the above example, if uvirt_addr = 0x6400FFF and umem->addr =
0x6400FFF, then because they are the same the xor is 0.

If uvirt_addr = 0 (ie ZERO_BASED) and umem->addr is 0x6400FFF, then the
function fails, because this HW cannot create a zero-based MR that is
not page aligned.

If uvirt_addr = 0xFFF (ie offset'd ZERO_BASED), then (0xFFF ^
0x6400FFF) = 0x6400000, setting the max HW page size to 4M (as
expected).

.. and a more reasonable case of, say, uvirt_addr = 0 and umem->addr =
0x271000: 0 ^ 0x271000 means the biggest page size is 4k, which is
correct.

Yes?

Jason
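The four cases above can be replayed with a small standalone program
(an illustrative sketch, not from the original thread; max_pgsz is a
made-up helper applying the lowest-set-bit rule):

/* Standalone check of the xor-mask examples above. */
#include <stdio.h>

/* The lowest set bit of mask bounds the usable HW page size. */
static unsigned long max_pgsz(unsigned long mask)
{
	return mask ? (mask & -mask) : ~0UL;	/* 0 means unrestricted */
}

int main(void)
{
	const struct {
		unsigned long uvirt, umem_addr;
		const char *note;
	} c[] = {
		{ 0x6400FFF, 0x6400FFF, "same VA: unrestricted" },
		{ 0x0,       0x6400FFF, "zero-based, unaligned: fails (max < 4k)" },
		{ 0xFFF,     0x6400FFF, "offset'd zero-based: 4M max" },
		{ 0x0,       0x271000,  "zero-based, 4k-aligned: 4k max" },
	};

	for (unsigned int i = 0; i < sizeof(c) / sizeof(c[0]); i++) {
		unsigned long mask = c[i].uvirt ^ c[i].umem_addr;

		printf("xor = 0x%07lx -> max page size 0x%lx (%s)\n",
		       mask, max_pgsz(mask), c[i].note);
	}
	return 0;
}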