On Wed, Nov 04, 2020 at 02:06:34PM -0800, Jianxin Xiong wrote:

> +	umem = ib_umem_dmabuf_get(&dev->ib_dev, offset, length, fd, access_flags,
> +				  &mlx5_ib_dmabuf_attach_ops);
> +	if (IS_ERR(umem)) {
> +		mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(umem));
> +		return ERR_PTR(PTR_ERR(umem));
> +	}
> +
> +	mr = alloc_mr_from_cache(pd, umem, virt_addr, access_flags);

It is very subtle, but this calls mlx5_umem_find_best_pgsz(), which calls
ib_umem_find_best_pgsz(), which walks the SGL to determine the page size
to use. As part of this it validates the IOVA against the first page
offset and the first page dma address.

These little details come into play if the IOVA and offset are not
PAGE_SIZE aligned, which is very possible if the dma-buf exporter or the
system PAGE_SIZE is over 4k. In other words, the dma_address of the first
SGE must be the page-aligned starting point of the MR.

Since the 'skip' approach is being done when breaking the SGL into blocks,
ib_umem_find_best_pgsz() sees an invalid page size. The slicing has to be
done in a way that gives a properly formed SGL.

My suggestion is to just change the SGL in place:

 - Iterate to the starting SGE in the SGL and assign it to the sg table;
   modify it to have an offset dma_address and a reduced length
 - Count the number of SGEs needed to span the remaining range and use
   that as the new nmap
 - Edit the last SGE to have a reduced length
 - Upon unmap, undo the edits so the exporter doesn't see the mangled SGL

It would be saner if the exporter could do this, but oh well.
Approximately like this:

	struct ib_umem *umem = &umem_p->umem;
	struct scatterlist *sg;
	unsigned long cur = 0;
	int i;

	for_each_sg(umem_p->umem.sg_head.sgl, sg, umem_p->umem.nmap, i) {
		if (!umem_p->first_sg &&
		    cur + sg_dma_len(sg) > ALIGN_DOWN(umem->address, PAGE_SIZE)) {
			unsigned long offset;

			umem_p->first_sg = sg;
			umem_p->first_dma_address = sg->dma_address;
			umem_p->first_dma_length = sg_dma_len(sg);
			umem_p->first_length = sg->length;
			offset = ALIGN_DOWN(umem->address, PAGE_SIZE) - cur;
			sg->dma_address += offset;
			sg_dma_len(sg) -= offset;
			sg->length -= offset;
			/* This SGE now starts at the aligned MR start */
			cur = ALIGN_DOWN(umem->address, PAGE_SIZE);
		}
		if (ALIGN(umem->address + umem->length, PAGE_SIZE) <
		    cur + sg_dma_len(sg)) {
			unsigned long trim;

			umem_p->last_sg = sg;
			umem_p->last_dma_length = sg_dma_len(sg);
			umem_p->last_length = sg->length;
			trim = cur + sg_dma_len(sg) -
			       ALIGN(umem->address + umem->length, PAGE_SIZE);
			sg_dma_len(sg) -= trim;
			sg->length -= trim;
			return npages;
		}
		cur += sg_dma_len(sg);
	}
	/*
	 * It is essential that the length of the SGL exactly match the
	 * adjusted page-aligned length of umem->length
	 */
	return -EINVAL;

Further, this really only works if the umem->page_size is locked to 4k,
because this doesn't have code to resize the MKEY or change the underlying
page size when the SGL changes.

So, I'd say put something like the above in the core code to validate and
properly form the umem->sgl.

Then modify alloc_mr_from_cache() to use only PAGE_SIZE:

	if (umem->is_dma_buf)
		page_size = ib_umem_find_best_pgsz(umem, PAGE_SIZE, iova);
	else
		page_size = mlx5_umem_find_best_pgsz(umem, mkc, log_page_size,
						     0, iova);

Jason