> -----Original Message----- > From: Jason Gunthorpe <jgg@xxxxxxxx> > Sent: Wednesday, November 04, 2020 4:07 PM > To: Xiong, Jianxin <jianxin.xiong@xxxxxxxxx> > Cc: linux-rdma@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Doug Ledford <dledford@xxxxxxxxxx>; Leon Romanovsky > <leon@xxxxxxxxxx>; Sumit Semwal <sumit.semwal@xxxxxxxxxx>; Christian Koenig <christian.koenig@xxxxxxx>; Vetter, Daniel > <daniel.vetter@xxxxxxxxx> > Subject: Re: [PATCH v7 4/5] RDMA/mlx5: Support dma-buf based userspace memory region > > On Wed, Nov 04, 2020 at 02:06:34PM -0800, Jianxin Xiong wrote: > > + umem = ib_umem_dmabuf_get(&dev->ib_dev, offset, length, fd, access_flags, > > + &mlx5_ib_dmabuf_attach_ops); > > + if (IS_ERR(umem)) { > > + mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(umem)); > > + return ERR_PTR(PTR_ERR(umem)); > > + } > > + > > + mr = alloc_mr_from_cache(pd, umem, virt_addr, access_flags); > > It is very subtle, but this calls mlx5_umem_find_best_pgsz() which calls ib_umem_find_best_pgsz() which goes over the SGL to determine > the page size to use. > When this is called here, the umem sglist is still NULL because dma_buf_map_attachment() is not called until a page fault occurs. In patch 1/5, the function ib_umem_find_best_pgsz() has been modified to always return PAGE_SIZE for dma-buf based MR. > As part of this it does validation of the IOVA vs first page offset vs first page dma address. These little details come into play if the IOVA and > offset are not PAGE_SIZE aligned, which is very possible if the dma buf exporter or system PAGE_SIZE is over 4k. > > In other words, the dma_address of the first SGL must be the page aligned starting point of the MR. Since the 'skip' approach is being done > when breaking the SGL into blocks the ib_umem_find_best_pgsz() sees an invalid page size. > > Slicing it has to be done in a way that gives a properly formed SGL. > > My suggestion is to just change the SGL in place. Iterate to the starting SGE in the SGL and assign it to the sg table, modify it to have a offset > dma_address and reduced length > > Count the number of SGEs to span the remaning range and use that as the new nmaped > > Edit the last SGE to have a reduced length Do you still think modifying the SGL in place needed given the above explanation? I do see some benefits of doing so -- hiding the discrepancy of sgl and addr/length from the device drivers and avoid special handling in the code that use the sgl. > > Upon unmap undo the edits so the exporter doesn't see the mangled SGL. > > It would be saner if the exporter could do this, but oh well. > > Approximately like this: > > struct ib_umem *umem = &umem_p->umem; > struct scatterlist *sg; > int i; > > for_each_sg(umem_p->umem.sg_head.sgl, sg, umem_p->umem.nmap, i) { > if (cur + sg_dma_len(sg) > ALIGN_DOWN(umem->address, PAGE_SIZE)) { > unsigned long offset; > > umem_p->first_sg = sg; > umem_p->first_dma_address = sg->dma_address; > umem_p->first_dma_length = sg_dma_len(sg); > umem_p->first_length = sg->length; > offset = ALIGN_DOWN(umem->addressm PAGE_SIZE) - cur; > sg->dma_address += offset; > sg_dma_len(sg) -= offset; > sg->length -= offset; > } > if (ALIGN(umem->address + umem->length, PAGE_SIZE) < cur + sg_dma_len(sg)) { > unsigned long trim; > > umem_p->last_sg = sg; > umem_p->last_dma_length = sg_dma_len(sg); > umem_p->last_length = sg->length; > trim = cur + sg_dma_len(sg) - ALIGN(umem->address + umem->length, PAGE_SIZE); > sg_dma_len(sg) -= trim; > sg->length -= trim; > return npages; > } > cur += sg_dma_len(sg); > } > /* It is essential that the length of the SGL exactly match > the adjusted page aligned length of umem->length */ > return -EINVAL; > > Further, this really only works if the umem->page_size is locked to 4k because this doesn't have code to resize the MKEY, or change the > underlying page size when the SGL changes. Yes, now it's locked to 4K. > > So, I'd say put something like the above in the core code to validate and properly form the umem->sgl > > Then modify the alloc_mr_from_cache to use only PAGE_SIZE: > > if (umem->is_dma_buf) > page_size = ib_umem_find_best_pgsz(umem, PAGE_SIZE, iova); else > page_size = mlx5_umem_find_best_pgsz(umem, mkc, log_page_size, 0, iova); > This should have been addressed in patch 1/5. Thanks, Jianxin > Jason