> -----Original Message-----
> From: Daniel Vetter <daniel@xxxxxxxx>
> Sent: Tuesday, October 06, 2020 2:22 AM
> To: Xiong, Jianxin <jianxin.xiong@xxxxxxxxx>
> Cc: Jason Gunthorpe <jgg@xxxxxxxx>; Leon Romanovsky <leon@xxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx;
> Doug Ledford <dledford@xxxxxxxxxx>; Vetter, Daniel <daniel.vetter@xxxxxxxxx>; Christian Koenig <christian.koenig@xxxxxxx>
> Subject: Re: [RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf as user memory region
>
> On Mon, Oct 05, 2020 at 04:18:11PM +0000, Xiong, Jianxin wrote:
> > > -----Original Message-----
> > > From: Jason Gunthorpe <jgg@xxxxxxxx>
> > > Sent: Monday, October 05, 2020 6:13 AM
> > > To: Xiong, Jianxin <jianxin.xiong@xxxxxxxxx>
> > > Cc: linux-rdma@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx;
> > > Doug Ledford <dledford@xxxxxxxxxx>; Leon Romanovsky
> > > <leon@xxxxxxxxxx>; Sumit Semwal <sumit.semwal@xxxxxxxxxx>; Christian
> > > Koenig <christian.koenig@xxxxxxx>; Vetter, Daniel
> > > <daniel.vetter@xxxxxxxxx>
> > > Subject: Re: [RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf
> > > as user memory region
> > >
> > > On Sun, Oct 04, 2020 at 12:12:28PM -0700, Jianxin Xiong wrote:
> > > > Dma-buf is a standard cross-driver buffer sharing mechanism that
> > > > can be used to support peer-to-peer access from RDMA devices.
> > > >
> > > > Device memory exported via dma-buf is associated with a file descriptor.
> > > > This is passed to the user space as a property associated with the
> > > > buffer allocation. When the buffer is registered as a memory
> > > > region, the file descriptor is passed to the RDMA driver along
> > > > with other parameters.
> > > >
> > > > Implement the common code for importing dma-buf object and mapping
> > > > dma-buf pages.
> > > >
> > > > Signed-off-by: Jianxin Xiong <jianxin.xiong@xxxxxxxxx>
> > > > Reviewed-by: Sean Hefty <sean.hefty@xxxxxxxxx>
> > > > Acked-by: Michael J. Ruhl <michael.j.ruhl@xxxxxxxxx>
> > > > ---
> > > >  drivers/infiniband/core/Makefile      |   2 +-
> > > >  drivers/infiniband/core/umem.c        |   4 +
> > > >  drivers/infiniband/core/umem_dmabuf.c | 291 ++++++++++++++++++++++++++++++++++
> > > >  drivers/infiniband/core/umem_dmabuf.h |  14 ++
> > > >  drivers/infiniband/core/umem_odp.c    |  12 ++
> > > >  include/rdma/ib_umem.h                |  19 ++-
> > > >  6 files changed, 340 insertions(+), 2 deletions(-)
> > > >  create mode 100644 drivers/infiniband/core/umem_dmabuf.c
> > > >  create mode 100644 drivers/infiniband/core/umem_dmabuf.h
> > >
> > > I think this is using ODP too literally, dmabuf isn't going to need
> > > fine grained page faults, and I'm not sure this locking scheme is OK - ODP is horrifically complicated.
> > >
> > > If this is the approach then I think we should make dmabuf its own
> > > stand alone API, reg_user_mr_dmabuf()
> >
> > That's the original approach in the first version. We can go back there.
> >
> > >
> > > The implementation in mlx5 will be much more understandable, it
> > > would just do dma_buf_dynamic_attach() and program the XLT exactly the same as a normal umem.
> > >
> > > The move_notify() simply zaps the XLT and triggers a work to reload
> > > it after the move. Locking is provided by the dma_resv_lock. Only a small disruption to the page fault handler is needed.
> > >
> >
> > We considered such a scheme but didn't go that way due to the lack of
> > notification when the move is done and thus the work wouldn't know
> > when it can reload.
> >
> > Now that I think about it again, we could probably signal the reload in the page fault handler.
>
> For reinstating the pages you need:
>
> - dma_resv_lock, this prevents anyone else from issuing new moves or
>   anything like that
> - dma_resv_get_excl + dma_fence_wait to wait for any pending moves to
>   finish. gpus generally don't wait on the cpu, but block the dependent
>   dma operations from being scheduled until that fence has fired. But for rdma
>   odp I think you need the cpu wait in your worker here.
> - get the new sg list, write it into your ptes
> - dma_resv_unlock to make sure you're not racing with a concurrent
>   move_notify
>
> You can also grab multiple dma_resv_locks atomically, but I think the odp rdma model doesn't require that (gpus need that).
>
> Note that you're allowed to allocate memory with GFP_KERNEL while holding dma_resv_lock, so this shouldn't impose any issues. You are
> otoh not allowed to cause userspace faults (so no gup/pup or copy*user with faulting enabled). So all in all this shouldn't be any worse than
> calling pup for normal umem.
>
> Unlike mmu notifier the caller holds dma_resv_lock already for you around the move_notify callback, so you shouldn't need any additional
> locking in there (aside from what you need to zap the ptes and flush hw tlbs).
>
> Cheers, Daniel
>

Hi Daniel, thanks for providing the details. I would have missed the dma_resv_get_excl + dma_fence_wait part otherwise.

> >
> > > > +    dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
> > > > +    sgt = dma_buf_map_attachment(umem_dmabuf->attach,
> > > > +                                 DMA_BIDIRECTIONAL);
> > > > +    dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
> > >
> > > This doesn't look right, this lock has to be held up until the HW is
> > > programmed
> >
> > The mapping remains valid until being invalidated again. There is a sequence number checking before programming the HW.
> >
> > >
> > > The use of atomic looks probably wrong as well.
> >
> > Do you mean umem_dmabuf->notifier_seq? Could you elaborate on the concern?
> >
> > >
> > > > +    k = 0;
> > > > +    total_pages = ib_umem_odp_num_pages(umem_odp);
> > > > +    for_each_sg(umem->sg_head.sgl, sg, umem->sg_head.nents, j) {
> > > > +        addr = sg_dma_address(sg);
> > > > +        pages = sg_dma_len(sg) >> page_shift;
> > > > +        while (pages > 0 && k < total_pages) {
> > > > +            umem_odp->dma_list[k++] = addr | access_mask;
> > > > +            umem_odp->npages++;
> > > > +            addr += page_size;
> > > > +            pages--;
> > >
> > > This isn't fragmenting the sg into a page list properly, won't work
> > > for unaligned things
> >
> > I thought the addresses are aligned, but will add explicit alignment here.
> >
> > >
> > > And really we don't need the dma_list for this case, with a fixed
> > > whole mapping DMA SGL a normal umem sgl is OK and the normal umem XLT programming in mlx5 is fine.
> >
> > The dma_list is used by both "populate_mtt()" and "mlx5_ib_invalidate_range", which are used for XLT programming and invalidating
> > (zapping), respectively.
> >
> > >
> > > Jason
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@xxxxxxxxxxxxxxxxxxxxx
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
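
On the sg fragmentation comment above: a rough user-space sketch of what "explicit alignment" could look like when turning DMA segments into a page list. It is not the patch code. struct dma_seg and segs_to_pages() are made-up names standing in for the scatterlist entry and the dma_list fill loop, and the access_mask bits are left out. The point is only that the page count has to come from the aligned start and end of each segment, not from len >> page_shift, which drops the partial page of an unaligned segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct dma_seg {
	uint64_t addr;	/* what sg_dma_address() would return */
	uint64_t len;	/* what sg_dma_len() would return */
};

/*
 * Expand each segment into the page-aligned addresses that cover it.
 * Returns the number of entries written, never more than max_pages.
 */
static size_t segs_to_pages(const struct dma_seg *segs, size_t nsegs,
			    unsigned int page_shift,
			    uint64_t *pages, size_t max_pages)
{
	assert(page_shift < 64);

	const uint64_t page_size = 1ULL << page_shift;
	const uint64_t page_mask = ~(page_size - 1);
	size_t k = 0;

	for (size_t i = 0; i < nsegs; i++) {
		if (segs[i].len == 0)
			continue;

		/* Round the start down and find the page holding the last byte. */
		uint64_t cur = segs[i].addr & page_mask;
		uint64_t last = (segs[i].addr + segs[i].len - 1) & page_mask;

		/* Stop early if the output array is full. */
		for (; k < max_pages; cur += page_size) {
			pages[k++] = cur;
			if (cur == last)
				break;
		}
	}
	return k;
}
```

With a 4 KiB page size, a segment at 0x1800 of length 0x1000 touches two pages (0x1000 and 0x2000), while len >> page_shift would count only one. Whether the driver wants covering pages like this or should reject unaligned segments outright is a separate question.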