> -----Original Message----- > From: Jason Gunthorpe <jgg@xxxxxxxx> > Sent: Thursday, October 15, 2020 5:54 PM > To: Xiong, Jianxin <jianxin.xiong@xxxxxxxxx> > Cc: linux-rdma@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Doug Ledford <dledford@xxxxxxxxxx>; Leon Romanovsky > <leon@xxxxxxxxxx>; Sumit Semwal <sumit.semwal@xxxxxxxxxx>; Christian Koenig <christian.koenig@xxxxxxx>; Vetter, Daniel > <daniel.vetter@xxxxxxxxx> > Subject: Re: [PATCH v5 4/5] RDMA/mlx5: Support dma-buf based userspace memory region > //snip > > +static int mlx5_ib_umem_dmabuf_xlt_init(struct ib_umem *umem, void > > +*context) { > > + struct mlx5_ib_mr *mr = context; > > + int flags = MLX5_IB_UPD_XLT_ENABLE; > > + > > + if (!mr) > > + return -EINVAL; > > + > > + return mlx5_ib_update_xlt(mr, 0, mr->npages, PAGE_SHIFT, flags); } > > > +static int mlx5_ib_umem_dmabuf_xlt_update(struct ib_umem *umem, void > > +*context) { > > + struct mlx5_ib_mr *mr = context; > > + int flags = MLX5_IB_UPD_XLT_ATOMIC; > > Why are these atomic? Why the strange coding style of declaring a variable? The atomic flag is copied from the odp code. I have verified that it is not necessary. I can remove the definition of 'flags' here. > > > + if (!mr) > > + return -EINVAL; > > Why can this happen? Will dma_buf call move_notify prior to dma_buf_map_attachment? There are locking problems if that happens. I agree this check is unnecessary. > > > + return mlx5_ib_update_xlt(mr, 0, mr->npages, PAGE_SHIFT, flags); } > > + > > +static int mlx5_ib_umem_dmabuf_xlt_invalidate(struct ib_umem *umem, > > +void *context) { > > + struct mlx5_ib_mr *mr = context; > > + int flags = MLX5_IB_UPD_XLT_ZAP | MLX5_IB_UPD_XLT_ATOMIC; > > + > > + if (!mr) > > + return -EINVAL; > > + > > + return mlx5_ib_update_xlt(mr, 0, mr->npages, PAGE_SHIFT, flags); } > > + > > +static struct ib_umem_dmabuf_ops mlx5_ib_umem_dmabuf_ops = { > > + .init = mlx5_ib_umem_dmabuf_xlt_init, > > + .update = mlx5_ib_umem_dmabuf_xlt_update, > > + .invalidate = mlx5_ib_umem_dmabuf_xlt_invalidate, > > +}; > > I'm not really convinced these should be ops, this is usually a bad design pattern. > > Why do I need so much code to extract the sgl from the dma_buf? I would prefer the dma_buf layer simplify this, not by adding a wrapper > around it in the IB core code... > We just need a way to call a device specific function to update the NIC's translation table. I considered three ways: (1) ops registered with ib_umem_get_dmabuf; (2) a single function pointer registered with ib_umem_get_dmabuf; (3) a method in 'struct ib_device'. Option (1) was chosen here with no strong reason. We could consolidate the three functions of the ops into one, but then we will need to define commands or flags for different update operations. > > +struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 start, > > + u64 length, u64 virt_addr, > > + int dmabuf_fd, int access_flags, > > + struct ib_udata *udata) > > +{ > > + struct mlx5_ib_dev *dev = to_mdev(pd->device); > > + struct mlx5_ib_mr *mr = NULL; > > + struct ib_umem *umem; > > + int page_shift; > > + int npages; > > + int ncont; > > + int order; > > + int err; > > + > > + if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM)) > > + return ERR_PTR(-EOPNOTSUPP); > > + > > + mlx5_ib_dbg(dev, > > + "start 0x%llx, virt_addr 0x%llx, length 0x%llx, fd %d, access_flags 0x%x\n", > > + start, virt_addr, length, dmabuf_fd, access_flags); > > + > > + if (!mlx5_ib_can_load_pas_with_umr(dev, length)) > > + return ERR_PTR(-EINVAL); > > + > > + umem = ib_umem_dmabuf_get(&dev->ib_dev, start, length, dmabuf_fd, > > + access_flags, &mlx5_ib_umem_dmabuf_ops); > > + if (IS_ERR(umem)) { > > + mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(umem)); > > + return ERR_PTR(PTR_ERR(umem)); > > + } > > + > > + npages = ib_umem_num_pages(umem); > > + if (!npages) { > > + mlx5_ib_warn(dev, "avoid zero region\n"); > > + ib_umem_release(umem); > > + return ERR_PTR(-EINVAL); > > + } > > + > > + page_shift = PAGE_SHIFT; > > + ncont = npages; > > + order = ilog2(roundup_pow_of_two(ncont)); > > We still need to deal with contiguity here, this ncont/npages is just obfuscation. Since the pages can move, we can't take advantage of contiguity here. This handling is similar to the ODP case. The variables 'ncont' and 'page_shift' here are not necessary. They are kept just for the sake of signifying the semantics of the following functions that use them. > > I have a patch series that should get posted soon rewriting all of this stuff.. Great. This can be updated accordingly. > > > + mlx5_ib_dbg(dev, "npages %d, ncont %d, order %d, page_shift %d\n", > > + npages, ncont, order, page_shift); > > + > > + mr = alloc_mr_from_cache(pd, umem, virt_addr, length, ncont, > > + page_shift, order, access_flags); > > + if (IS_ERR(mr)) > > + mr = NULL; > > + > > + if (!mr) { > > + mutex_lock(&dev->slow_path_mutex); > > + mr = reg_create(NULL, pd, virt_addr, length, umem, ncont, > > + page_shift, access_flags, false); > > + mutex_unlock(&dev->slow_path_mutex); > > + } > > + > > + if (IS_ERR(mr)) { > > + err = PTR_ERR(mr); > > + goto error; > > + } > > + > > + mlx5_ib_dbg(dev, "mkey 0x%x\n", mr->mmkey.key); > > + > > + mr->umem = umem; > > + set_mr_fields(dev, mr, npages, length, access_flags); > > After another series I have there will be three copies of this sequence :\ Maybe this can be made into a utility function? > > > + err = ib_umem_dmabuf_init_mapping(umem, mr); > > + if (err) { > > + dereg_mr(dev, mr); > > + return ERR_PTR(err); > > + } > > Did you test the page fault path at all? Looks like some xarray code is missing here, and this is also missing the related complex teardown > logic. > > Does this mean you didn't test the pagefault_dmabuf_mr() at all? Thanks for the hint. I was unable to get the test runs reaching the pagefault_dmabuf_mr() function. Now I have found the reason. Along the path of all the page fault handlers, the array "odp_mkeys" is checked against the mr key. Since the dmabuf mr keys are not in the list the handler is never called. On the other hand, it seems that pagefault_dmabuf_mr() is not needed at all. The pagefault is gracefully handled by retrying until the work thread finished programming the NIC. > > Jason