On Tue, June 13, 2023 1:23 AM Jason Gunthorpe wrote: > On Thu, May 18, 2023 at 05:21:51PM +0900, Daisuke Matsuda wrote: > > > +/* umem mutex must be locked before entering this function. */ > > +static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 flags) > > +{ > > + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem); > > + const int max_tries = 3; > > + int cnt = 0; > > + > > + int err; > > + u64 perm; > > + bool need_fault; > > + > > + if (unlikely(length < 1)) { > > + mutex_unlock(&umem_odp->umem_mutex); > > + return -EINVAL; > > + } > > + > > + perm = ODP_READ_ALLOWED_BIT; > > + if (!(flags & RXE_PAGEFAULT_RDONLY)) > > + perm |= ODP_WRITE_ALLOWED_BIT; > > + > > + /* > > + * A successful return from rxe_odp_do_pagefault() does not guarantee > > + * that all pages in the range became present. Recheck the DMA address > > + * array, allowing max 3 tries for pagefault. > > + */ > > + while ((need_fault = rxe_is_pagefault_neccesary(umem_odp, > > + iova, length, perm))) { > > + if (cnt >= max_tries) > > + break; > > I don't think this makes sense.. I have posted the new implementation, but it is somewhat different from your suggestion. The reason is as below. > > You need to make this work more like mlx5 does, you take the spinlock > on the xarray, you search it for your index and whatever is there > tells what to do. Hold the spinlock while doing the copy. This is > enough locking for the fast path. This lock should be umem mutex. Actual page-out is executed in the kernel after rxe_ib_invalidate_range() is done. It is possible the spinlock does not block this flow if it is locked while the invalidation is running. The umem mutex can serialize these accesses. > > If there is no index present, or it is not writable and you need > write, then you unlock the spinlock and prefetch the missing entry and > try again, this time also holding the mutex so there isn't a race. > > You shouldn't probe into parts of the umem to discover information > already stored in the xarray then do the same lookup into the xarray. > > IIRC this also needs to keep track in the xarray on a per page basis > if the page is writable. An xarray entry can hold a pointer or a value from 0 to LONG_MAX. That is not enough to store page address and its permission. If we try to do everything with xarray, we need to allocate a new struct for each page that holds a pointer to a page and a value to store r/w permission. That is inefficient in terms of memory usage and implementation. I think the xarray can be used to check presence of pages just like we have been doing in the non-ODP case. On the other hand, the permission should be fetched from umem_odp->pfn_list, which is updated everytime page fault is executed. Thanks, Daisuke > > Jason