Jason Gunthorpe <jgg@xxxxxxxx> writes: > On Thu, May 09, 2024 at 11:31:33AM -0600, Logan Gunthorpe wrote: >> Hi Jason, >> >> We've become interested again in enabling P2PDMA transactions with >> userspace RDMA and the NVMe CMBs we are already exporting to userspace >> from our previous work. >> >> Enabling FOLL_PCI_P2PDMA in ib_umem_get is almost a trivial change, but >> there are two issues holding us back. >> >> The biggest issue is that we disallowed FOLL_LONGTERM with >> FOLL_PCI_P2PDMA out of concern that P2PDMA had the same problem as >> fs-dax. > > Yeah, it was not a great outcome of that issue. I recently came across this trying to do something similar with io_uring to mapped P2PDMA pages for testing purposes. I commented out the FOLL_PCI_P2PDMA check but it hung/stalled during process teardown so there might be something there. >> See [1] to review the discussion from 2 years ago. However, in >> trying to understand the problem again, I'm not sure that concern was >> valid. In P2PDMA, unmap_mapping_range() is strictly only called on >> driver unbind when everything is getting torn down[2]. The next thing >> that happens immediately after the unmap is the tear down of the pgmap >> which drops the elevated reference on the pages and waits for all page's >> reference counts to go back to zero. This will effectively wait until >> all longterm pins involving the memory have been released. This can >> cause a hang on unbind but, in your words, its "annoying not critical". > > Yes > > But you are looking at the code as it is right now, and stuff has been > quitely fixed with the pgmap refcount area since. I think it is > probably good now. IIRC it was pushed over the finish line when the > ZONE_DEVICE/PRIVATE pages were converted to have normal reference > counting. But P2P DMA pages are still in the off-by-one reference counting scheme. My RFR series[1] (aims) to fix that, although the pgmap refcounting scheme needs a closer look because I think doing the page refcounting properly allows some cleanups there. [1] - https://lore.kernel.org/linux-mm/cover.fe275e9819458a4bbb9451b888cafb88af8867d4.1712796818.git-series.apopple@xxxxxxxxxx/ > If p2p is following the new ZONE_DEVICE scheme then it should be fine. > > It would be good to read over Alistair's latest series fixing up fsdax > refcounts to see if anything pops out as problematic specifically with > the P2P case. > > Otherwise a careful check through is probably all that is needed. > >> The other issue we hit when enabling this feature is the check for >> vma_needs_dirty_tracking() in writable_file_mapping_allowed() during the >> gup flow. This hits because the p2pdma code is using the common >> sysfs/kernfs infrastructure to create the VMA which installs a >> page_mkwrite operator()[4] to change the file update time on write. > > Ah. > >> I don't think this feature really makes any sense for the P2PDMA >> sysfs file which is really operating as an allocator in userspace -- >> the time on the file does not really need to reflect the last write >> of some process that wrote to memory allocated using it. > > Right, you shouldn't have mkwrite for these pages. > > Jason