On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara <jack@xxxxxxx> wrote:
> On Sun 13-08-17 13:31:45, Dan Williams wrote:
>> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch@xxxxxx> wrote:
>> > That being said I think we absolutely should support RDMA memory
>> > registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE
>> > helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure
>> > all the blocks are populated and all ptes are set up. Second we need
>> > to make sure get_user_pages works, which for now means we'll need a
>> > struct page mapping for the region (which will be really annoying
>> > for PCIe mappings, like the upcoming NVMe persistent memory region),
>> > and we need to guarantee that the extent mapping won't change while
>> > get_user_pages holds the pages inside it. I think that is true
>> > due to side effects even with the current DAX code, but we'll need to
>> > make it explicit. And maybe that's where we need to converge -
>> > "sealing" the extent map makes sense as such a temporary measure
>> > that is not persisted on disk, which automatically gets released
>> > when the holding process exits, because we sort of already do this
>> > implicitly. It might also make sense to have explicitly breakable
>> > seals similar to what I do for the pNFS block kernel server, as
>> > any userspace RDMA file server would also need those semantics.
>>
>> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
>>
>> 1/ only succeed if the fault can be satisfied without page cache
>>
>> 2/ only install a pte for the fault if it can do so without
>> triggering block map updates
>>
>> So, I think it would still end up setting an inode flag to make
>> xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
>> active. However, it would not record that state in the on-disk
>> metadata and it would automatically clear at munmap time. That should
>> be enough to support the host-persistent-memory and
>> NVMe-persistent-memory use cases (provided we have struct page for
>> NVMe). Although, we need more safety infrastructure in the NVMe case
>> where we would need to software-manage I/O coherence.
>
> Hum, this proposal (and the problems you are trying to deal with) seems very
> similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to
> the DAX area (and so additionally complicated by the fact that filesystems
> now have to care). The patch set was not merged due to lack of interest I
> think, but it looked sensible and the proposed API would make sense for more
> stuff than just DAX, so maybe it would be better than a MAP_DIRECT flag?

Interesting, but I'm not sure I see the correlation. mm_mpin() makes a
"no-fault" guarantee and fixes the accounting of locked System RAM.
MAP_DIRECT still allows faults, and DAX mappings don't consume System
RAM, so the accounting problem is not there for DAX. mm_mpin() also does
not appear to have a relationship to file-backed memory the way mmap
does.
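
To make the intended flow concrete, here is a rough userspace sketch of
what I have in mind, assuming MAP_DIRECT were merged as an mmap flag.
The flag values below are placeholders: MAP_DIRECT does not exist today,
the MAP_SYNC value is an assumption rather than settled ABI, and how
MAP_SYNC interacts with MAP_SHARED validation is still open.

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MAP_SYNC
	#define MAP_SYNC	0x80000		/* assumed value, not settled ABI */
	#endif
	#ifndef MAP_DIRECT
	#define MAP_DIRECT	0x100000	/* placeholder, proposal only */
	#endif

	int main(int argc, char **argv)
	{
		size_t len = 2UL << 20;		/* e.g. one 2MB DAX extent */
		void *addr;
		int fd;

		if (argc < 2)
			return 1;

		fd = open(argv[1], O_RDWR);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/*
		 * MAP_POPULATE pre-faults the range so all ptes are set up.
		 * The proposed MAP_DIRECT semantics would then guarantee that
		 * any later fault either avoids the page cache and block map
		 * updates, or fails, for the lifetime of the mapping.
		 */
		addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_SHARED | MAP_POPULATE | MAP_SYNC | MAP_DIRECT,
			    fd, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			close(fd);
			return 1;
		}

		/* ... hand [addr, addr + len) to ibv_reg_mr() or similar ... */

		munmap(addr, len);	/* dropping the mapping releases the "seal" */
		close(fd);
		return 0;
	}

The window between mmap() and munmap() above is where the extent map
would effectively be sealed, i.e. the same window in which an RDMA
memory registration would be holding get_user_pages references.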