On Sat, Aug 12, 2017 at 12:19:50PM -0700, Dan Williams wrote:
> The application does not need to know the storage address, it needs to
> know that the storage address to file offset is fixed. With this
> information it can make assumptions about the permanence of results it
> gets from the kernel.

Only if we clearly document that fact - and documenting the permanence
is different from saying the block map won't change.

> For example get_user_pages() today makes no guarantees outside of
> "page will not be freed",

It also makes the extremely important guarantee that the page won't
_move_ - e.g. that we won't do a memory migration for compaction or
other reasons.  That's why, for example, RDMA can use it to register
memory, and why we can later set up memory windows that point to this
registration from userspace and implement userspace RDMA.

> but with immutable files and dax you now
> have a mechanism for userspace to coordinate direct access to storage
> addresses. Those raw storage addresses need not be exposed to the
> application, as you say it doesn't need to know that detail. MAP_SYNC
> does not fully satisfy this case because it requires agents that can
> generate MMU faults to coordinate with the filesystem.

The file system is always in the fault path.  Can you explain what
other agents you are talking about?

> All I know is that SMB Direct for persistent memory seems like a
> potential consumer. I know they're not going to use a userspace
> filesystem or put an SMB server in the kernel.

Last I talked to the Samba folks they didn't expect a userspace SMB
Direct implementation to work anyway, because libibverbs memory
registrations interact badly with their fork()ing daemon model.  That
being said, during the recent submission of the RDMA client code some
comments were made about userspace versions of it, so I'm not sure
whether that opinion has changed one way or another.

That being said, I think we absolutely should support RDMA memory
registrations for DAX mappings.  I'm just not sure how
S_IOMAP_IMMUTABLE helps with that.  First, we'll want a
MAP_SYNC | MAP_POPULATE to make sure all the blocks are populated and
all ptes are set up.  Second, we need to make sure get_user_pages
works, which for now means we'll need a struct page mapping for the
region (which will be really annoying for PCIe mappings, like the
upcoming NVMe persistent memory region), and we need to guarantee that
the extent mapping won't change while get_user_pages holds the pages
inside it.  I think that is true due to side effects even with the
current DAX code, but we'll need to make it explicit.

And maybe that's where we need to converge - "sealing" the extent map
makes sense as such a temporary measure that is not persisted on disk
and automatically gets released when the holding process exits,
because we sort of already do this implicitly.  It might also make
sense to have explicitly breakable seals similar to what I do for the
pNFS blocks kernel server, as any userspace RDMA file server would
also need those semantics.

Last but not least, we have an interesting additional case for modern
Mellanox hardware - On Demand Paging, where we don't actually do a
get_user_pages but the hardware implements SVM and thus gets fed
virtual addresses directly.  My head spins when talking about the
implications for DAX mappings on that, so I'm just throwing that in
for now instead of trying to come up with a solution.
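
For reference, the MAP_SYNC | MAP_POPULATE step mentioned above could
look roughly like the sketch below from userspace.  This is only an
illustration under assumptions: it uses the MAP_SYNC interface in the
form that later landed in mainline (where it must be combined with
MAP_SHARED_VALIDATE), the fallback flag values are the x86
asm-generic ones, and map_sync_populated() is a made-up helper name.
MAP_POPULATE is best effort, so this alone does not guarantee that
every pte is set up, and it says nothing about keeping the extent map
stable afterwards.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values assumed from asm-generic (x86). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical helper: map the first len bytes of a file with
 * MAP_SYNC | MAP_POPULATE.  Returns MAP_FAILED with errno set on error;
 * on a file system without DAX the kernel is expected to refuse the
 * mapping (typically EOPNOTSUPP) instead of silently dropping MAP_SYNC.
 */
static void *map_sync_populated(const char *path, size_t len)
{
	int fd = open(path, O_RDWR);
	if (fd < 0)
		return MAP_FAILED;

	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED_VALIDATE | MAP_SYNC | MAP_POPULATE,
			  fd, 0);

	/* The mapping keeps its own reference; preserve errno across close(). */
	int saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return addr;
}
```

A caller would then hand the returned range to the RDMA memory
registration, which is where the get_user_pages requirements above
come into play.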