[... figuring out how g_u_p() references can prevent freeing and
re-using the underlying mapped pmem addresses given the lack of struct
pages for the mapping]

> I see three solutions here:
>
> 1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
> the caller with the struct pages of the DRAM.  Modify DAX to handle some
> file pages being in the page cache, and make sure that we know whether
> the PMEM or DRAM is up to date.  This has the obvious downside that
> get_user_pages() becomes slow.

And serialize transitions and fs stores to pmem regions.  And now
storing to dram-fronted pmem goes through all the dirtying and
writeback machinery.  This sounds like a nightmare to me, to be honest.

> 2. Modify filesystems that support DAX to handle pinning blocks.
> Some filesystems (that support COW and snapshots) already support
> reference-counting individual blocks.  We may be able to do better by
> using a tree of pinned extents or something.  This makes it much harder
> to modify a filesystem to support DAX, and I don't see patches adding
> this capability to ext2 being warmly welcomed.

This seems.. doable?  Recording the referenced pmem in free lists in
the fs is fine as long as the pmem isn't modified until the references
are released, right?

Maybe in the allocator you skip otherwise free blocks if they intersect
with the run-time structure (rbtree of extents, presumably) that is
taking the place of reference counts in struct page.  There aren't
*that* many allocator entry points.  I guess you'd need to avoid other
modifications of free space like trimming :/.  It still seems
reasonably doable?

And hey, lord knows we love to implement rbtrees of extents in file
systems!  (btrfs: struct extent_state, ext4: struct extent_status)

The tricky part would be maintaining that structure behind g_u_p() and
put_page() calls.  That probably needs a richer interface that gives
callers something more than just raw page pointers.

> 3. Make truncate() block if it hits a pinned page.
> There's really no good reason to truncate a file that has pinned
> pages; it's either a bug or you're trying to be nasty to someone.
> We actually already have code for this; inode_dio_wait() /
> inode_dio_done().  But page pinning isn't just for O_DIRECT I/Os and
> other transient users like crypto, it's also for long-lived things
> like RDMA, where we could potentially block for an indefinite time.

I have no concrete examples, but I agree that it sounds like the sort
of thing that would bite us in the ass if we miss some use case :/.

I guess my initial vote is for trying a less-than-perfect prototype of
#2 to see just how hairy the rough outline gets.

- z
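
To make the bookkeeping in #2 a bit more concrete, here is a minimal
userspace sketch, not kernel code.  Everything in it is hypothetical:
the names (struct pin_set, pin_extent(), unpin_extent(),
range_is_pinned(), alloc_skip_pinned()) are made up for illustration, a
sorted linked list stands in for the rbtree of extents, there is no
locking, and pins are only merged when they cover exactly the same
range.  It only shows the shape of the idea: pins are recorded as
refcounted extents, and the allocator skips otherwise free blocks that
intersect one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: one pinned run of pmem blocks, [start, start + len). */
struct pinned_extent {
	uint64_t start;
	uint64_t len;
	unsigned int refs;		/* stands in for the struct page refcount */
	struct pinned_extent *next;	/* sorted by start; a real version would use an rbtree */
};

struct pin_set {
	struct pinned_extent *head;
};

/* Take a reference on [start, start + len).  Returns 0, or -1 on ENOMEM. */
int pin_extent(struct pin_set *set, uint64_t start, uint64_t len)
{
	struct pinned_extent **pp = &set->head;
	struct pinned_extent **q, *e;

	assert(len > 0);
	while (*pp && (*pp)->start < start)
		pp = &(*pp)->next;
	/* Reuse an existing entry only if it covers exactly the same range. */
	for (q = pp; *q && (*q)->start == start; q = &(*q)->next) {
		if ((*q)->len == len) {
			(*q)->refs++;
			return 0;
		}
	}
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->start = start;
	e->len = len;
	e->refs = 1;
	e->next = *pp;
	*pp = e;
	return 0;
}

/* Drop one reference.  Returns 0, or -1 if the range was never pinned. */
int unpin_extent(struct pin_set *set, uint64_t start, uint64_t len)
{
	struct pinned_extent **pp;

	for (pp = &set->head; *pp; pp = &(*pp)->next) {
		struct pinned_extent *e = *pp;

		if (e->start == start && e->len == len) {
			if (--e->refs == 0) {
				*pp = e->next;
				free(e);
			}
			return 0;
		}
	}
	return -1;
}

/* Does [start, start + len) intersect any pinned extent? */
bool range_is_pinned(const struct pin_set *set, uint64_t start, uint64_t len)
{
	const struct pinned_extent *e;

	/* The list is sorted by start, so stop once entries begin past the range. */
	for (e = set->head; e && e->start < start + len; e = e->next)
		if (start < e->start + e->len)
			return true;
	return false;
}

/* Allocator side: first block that is free and not covered by a pin, or -1. */
int64_t alloc_skip_pinned(const struct pin_set *set, const bool *is_free,
			  uint64_t nblocks)
{
	uint64_t b;

	for (b = 0; b < nblocks; b++)
		if (is_free[b] && !range_is_pinned(set, b, 1))
			return (int64_t)b;
	return -1;
}
```

The point of the sketch is that the fs free-space records can already
say "free" while the pin set keeps the allocator away from those blocks
until the last unpin_extent().  Trimming and any other free-space
modification would need the same range_is_pinned() check.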