On Wed, Oct 08, 2014 at 04:21:32PM -0700, Zach Brown wrote:
> [... figuring out how g_u_p() references can prevent freeing and
> re-using the underlying mapped pmem addresses given the lack of struct
> pages for the mapping]
>
> > I see three solutions here:
> >
> > 1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
> > the caller with the struct pages of the DRAM.  Modify DAX to handle some
> > file pages being in the page cache, and make sure that we know whether
> > the PMEM or DRAM is up to date.  This has the obvious downside that
> > get_user_pages() becomes slow.
>
> And serialize transitions and fs stores to pmem regions.  And now
> storing to dram-fronted pmem goes through all the dirtying and writeback
> machinery.  This sounds like a nightmare to me, to be honest.

That's not so bad ... it's just normal page-cache stuff, really.  It'd be
per-page serialisation, just like the current gunk we go through to get
sparse loads to not allocate backing store.

> > 2. Modify filesystems that support DAX to handle pinning blocks.
> > Some filesystems (that support COW and snapshots) already support
> > reference-counting individual blocks.  We may be able to do better by
> > using a tree of pinned extents or something.  This makes it much harder
> > to modify a filesystem to support DAX, and I don't see patches adding
> > this capability to ext2 being warmly welcomed.
>
> This seems.. doable?  Recording the referenced pmem in free lists in the
> fs is fine as long as the pmem isn't modified until the references are
> released, right?

As long as it's not *allocated* to anything else (which seems to be what
you're actually saying in the next paragraph).

> Maybe in the allocator you skip otherwise free blocks if they intersect
> with the run time structure (rbtree of extents, presumably) that is
> taking the place of reference counts in struct page.  There aren't
> *that* many allocator entry points.  I guess you'd need to avoid other
> modifications of free space like trimming :/.  It still seems reasonably
> doable?

Ah, so on reboot, the on-disk data structures are all correct, and the
in-memory data structures went away with the runtime pinning of the
memory.  Nice.

> And hey, lord knows we love to implement rbtrees of extents in file
> systems!  (btrfs: struct extent_state, ext4: struct extent_status)
>
> The tricky part would be maintaining that structure behind g_u_p() and
> put_page() calls.  Probably a richer interface that gives callers
> something more than just raw page pointers.
>
> > 3. Make truncate() block if it hits a pinned page.  There's really no
> > good reason to truncate a file that has pinned pages; it's either a bug
> > or you're trying to be nasty to someone.  We actually already have code
> > for this; inode_dio_wait() / inode_dio_done().  But page pinning isn't
> > just for O_DIRECT I/Os and other transient users like crypto, it's also
> > for long-lived things like RDMA, where we could potentially block for
> > an indefinite time.
>
> I have no concrete examples, but I agree that it sounds like the sort of
> thing that would bite us in the ass if we miss some use case :/.
>
> I guess my initial vote is for trying a less-than-perfect prototype of
> #2 to see just how hairy the rough outline gets.

Thinking about it now, it seems less hairy than I initially thought.
I'll give it a quick try and see how it goes.
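
For illustration only, here is a rough user-space sketch of the option 2 idea
discussed above: the allocator skips blocks that are free on disk but still
intersect a runtime-only set of pinned extents. It is not kernel code. A flat
array stands in for the rbtree of extents, and every name in it (pin_extent,
unpin_extent, alloc_block, block_free, NBLOCKS, MAX_PINS) is made up for the
example rather than taken from any existing filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NBLOCKS  16   /* hypothetical device size, in blocks */
#define MAX_PINS 8    /* hypothetical cap on concurrently pinned extents */

/* Runtime-only record of a pinned block range; never written to disk. */
struct pinned_extent {
	uint64_t start;
	uint64_t len;
	unsigned refs;   /* 0 means the slot is unused */
};

static struct pinned_extent pins[MAX_PINS];
static bool block_free[NBLOCKS];   /* stand-in for the on-disk free map */

/* Take a reference on [start, start + len). Returns a slot, or -1 if full. */
static int pin_extent(uint64_t start, uint64_t len)
{
	for (int i = 0; i < MAX_PINS; i++) {
		if (pins[i].refs && pins[i].start == start && pins[i].len == len) {
			pins[i].refs++;
			return i;
		}
	}
	for (int i = 0; i < MAX_PINS; i++) {
		if (!pins[i].refs) {
			pins[i] = (struct pinned_extent){ start, len, 1 };
			return i;
		}
	}
	return -1;
}

/* Drop one reference; the extent stops blocking allocation at zero. */
static void unpin_extent(int slot)
{
	assert(slot >= 0 && slot < MAX_PINS && pins[slot].refs > 0);
	pins[slot].refs--;
}

/* True if blk falls inside any extent that still holds a reference. */
static bool block_is_pinned(uint64_t blk)
{
	for (int i = 0; i < MAX_PINS; i++) {
		if (pins[i].refs && blk >= pins[i].start &&
		    blk < pins[i].start + pins[i].len)
			return true;
	}
	return false;
}

/* Freeing only updates the free map; pins are tracked separately. */
static void free_block(uint64_t blk)
{
	assert(blk < NBLOCKS);
	block_free[blk] = true;
}

/* Allocate the first block that is free and not pinned, or -1 if none. */
static int64_t alloc_block(void)
{
	for (uint64_t blk = 0; blk < NBLOCKS; blk++) {
		if (block_free[blk] && !block_is_pinned(blk)) {
			block_free[blk] = false;
			return (int64_t)blk;
		}
	}
	return -1;
}
```

Because the pin table lives only in memory, a crash or reboot leaves the free
map correct and simply forgets the pins, which matches the observation about
on-disk versus in-memory state above. A real implementation would presumably
need an interval tree and locking, and would have to cover every allocator
entry point plus trimming.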