Re: direct_access, pinning and truncation

On Wed, Oct 08, 2014 at 04:21:32PM -0700, Zach Brown wrote:
> [... figuring out how g_u_p() references can prevent freeing and
> re-using the underlying mapped pmem addresses given the lack of struct
> pages for the mapping]
> 
> > I see three solutions here:
> > 
> > 1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
> > the caller with the struct pages of the DRAM.  Modify DAX to handle some
> > file pages being in the page cache, and make sure that we know whether
> > the PMEM or DRAM is up to date.  This has the obvious downside that
> > get_user_pages() becomes slow.
> 
> And serialize transitions and fs stores to pmem regions.  And now
> storing to dram-fronted pmem goes through all the dirtying and writeback
> machinery.  This sounds like a nightmare to me, to be honest.

That's not so bad ... it's just normal page-cache stuff, really.  It'd be
per-page serialisation, just like the current gunk we go through to get
sparse loads to not allocate backing store.

> > 2. Modify filesystems that support DAX to handle pinning blocks.
> > Some filesystems (that support COW and snapshots) already support
> > reference-counting individual blocks.  We may be able to do better by
> > using a tree of pinned extents or something.  This makes it much harder
> > to modify a filesystem to support DAX, and I don't see patches adding
> > this capability to ext2 being warmly welcomed.
> 
> This seems... doable?  Recording the referenced pmem in free lists in the
> fs is fine as long as the pmem isn't modified until the references are
> released, right?

As long as it's not *allocated* to anything else (which seems to be what
you're actually saying in the next paragraph).

> Maybe in the allocator you skip otherwise free blocks if they intersect
> with the run time structure (rbtree of extents, presumably) that is
> taking the place of reference counts in struct page.  There aren't
> *that* many allocator entry points.  I guess you'd need to avoid other
> modifications of free space like trimming :/.  It still seems reasonably
> doable?

Ah, so on reboot, the on-disk data structures are all correct, and
the in-memory data structures went away with the runtime pinning of
the memory.  Nice.

> And hey, lord knows we love to implement rbtrees of extents in file
> systems!  (btrfs: struct extent_state, ext4: struct extent_status)
> 
> The tricky part would be maintaining that structure behind g_u_p() and
> put_page() calls.  Probably a richer interface that gives callers
> something more than just raw page pointers.
> 
> > 3. Make truncate() block if it hits a pinned page.  There's really no
> > good reason to truncate a file that has pinned pages; it's either a bug
> > or you're trying to be nasty to someone.  We actually already have code
> > for this; inode_dio_wait() / inode_dio_done().  But page pinning isn't
> > just for O_DIRECT I/Os and other transient users like crypto, it's also
> > for long-lived things like RDMA, where we could potentially block for
> > an indefinite time.
> 
> I have no concrete examples, but I agree that it sounds like the sort of
> thing that would bite us in the ass if we miss some use case :/.
> 
> I guess my initial vote is for trying a less-than-perfect prototype of
> #2 to see just how hairy the rough outline gets.

Thinking about it now, it seems less hairy than I initially thought.  I'll
give it a quick try and see how it goes.
