Re: [RFC PATCH 0/7] xfs: add reflink & dedupe support for fsdax.

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 10 Oct 2019 18:30:20 +1100

On Wed, Oct 09, 2019 at 10:11:52AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 08, 2019 at 11:31:44PM -0700, Christoph Hellwig wrote:
> > Btw, I just had a chat with Dan last week on this.  And he pointed out
> > that while this series deals with the read/write path issues of 
> > reflink on DAX it doesn't deal with the mmap side issue that
> > page->mapping and page->index can point back to exactly one file.
> > 
> > I think we want a few xfstests that reflink a file and then use the
> > different links using mmap, as that should blow up pretty reliably.
> 
> Hmm, you're right, we don't actually have a test that checks the
> behavior of mwriting all copies of a shared block.  Ok, I'll go write
> one.

I've pointed this problem out to everyone who has asked me "what do
we need to do to support reflink on DAX". I've even walked a couple
of people right through the problem that needs to be solved and
discussed the potential solutions to it.

Problems that I think need addressing:

	- device dax and filesystem dax have fundamentally different
	  needs in this space, so they need to be separated and not
	  try to use the same solution.
	- dax_lock_entry() being used as a substitute for
	  page_lock() but it not being held on the page itself means
	  it can't be extended to serialise access to the page
	  across multiple mappings that are unaware of each other
	- dax_lock_page/dax_unlock_page interface for hardware
	  memory errors needs to report to the
	  filesystem for processing and repair, not assume the page
	  is user data and killing processes is the only possible
	  recovery mechanism.
	- dax_associate_entry/dax_disassociate_entry can only work
	  for a 1:1 page:mapping,index relationship. It needs to go
	  away and be replaced by a mechanism that allows
	  tracking multiple page mapping/index/state tuples. This
	  has much wider use than DAX (e.g. sharing page cache pages
	  between reflinked files)

I've proposed shadow pages (based on a concept from Matethw Wilcox)
for each read-only reflink mapping with the real physical page being
owned by the filesystem and indexed by LBA in the filesystem buffer
cache. This would be based on whether the extent in the file the
page is mapped from has multiple references to it.

i.e. When a new page mapping occurs in a shared extent, we add the
page to the buffer cache (i.e. point a struct xfs_buf at it)i if it
isn't already present, then allocate a shadow page, point it at the
master, set it up with the new mapping,index tuple and add it to the
mapping tree. Then we can treat it as a unique page even though it
points to the read-only master page.

When the page get's COWed, we toss away the shadow page and the
master can be reclaimed with the reference count goes to zero or the
extent is no longer shared.  Think of it kind of like the way we
multiply reference the zero page for holes in mmap()d dax regions,
except we can have millions of them and they are found by physical
buffer cache index lookups. 

This works for both DAX and non-DAX sharing of read-only shared
filesytsem pages. i.e. it would form the basis of single-copy
read-only page cache pages for reflinked files.

There was quite a bit of talk at LSFMM 2018 about having a linked
list of mapping structures hanging off a struct page, one for each
mapping that references the page. Operations would then have to walk
all mappings that reference the page. This was useful to other
subsystems (HMM?) for some purpose I forget, but I'm not sure it's
particularly useful by itself for non-dax reflink purposes - I
suspect the filesystem would still need to track such pages itself
in it's buffer cache so it can find the cached page to link new
reflink copies to the same page...

ISTR a couple of other solutions were thrown around, but I don't
think anyone came up with a simple solution...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx