Re: question re: xfs inode to inode copy implementation

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Wed, 22 Apr 2015 23:40:16 -0700

On Thu, Apr 23, 2015 at 11:13:45AM +1000, Dave Chinner wrote:
> On Wed, Apr 22, 2015 at 05:44:26PM -0700, Darrick J. Wong wrote:
> > On Wed, Apr 22, 2015 at 08:27:38AM +1000, Dave Chinner wrote:
> > > On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> > > > On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@xxxxxxxxxxx wrote:
> > > > > Hello list,
> > > > > 
> > > > > I'm prototyping something like reflinks in xfs and was wondering if
> > > > > anyone could give me some pointers on the best way to duplicate the
> > > > 
> > > > Heh, funny, I'm working on that too...
> > > > 
> > > > > blocks of the shared inode at the reflink inode, the copy which must
> > > > > occur when breaking the link.
> > > > 
> > > > ...though I'm not sure what "the shared inode at the reflink inode" means.
> > > > Are there somehow three inodes involved with reflinking one file to another?
> > > > 
> > > > > It would be nice to do the transfer via the page cache after allocating
> > > > > the space at the desintation inode, but it doesn't seem like I can use
> > > > > any of the kernel helpers for copying the data via the address_space
> > > > > structs since I don't have a struct file on hand for the copy source.
> > > > > I'm doing this in xfs_file_open() so the only struct file I have is the
> > > > > file being opened for writing - the destination of the copy.
> > > > 
> > > > So you're cloning the entire file's contents (i.e. breaking the reflink) as
> > > > soon as the file is opened rw?
> > > > 
> > > > > What I do have on hand is the shared inode and the destination inode
> > > > > opened and ready to go, and the struct file for the destination.
> > > > 
> > > > The design I'm pursuing is different from yours, I think -- two files can use
> > > > the regular bmbt to point to the same physical blocks, and there's a per-ag
> > > > btree that tracks reference counts for physical extents.  What I'd like to do
> > > > for the CoW operation is to clone the page (somehow), change the bmbt mapping
> > > > to "delayed allocation", and let the dirty pages flush out like normal.
> > > > 
> > > > I haven't figured out /how/ to do this, mind you.  The rest of the bookkeeping
> > > > parts are already written, though.
> > > 
> > > My first thought on COW was to try to use the write path get_blocks
> > > callback to do all this.  i.e. in __xfs_get_blocks() detect that it
> > > is an overwrite of a shared extent, remove the shared extent
> > > reference and then convert it to delayed alloc extent. (i.e.
> > > xfs_iomap_overwrite_shared()). Then writeback will allocate new
> > > blocks for the data.
> > 
> > <nod> That was my first thought, too.  I was rather hoping that I could just
> > update the incore BMBT to kick off delayed allocation and hope that it flushes
> > everything to disk before anything can blow up. (Ha...)  But alas, I hit the
> > same conclusion that you'd have to allocate the new block, write it, and only
> > then ought you update the BMBT.
> > 
> > > The question, however, is how to do this in a manner such that
> > > crashing between the breaking of the shared reference and data
> > > writeback doesn't leave us with a hole instead of data. To deal with
> > > that, I think that we're going to have to break shared extents
> > > during writeback, not during the write. However, we are going to
> > > need a delalloc reservation to do that.
> > > 
> > > So I suspect we need a new type of extent in the in-core extent tree
> > > - a "delalloc overwrite" extent - so that when we map it in writeback
> > > we can allocate the new extent, do the write to it, and then on IO
> > > completion do the BMBT manipulation to break the shared reference
> > > and insert the new extent. That solves the atomicity problem, and it
> > > allows us to track COW data on a per-inode basis without having
> > > to care about all the other reflink contexts to that same data.
> > 
> > I think that'll work... in xfs_vm_writepage (more probably xfs_map_blocks) if
> > the refcount > 2, allocate a new block, insert a new delalloc-overwrite in-core

Speaking of which, should I add a XFS_DIFLAG_ to indicate that a file has
(or has had) reflinked blocks?  In theory this would save us a trip through
the reflinkbt for "normal" files when the reflink feature is set, but
we'd then have to maintain it (and repair would have to check it).

> > extent with the new block number and set a flag in the ioend to remind
> > ourselves to update the bookkeeping later.  During xfs_end_io if that flag is
> > set, commit the new in-core extent to disk, decrement the refcounts, and
> > free the block if the refcount is 1.
> 
> If we are going to track the overwrite in-core, then we are probably
> going to need some form of intent/done transaction structure so that
> we don't leak the allocated block if we crash before the completion
> runs and commits the extent swap. I'd prefer to do that than require
> on-disk state to prevent free space leakage in this case.
>
> We could, potentially, abuse the EFI for this. i.e. record an EFI
> for the extent in the allocation transaction, then in the completion
> record a matching EFD. That way recovery will free the allocated
> extent if we don't complete it....

Clever!  I was looking around to see if XFS had something that could
take care of cleaning up orphans like that.

Rather nice that the usual outcome to "I think I want ____ data structure" is
that someone already thought of it. :)

> > For O_DIRECT I suppose we could use a similar mechanism -- you'd
> > have to set up the delalloc-overwrite extent in
> > xfs_iomap_write_direct() and use xfs_end_io_direct_write() to
> > update the bmbt and decrement the refcounts in the same way as
> > above.
> 
> Effectively.
> 
> > Hm.  Not sure what'll happen if the write buffer or the block size aren't a
> > page size.  Will have to go figure out what XFS does to fill in the rest of a
> > block if you try to directio-write to less than a block.  Hoping it's less
> > weird than other things I've seen.
> 
> Oh, it's weird enough. We allow sector size alignment, but we
> serialise all unaligned DIO writes because the sub-block zeroing is
> a nightmare to co-ordinate properly. But, really, DIO to a reflink
> file is not a performant operation, so maybe we should just punt all
> writes to shared extent files to the buffered IO path and not have
> to care about COW during DIO writes?

Sure, I'll punt to buffered mode for now, just to get something working.
I can always come back to this later if I dare.

--D

> 
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs