On Tue, Jan 05, 2016 at 07:42:26AM -0500, Brian Foster wrote:
> On Mon, Jan 04, 2016 at 03:59:51PM -0800, Darrick J. Wong wrote:
> > On Sun, Dec 20, 2015 at 09:02:54AM -0500, Brian Foster wrote:
> > > On Sat, Dec 19, 2015 at 12:56:23AM -0800, Darrick J. Wong wrote:
> > > > Hi all,
> > > >
> > > ...
> > > > Fixed since RFCv3:
> > > >
> > > > * The reflink and dedupe ioctls are being hoisted to the VFS, as
> > > >   provided in the first few patches.  Patch 81 connects to this
> > > >   functionality.
> > > >
> > > > * Copy on write has been rewritten for v4.  We now use the
> > > >   existing delayed allocation mechanism to coalesce writes
> > > >   together, deferring allocation until writeout time.  This
> > > >   enables CoW to make better block placement decisions and
> > > >   significantly reduces overhead.  CoW is still pretty slow, but
> > > >   not as slow as before.
> > > >
> > > > * Direct IO CoW has been implemented using the same mechanism as
> > > >   above, but modified to perform the allocation and remapping
> > > >   right then and there.  Throughput is much higher than pushing
> > > >   data through the page cache CoW.  (It's the same mechanism, but
> > > >   we're playing with chunks bigger than a single memory page.)
> > > >
> > > > * CoW ENOSPC works correctly now, except in the pathological case
> > > >   that the AG fills up and the rmap btree cannot expand.  That
> > > >   will be addressed for v5.
> > > >
> > > > * fallocate will now unshare blocks to prevent future ENOSPC, as
> > > >   you'd expect.
> > > >
> > > > * refcount btree blocks are preallocated at mount time to prevent
> > > >   ENOSPC while trying to expand the tree.  This also has the
> > > >   effect of grouping the btree blocks together, which can speed
> > > >   up CoW remapping.
> > >
> > > Can you elaborate on how these blocks are preallocated?  E.g., is
> > > the tree "preconstructed" in some sense?  However that is done, is
> > > this the anticipated solution or a temporary workaround?
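[Aside: the clone ioctl being hoisted to the VFS above appears to be the
interface now exposed to userspace as FICLONE in <linux/fs.h>.  A minimal
sketch of driving it; the clone_file() helper name is hypothetical and
error handling is trimmed for brevity:]

```c
/* Sketch: clone (reflink) src into dst via the VFS clone ioctl.
 * Requires a reflink-capable filesystem; FICLONE is the name the
 * interface eventually took in <linux/fs.h>. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */

int clone_file(const char *src, const char *dst)
{
	int sfd = open(src, O_RDONLY);
	int dfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	int ret = -1;

	if (sfd >= 0 && dfd >= 0)
		ret = ioctl(dfd, FICLONE, sfd); /* share extents, no data copy */
	if (ret)
		perror("FICLONE");
	if (sfd >= 0)
		close(sfd);
	if (dfd >= 0)
		close(dfd);
	return ret;
}
```

On a filesystem without reflink support the ioctl fails with EOPNOTSUPP;
where it is supported, the two files share extents afterwards and diverge
only when one of them is written.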
> > > Also, shouldn't the ENOSPC condition be handled by the AGFL?  I
> > > take it there is something going on here that renders that
> > > solution flawed, so I'm just curious what it is.
> > >
> > > (Sorry if this is all explained elsewhere, but I haven't yet had a
> > > chance to take a close enough look at this feature.)
> >
> > Reference count btree blocks aren't allocated from the AGFL; they're
> > allocated from the free space in the same manner as the inobt, per a
> > review comment from Dave a looong time ago. :)
>
> Ah, Ok.
>
> > As such, we can get ourselves into the nasty situation where every
> > block in the AG has been allocated to file data.  If we then see a
> > bunch of reference count changes that are scattered around the AG,
> > the reference count btree has to expand to hold all the new
> > records... but there isn't space, and the operation fails.  Given
> > that we know the maximum possible size of the refcount btree (it's
> > 0.3% of the AG size with 4k blocks), I figured it was easy enough to
> > avoid ENOSPC for reflink operations.
>
> Sounds reasonable.
>
> > I've temporarily fixed this by adding code that figures out how many
> > blocks we need if the reference count btree has to have a unique
> > record for every block in the AG, and holding that many blocks until
> > either they're allocated to the refcount btree or freed at umount
> > time.  Right now it's a temporary fix (if the FS crashes, the
> > reserved blocks are lost), but it wouldn't be difficult for the FS
> > to make a permanent reservation that's recorded on disk somehow.
> > That involves writing things to disk and making xfsprogs understand
> > the reservation, though; let's see what people say about the
> > reserved pool idea at all.
> >
> > Does that make sense? :)
>
> Yep, it sounds sort of like the reserve pool mechanism used to protect
> against ENOSPC when freeing blocks.  Curious... why are the reserved
> blocks lost on fs crash?  Wouldn't they be reserved again on the
> subsequent mount?

They will, but the pre-crash reservation isn't (yet) written down
anywhere on disk.

Thank /you/ for having a look at the reflink code! :)

--D

> Thanks for the explanation...
>
> Brian
>
> > --D
> >
> > > Brian
> > >
> > > > Issues:
> > > >
> > > > * The extent swapping ioctl still allocates a bigger fixed-size
> > > >   transaction.  That's most likely a stupid thing to do, so
> > > >   getting a better grip on how the journalling code works and
> > > >   auditing all the new transaction users will have to happen.
> > > >   Right now it mostly gets lucky.
> > > >
> > > > * EFI tracking for the allocated-but-not-yet-mapped blocks is
> > > >   nonexistent.  A crash will leak them.
> > > >
> > > > * ENOSPC while expanding the rmap btree can crash the FS.  For
> > > >   now we work around this problem by making the AGFL as big as
> > > >   possible, failing CoW attempts with ENOSPC if there aren't
> > > >   enough AGFL blocks available, and hoping that doesn't actually
> > > >   happen.
> > > >
> > > > If you're going to start using this mess, you probably ought to
> > > > just pull from my github trees for kernel[1], xfsprogs[2], and
> > > > xfstests[3].  There are also updates for xfs-docs[4] and
> > > > man-pages[5].
> > > >
> > > > The patches have been xfstested with x64, i386, and ppc64; while
> > > > in general the tests run to completion, there are still periodic
> > > > bugs that will be addressed by the next RFC.  There's a
> > > > persistent crash on arm64 and ppc64el that I haven't been able
> > > > to triage.
> > > >
> > > > This is an extraordinary way to eat your data.  Enjoy!
> > > > Comments and questions are, as always, welcome.
> > > >
> > > > --D
> > > >
> > > > [1] https://github.com/djwong/linux/tree/for-dave
> > > > [2] https://github.com/djwong/xfsprogs/tree/for-dave
> > > > [3] https://github.com/djwong/xfstests/tree/for-dave
> > > > [4] https://github.com/djwong/xfs-documentation/tree/for-dave
> > > > [5] https://github.com/djwong/man-pages/commits/for-mtk

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
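[Appendix: a back-of-the-envelope check of the "maximum possible size of
the refcount btree is 0.3% of the AG with 4k blocks" figure from the
thread above, taking one refcount record per AG block as the worst case.
The on-disk constants used here — 12-byte records, 4-byte keys and
pointers, a 56-byte short-form btree block header — are my assumptions
about the v5 format, not numbers taken from the patches:]

```c
/* Worst-case refcount btree size: one record per AG block, 4k blocks.
 * The constants below are assumed v5 on-disk format values. */

#define BLOCKSIZE 4096ul
#define BT_HDR      56ul  /* short-form CRC-enabled btree block header */
#define REC_SIZE    12ul  /* rc_startblock, rc_blockcount, rc_refcount */
#define KEYPTR       8ul  /* 4-byte key + 4-byte block pointer */

unsigned long refcountbt_worst_case(unsigned long agblocks)
{
	unsigned long leaf_recs = (BLOCKSIZE - BT_HDR) / REC_SIZE; /* 336 */
	unsigned long node_ptrs = (BLOCKSIZE - BT_HDR) / KEYPTR;   /* 505 */
	unsigned long blocks = (agblocks + leaf_recs - 1) / leaf_recs;
	unsigned long total = blocks;

	/* Walk up the tree, adding internal levels until we reach a root. */
	while (blocks > 1) {
		blocks = (blocks + node_ptrs - 1) / node_ptrs;
		total += blocks;
	}
	return total;
}
```

For a 4 GiB AG (2^20 blocks of 4k) this works out to 3,129 btree blocks,
just under 0.3% of the AG — consistent with the figure quoted above.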