[PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Thu, 16 Jun 2016 18:17:53 -0700

Hi all,

This is the sixth revision of a patchset that adds kernel support to
XFS for tracking reverse-mappings of physical blocks to file and
metadata (rmap) and for mapping multiple file logical blocks to the
same physical block (reflink), and implements the beginnings of
online metadata scrubbing.  Given the significant number of design
assumptions that change with block sharing, rmap and reflink are
provided together.  There shouldn't be any incompatible on-disk format
changes, pending a thorough review of the patches within.

The reverse mapping implementation features a simple per-AG b+tree
containing tuples of (physical block, owner, offset, blockcount) with
the key being the first three fields.  The large record size will
enable us to reconstruct corrupt block mapping btrees (bmbt); the
large key size is necessary to identify uniquely each rmap record in
the presence of shared physical blocks.  In contrast to previous
iterations of this patchset, it is no longer a requirement that there
be a 1:1 correspondence between bmbt and rmapbt records; each rmapbt
record can cover multiple bmbt records.
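
To make the key layout concrete, here is a minimal Python sketch (not
the kernel code).  The RmapRecord type, its field names, and the inode
numbers are made up for illustration; only the tuple layout and the
choice of key fields come from the description above.

```python
from typing import NamedTuple


class RmapRecord(NamedTuple):
    """Hypothetical stand-in for one rmapbt record (names are illustrative)."""
    startblock: int   # first physical block of the extent
    owner: int        # inode number or metadata owner
    offset: int       # logical offset within the owner
    blockcount: int   # extent length in blocks


def rmap_key(rec):
    # The key is the first three fields; blockcount is not part of it.
    return (rec.startblock, rec.owner, rec.offset)


# Two files (inodes 128 and 131) share physical blocks 100..103.
records = [
    RmapRecord(100, 131, 8, 4),
    RmapRecord(100, 128, 0, 4),
    RmapRecord(40, 128, 4, 10),
]
records.sort(key=rmap_key)

# Physical block alone cannot tell the two shared records apart...
physical_only = {r.startblock for r in records}
# ...but the three-field key identifies each record uniquely.
full_keys = {rmap_key(r) for r in records}

for rec in records:
    print(rmap_key(rec), rec.blockcount)
```

With sharing, the two records starting at block 100 differ only in
owner and offset, which is why the key has to include them.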

The reflink implementation features a simple per-AG b+tree containing
tuples of (physical block, blockcount, refcount) with the key being
the physical block.  Copy on Write (CoW) is implemented by creating a
separate CoW fork and using the existing delayed allocation mechanism
to try to allocate as large a replacement extent as possible before
committing the new data to media.  A CoW extent size hint allows
administrators to influence the size of the replacement extents, and
certain writes can be "promoted" to CoW when it would be advantageous
to reduce fragmentation.  The userspace interfaces to reflink and
dedupe are the VFS FICLONE, FICLONERANGE, and FIDEDUPERANGE ioctls,
which were previously private to btrfs.
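
As a rough illustration of the (physical block, blockcount, refcount)
tuples, the Python sketch below derives them from a list of file
mappings.  It is a toy model under my own assumptions: the function
name is hypothetical, it ignores the CoW fork and delayed allocation
entirely, and it emits records for unshared blocks too, which may not
match what the on-disk btree actually stores.

```python
from collections import Counter


def build_refcount_records(extents):
    """Turn file mappings into (startblock, blockcount, refcount) tuples.

    extents: iterable of (startblock, blockcount), one per file mapping.
    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for start, length in extents:
        for blk in range(start, start + length):
            counts[blk] += 1

    records = []
    for blk in sorted(counts):
        if records:
            start, length, refs = records[-1]
            # Extend the previous record if this block is adjacent
            # and has the same reference count.
            if start + length == blk and refs == counts[blk]:
                records[-1] = (start, length + 1, refs)
                continue
        records.append((blk, 1, counts[blk]))
    return records


# File A maps blocks 10..17; file B shares blocks 12..15 with it.
shared = build_refcount_records([(10, 8), (12, 4)])
print(shared)
```

If file B then overwrites one shared block and CoW moves that block
elsewhere, rerunning the function on the new mappings shows the shared
record splitting into several smaller ones.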

Since the previous posting, I have made some major changes to the
underlying XFS common code.  First, I have extended the generic b+tree
implementation to support overlapping intervals, which is necessary
for the rmapbt on a reflink filesystem where there can be a number of
rmapbt records representing a physical block.  The new b+tree variant
introduces the notion of a "high key" for each record; it is the
highest key that can be used to identify a record.  On disk, an
overlapped-interval b+tree looks like a traditional b+tree except that
nodes store both the lowest key and the highest key accessible through
that subtree pointer.  There's a new interval query function that uses
both keys to iterate all records overlapping a given range of keys.
This change allows us to remove the old requirement that each bmbt
record correspond to a matching rmapbt record.
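
The interval query can be sketched in a few lines of Python.  This is
a simplified model, not the libxfs implementation: keys are plain
integers rather than multi-field rmap keys, the tree has a fixed shape,
and the Leaf, Node, and query_range names are invented for the
example.

```python
class Leaf:
    def __init__(self, records):
        # records: list of (low_key, high_key) pairs, sorted by low key
        self.records = records
        self.low = records[0][0]
        self.high = max(high for _, high in records)


class Node:
    def __init__(self, children):
        self.children = children
        # Each subtree pointer carries both the lowest and highest key
        # reachable through it.
        self.low = children[0].low
        self.high = max(child.high for child in children)


def query_range(node, lo, hi):
    """Yield every record whose [low, high] interval overlaps [lo, hi]."""
    # Prune: nothing below this pointer can overlap the query range.
    if node.high < lo or node.low > hi:
        return
    if isinstance(node, Leaf):
        for rec_lo, rec_hi in node.records:
            if rec_lo <= hi and rec_hi >= lo:
                yield (rec_lo, rec_hi)
    else:
        for child in node.children:
            yield from query_range(child, lo, hi)


root = Node([
    Leaf([(0, 9), (5, 40)]),
    Leaf([(10, 19), (12, 15)]),
    Leaf([(50, 59), (55, 70)]),
])
hits = list(query_range(root, 30, 52))
print(hits)
```

Note that the record (5, 40) lives in the first leaf.  A lookup that
only knew the low keys would head toward the second leaf for key 30
and miss it; the high key on each pointer is what lets the query both
find it and skip the second leaf entirely.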

The second big change is to the xfs_bmap_free functions.  The existing
code implements a mechanism to defer metadata (specifically, free
space b+tree) updates across a transaction commit by logging redo
items that can be replayed during recovery.  It is an elegant way to
avoid running afoul of AG locking order rules /and/ it can in theory
be used to get around running out of transaction reservation.  That
said, I have refactored it into a generic "deferred operations"
mechanism that can defer arbitrary types of work to a subsequent
rolled transaction.  The framework thus allows me to schedule rmapbt,
refcountbt, and bmbt updates while maintaining correct redo in case of
failure.  Remapping activities for reflink and CoW are now atomic.
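
For readers who have not looked at the xfs_bmap_free code, the general
shape of the idea is sketched below in Python.  This is a toy model
under my own assumptions, not the patch's API: the DeferredOps class,
the log entry format, and the handler names are all hypothetical, and
a plain list stands in for the log.

```python
from collections import deque


class DeferredOps:
    """Toy model: queue work items, finish each in a later rolled transaction."""

    def __init__(self):
        self.pending = deque()
        self.log = []        # stand-in for the on-disk log
        self.next_id = 0

    def defer(self, kind, payload):
        # Log an intent so recovery can redo the work after a crash.
        item_id = self.next_id
        self.next_id += 1
        self.log.append(("intent", item_id, kind, payload))
        self.pending.append((item_id, kind, payload))

    def finish(self, handlers):
        while self.pending:
            item_id, kind, payload = self.pending.popleft()
            self.log.append(("roll",))        # move to a fresh transaction
            handlers[kind](self, payload)     # may defer() more work
            self.log.append(("done", item_id))


def unfinished_intents(log):
    """Intents with no matching done record: what recovery must replay."""
    done = {entry[1] for entry in log if entry[0] == "done"}
    return [e for e in log if e[0] == "intent" and e[1] not in done]


applied = []


def do_bmap(ops, payload):
    applied.append(("bmap", payload))
    ops.defer("rmap", payload)        # follow-up work for later transactions
    ops.defer("refcount", payload)


def record_only(kind):
    return lambda ops, payload: applied.append((kind, payload))


handlers = {"bmap": do_bmap,
            "rmap": record_only("rmap"),
            "refcount": record_only("refcount")}

ops = DeferredOps()
ops.defer("bmap", {"ino": 128, "startblock": 100, "len": 4})
ops.finish(handlers)
print([kind for kind, _ in applied])
```

Truncating ops.log at any point and calling unfinished_intents() on
the remainder gives the set of items that would still need redo, which
is the property that makes a multi-step remap recoverable.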

The third big change is the establishment of a per-AG block
reservation mechanism.  This "hides" some blocks from the regular
block allocator so that refcountbt and rmapbt expansions can use them.
The reservation is needed because we can no longer assume that file
mapping operations always involve block allocation.  Without it, we
get into trouble when a file allocates an entire AG, is reflinked by
other files, and subsequent CoWs cause record splits in the rmap and
reflink btrees.
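
The effect of the reservation can be shown with a small accounting
model in Python.  Everything here is an assumption made for the
example: the AGSpace class, the NoSpace exception, and the numbers are
invented, and the sketch says nothing about how the real reservation
is sized or replenished.

```python
class NoSpace(Exception):
    """Raised when an allocation cannot be satisfied (think ENOSPC)."""


class AGSpace:
    """Toy free-space accounting for one allocation group."""

    def __init__(self, total_blocks, btree_reserve):
        self.free = total_blocks
        self.reserved = btree_reserve   # hidden from regular allocations

    def alloc_data(self, nblocks):
        # Regular allocations only see free space beyond the reservation.
        if nblocks > self.free - self.reserved:
            raise NoSpace("data allocation would eat into the reservation")
        self.free -= nblocks

    def alloc_btree(self, nblocks):
        # Btree expansion is allowed to consume the reserved blocks.
        if nblocks > self.free:
            raise NoSpace("AG is completely full")
        self.free -= nblocks
        self.reserved = max(0, self.reserved - nblocks)


ag = AGSpace(total_blocks=1000, btree_reserve=16)
ag.alloc_data(984)          # one file takes everything it is allowed to
try:
    ag.alloc_data(1)        # regular allocator now sees a full AG
    data_failed = False
except NoSpace:
    data_failed = True
ag.alloc_btree(2)           # a record split can still get its blocks
print(data_failed, ag.free, ag.reserved)
```

Without the reserve (btree_reserve=0) the first file could take all
1000 blocks, and a later btree split caused by CoW of a reflinked
extent would have nowhere to go.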

At the very end of the patchset is an initial implementation of a
GETFSMAPX ioctl for userland to query the physical block mapping of a
filesystem; and metadata scrubbing for XFS.  The scrubber iterates
the per-AG btrees and does some simple cross-checking when possible;
I built it to check the functionality of the new b+tree code.

The first few patches fix various vfs/xfs bugs, add an enhancement to
the xfs_buf tracepoints so that we can analyze buffer deadlocks, and
merge differences between the kernel and userspace libxfs so that the
rest of the patches apply consistently.

There are still two functionality gaps: the extent swap ioctl isn't
functional when rmap is enabled; and rmap cannot (yet) coexist with
realtime devices.  These will be addressed in the next sprint.

If you're going to start using this mess, you probably ought to just
pull from my github trees for kernel[1], xfsprogs[2], and xfstests[3].
There are also updates for xfs-docs[4].  The kernel patches should
apply to dchinner's for-next; xfsprogs patches to for-next; and
xfstests to master.  NOTE however that the kernel git tree already has
the five for-next patches included.

The patches have been xfstested with x64, i386, and armv7l; arm64,
ppc64, and ppc64le no longer boot in qemu.  All three tested
architectures pass all 'clone' group tests except xfs/128 (which is
the swapext
test), and AFAICT don't cause any new failures for the 'auto' group.

This is an extraordinary way to eat your data.  Enjoy!
Comments and questions are, as always, welcome.

--D

[1] https://github.com/djwong/linux/tree/djwong-devel
[2] https://github.com/djwong/xfsprogs/tree/djwong-devel
[3] https://github.com/djwong/xfstests/tree/djwong-devel
[4] https://github.com/djwong/xfs-documentation/tree/djwong-devel