On Wed, Oct 12, 2016 at 11:18:49PM +1100, Dave Chinner wrote:
> Hi Linus,
>
> This is the second part of the XFS updates for this merge cycle.
> This pullreq contains the new shared data extents feature for XFS,
> and can be found at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git tags/xfs-reflink-for-linus-4.9-rc1
>
> The full pull request output is below.
>
> Given the complexity and size of this change I am expecting - like
> the addition of reverse mapping last cycle - that there will be some
> follow-up bug fixes and cleanups around the -rc3 stage for issues
> that I'm sure will show up once the code hits a wider userbase.
>
> What it is:
>
> At the most basic level we are simply adding shared data extents to
> XFS - i.e. a single extent on disk can now have multiple owners. To
> do this we have to add new on-disk features to both track the shared
> extents and the number of times they've been shared. This is done by
> the new "refcount" btree that sits in every allocation group. When
> we share or unshare an extent, this tree gets updated.
>
> Along with this new tree, the reverse mapping tree needs to be
> updated to track each owner of a shared extent. This also needs to
> be updated on every share/unshare operation. These interactions at
> extent allocation and freeing time have complex ordering and
> recovery constraints, so there's a significant amount of new
> intent-based transaction code to ensure that operations are
> performed atomically from both the runtime and integrity/crash
> recovery perspectives.
>
> We also need to break sharing when writes hit a shared extent - this
> is where the new copy-on-write implementation comes in. We allocate
> new storage and copy the original data along with the overwrite data
> into the new location. We only do this for data as we don't share
> metadata at all - each inode has its own metadata that tracks the
> shared data extents, the extents undergoing CoW and its own private
> extents.
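To make the share/unshare and CoW bookkeeping described above a bit more concrete, here is a minimal toy sketch in C. It is not the XFS implementation: a flat array stands in for the per-AG refcount btree, there is no reverse mapping, logging or recovery, and every name in it (`toy_ag`, `toy_alloc`, `toy_share`, `toy_unshare`, `toy_write`) is hypothetical.

```c
#include <assert.h>

#define TOY_NEXTENTS 8

/* Hypothetical stand-in for the per-AG refcount btree:
 * one owner count per extent (0 = free, 1 = private, >1 = shared). */
struct toy_ag {
    unsigned int refcount[TOY_NEXTENTS];
};

/* Allocate the first free extent; returns its index, or -1 when full. */
static int toy_alloc(struct toy_ag *ag)
{
    for (int i = 0; i < TOY_NEXTENTS; i++) {
        if (ag->refcount[i] == 0) {
            ag->refcount[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Add one more owner to an allocated extent (what a reflink does to data). */
static void toy_share(struct toy_ag *ag, int ext)
{
    assert(ext >= 0 && ext < TOY_NEXTENTS && ag->refcount[ext] > 0);
    ag->refcount[ext]++;
}

/* Drop one owner; the extent becomes free when the count reaches zero. */
static void toy_unshare(struct toy_ag *ag, int ext)
{
    assert(ext >= 0 && ext < TOY_NEXTENTS && ag->refcount[ext] > 0);
    ag->refcount[ext]--;
}

/* Write to an extent on behalf of one owner. A private extent is
 * overwritten in place. A shared extent is left untouched: new storage
 * is allocated for the writer and the writer's reference to the old
 * extent is dropped. Returns the extent the writer owns afterwards,
 * or -1 if no space is left for the copy. */
static int toy_write(struct toy_ag *ag, int ext)
{
    assert(ext >= 0 && ext < TOY_NEXTENTS && ag->refcount[ext] > 0);
    if (ag->refcount[ext] == 1)
        return ext;

    int new_ext = toy_alloc(ag);
    if (new_ext < 0)
        return -1;
    toy_unshare(ag, ext);
    return new_ext;
}
```

In this toy model, sharing a freshly allocated extent raises its count to 2, and a write through either owner then lands in a newly allocated extent while the original drops back to a single owner.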
>
> Of course, being XFS, nothing is simple - we use delayed allocation
> for CoW similar to how we use it for normal writes. ENOSPC is a
> significant issue here - we build on the reservation code added
> in 4.8-rc1 with the reverse mapping feature to ensure we don't get
> spurious ENOSPC issues part way through a CoW operation. These
> mechanisms also help minimise fragmentation due to repeated CoW
> operations. To further reduce fragmentation overhead, we've also
> introduced a CoW extent size hint, which indicates how large a
> region we should allocate when we execute a CoW operation.
>
> With all this functionality in place, we can hook up
> .copy_file_range, .clone_file_range and .dedupe_file_range and we
> gain all the capabilities of reflink and other vfs provided
> functionality that enable manipulation of shared extents. We also
> added a fallocate mode that explicitly unshares a range of a file,
> which we implemented as an explicit CoW of all the shared extents in
> a file.
>
> As such, it's a huge chunk of new functionality with new on-disk
> format features and internal infrastructure. It warns at mount time
> as an experimental feature and that it may eat data (as we do with
> all new on-disk features until they stabilise). We have not
> released userspace support for it yet - userspace support currently
> requires download from Darrick's xfsprogs repo and build from
> source, so the access to this feature is really developer/tester
> only at this point. Initial userspace support will be released at
> the same time the kernel with this code in it is released.

Userland support is in this branch:
https://github.com/djwong/xfsprogs/tree/for-dave-for-4.9-15

There will undoubtedly be more of these since Dave will libxfs-apply
the kernel patches into for-next after the merge window closes, after
which I'll rebase the tool patches against that.
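For anyone who wants to poke at the new entry points from userspace without the xfsprogs tools, a rough sketch in C follows. The assumptions are mine, not from the pull request: the whole-file clone goes through the FICLONE ioctl, which as far as I know is how the VFS reaches .clone_file_range, and the unshare mode is spelled FALLOC_FL_UNSHARE_RANGE in the uapi headers I am aware of (the summary below calls it FALLOC_FL_UNSHARE). The fallback values are only there so the sketch builds against older headers, and the helper names `reflink_whole_file` and `unshare_range` are hypothetical.

```c
#define _GNU_SOURCE             /* exposes fallocate() in <fcntl.h> on glibc */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>

/* Fallbacks for older headers; values assumed to match the uapi headers. */
#ifndef FICLONE
#define FICLONE _IOW(0x94, 9, int)
#endif
#ifndef FALLOC_FL_UNSHARE_RANGE
#define FALLOC_FL_UNSHARE_RANGE 0x40
#endif

/* Make dest_fd share all of src_fd's data extents.
 * Returns 0 on success, -1 with errno set on failure
 * (for example when the filesystem has no reflink support). */
static int reflink_whole_file(int dest_fd, int src_fd)
{
    return ioctl(dest_fd, FICLONE, src_fd);
}

/* Ask the filesystem to give fd private copies of any shared extents
 * in [offset, offset + len). Returns 0 on success, -1 with errno set. */
static int unshare_range(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_UNSHARE_RANGE, offset, len);
}
```

Both helpers simply pass the kernel's result through, so a caller on a filesystem without the feature should expect -1 and inspect errno rather than assume the clone or unshare happened.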
> The new code causes 5-6 new failures with xfstests - these aren't
> serious functional failures but things like the output of tests
> changing slightly due to perturbations in layouts, space usage, etc.
> OTOH, we've added 150+ new tests to xfstests that specifically exercise
> this new functionality so it's got far better test coverage than any
> functionality we've previously added to XFS.

https://github.com/djwong/xfstests/tree/djwong-devel has fixes to some
of the tests, if you dare. :)

I'll resync with upstream the next time I see an xfstests.git update.
(Merge window is open, so I don't anticipate that until next week.)

> Darrick has done a pretty amazing job getting us to this stage, and
> special mention also needs to go to Christoph (review, testing,
> improvements and bug fixes) and Brian (caught several intricate
> bugs during review) for the effort they've also put in.

Yes, my hearty thanks to Dave, Christoph, and Brian for their support!

--D

>
> Thanks,
>
> -Dave.
>
> ----------
> The following changes since commit 155cd433b516506df065866f3d974661f6473572:
>
>   Merge branch 'xfs-4.9-log-recovery-fixes' into for-next (2016-10-03 09:56:28 +1100)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git tags/xfs-reflink-for-linus-4.9-rc1
>
> for you to fetch changes up to feac470e3642e8956ac9b7f14224e6b301b9219d:
>
>   xfs: convert COW blocks to real blocks before unwritten extent conversion (2016-10-11 09:03:19 +1100)
>
> ----------------------------------------------------------------
> xfs: reflink update for 4.9-rc1
>
>    < XFS has gained super CoW powers! >
>     ----------------------------------
>            \   ^__^
>             \  (oo)\_______
>                (__)\       )\/\
>                    ||----w |
>                    ||     ||
>
> Included in this update:
> - unshare range (FALLOC_FL_UNSHARE) support for fallocate
> - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for fsxattr interface
> - shared extent support for XFS
> - copy-on-write support for shared extents
> - copy_file_range support
> - clone_file_range support (implements reflink)
> - dedupe_file_range support
> - defrag support for reverse mapping enabled filesystems
>
> ----------------------------------------------------------------
> Christoph Hellwig (1):
>       xfs: convert COW blocks to real blocks before unwritten extent conversion
>
> Darrick J. Wong (70):
>       vfs: support FS_XFLAG_COWEXTSIZE and get/set of CoW extent size hint
>       vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks
>       xfs: return an error when an inline directory is too small
>       xfs: define tracepoints for refcount btree activities
>       xfs: introduce refcount btree definitions
>       xfs: refcount btree add more reserved blocks
>       xfs: define the on-disk refcount btree format
>       xfs: add refcount btree support to growfs
>       xfs: account for the refcount btree in the alloc/free log reservation
>       xfs: add refcount btree operations
>       xfs: create refcount update intent log items
>       xfs: log refcount intent items
>       xfs: adjust refcount of an extent of blocks in refcount btree
>       xfs: connect refcount adjust functions to upper layers
>       xfs: adjust refcount when unmapping file blocks
>       xfs: add refcount btree block detection to log recovery
>       xfs: reserve AG space for the refcount btree root
>       xfs: introduce reflink utility functions
>       xfs: create bmbt update intent log items
>       xfs: log bmap intent items
>       xfs: map an inode's offset to an exact physical block
>       xfs: pass bmapi flags through to bmap_del_extent
>       xfs: implement deferred bmbt map/unmap operations
>       xfs: when replaying bmap operations, don't let unlinked inodes get reaped
>       xfs: return work remaining at the end of a bunmapi operation
>       xfs: define tracepoints for reflink activities
>       xfs: add reflink feature flag to geometry
>       xfs: don't allow reflinked dir/dev/fifo/socket/pipe files
>       xfs: introduce the CoW fork
>       xfs: support bmapping delalloc extents in the CoW fork
>       xfs: create delalloc extents in CoW fork
>       xfs: support allocating delayed extents in CoW fork
>       xfs: allocate delayed extents in CoW fork
>       xfs: support removing extents from CoW fork
>       xfs: move mappings from cow fork to data fork after copy-write
>       xfs: report shared extent mappings to userspace correctly
>       xfs: implement CoW for directio writes
>       xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks
>       xfs: cancel pending CoW reservations when destroying inodes
>       xfs: store in-progress CoW allocations in the refcount btree
>       xfs: reflink extents from one file to another
>       xfs: add clone file and clone range vfs functions
>       xfs: add dedupe range vfs function
>       xfs: teach get_bmapx about shared extents and the CoW fork
>       xfs: swap inode reflink flags when swapping inode extents
>       xfs: unshare a range of blocks via fallocate
>       xfs: create a separate cow extent size hint for the allocator
>       xfs: preallocate blocks for worst-case btree expansion
>       xfs: don't allow reflink when the AG is low on space
>       xfs: try other AGs to allocate a BMBT block
>       xfs: garbage collect old cowextsz reservations
>       xfs: increase log reservations for reflink
>       xfs: add shared rmap map/unmap/convert log item types
>       xfs: use interval query for rmap alloc operations on shared files
>       xfs: convert unwritten status of reverse mappings for shared files
>       xfs: set a default CoW extent size of 32 blocks
>       xfs: check for invalid inode reflink flags
>       xfs: don't mix reflink and DAX mode for now
>       xfs: simulate per-AG reservations being critically low
>       xfs: recognize the reflink feature bit
>       xfs: various swapext cleanups
>       xfs: refactor swapext code
>       xfs: implement swapext for rmap filesystems
>       xfs: check inode reflink flag before calling reflink functions
>       xfs: reduce stack usage of _reflink_clear_inode_flag
>       xfs: remove isize check from unshare operation
>       xfs: fix label inaccuracies
>       xfs: fix error initialization
>       xfs: clear reflink flag if setting realtime flag
>       xfs: rework refcount cow recovery error handling
>
>  fs/open.c                          |    5 +
>  fs/xfs/Makefile                    |    7 +
>  fs/xfs/libxfs/xfs_ag_resv.c        |   15 +-
>  fs/xfs/libxfs/xfs_alloc.c          |   23 +
>  fs/xfs/libxfs/xfs_bmap.c           |  575 +++++++++++-
>  fs/xfs/libxfs/xfs_bmap.h           |   67 +-
>  fs/xfs/libxfs/xfs_bmap_btree.c     |   18 +
>  fs/xfs/libxfs/xfs_btree.c          |    8 +-
>  fs/xfs/libxfs/xfs_btree.h          |   16 +
>  fs/xfs/libxfs/xfs_defer.h          |    2 +
>  fs/xfs/libxfs/xfs_format.h         |   97 +-
>  fs/xfs/libxfs/xfs_fs.h             |   10 +-
>  fs/xfs/libxfs/xfs_inode_buf.c      |   24 +-
>  fs/xfs/libxfs/xfs_inode_buf.h      |    1 +
>  fs/xfs/libxfs/xfs_inode_fork.c     |   70 +-
>  fs/xfs/libxfs/xfs_inode_fork.h     |   28 +-
>  fs/xfs/libxfs/xfs_log_format.h     |  118 ++-
>  fs/xfs/libxfs/xfs_refcount.c       | 1698 ++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_refcount.h       |   70 ++
>  fs/xfs/libxfs/xfs_refcount_btree.c |  451 ++++++++++
>  fs/xfs/libxfs/xfs_refcount_btree.h |   74 ++
>  fs/xfs/libxfs/xfs_rmap.c           | 1120 +++++++++++++++++++++---
>  fs/xfs/libxfs/xfs_rmap.h           |    7 +
>  fs/xfs/libxfs/xfs_rmap_btree.c     |   82 +-
>  fs/xfs/libxfs/xfs_rmap_btree.h     |    7 +
>  fs/xfs/libxfs/xfs_sb.c             |    9 +
>  fs/xfs/libxfs/xfs_shared.h         |    2 +
>  fs/xfs/libxfs/xfs_trans_resv.c     |   23 +-
>  fs/xfs/libxfs/xfs_trans_resv.h     |    3 +
>  fs/xfs/libxfs/xfs_trans_space.h    |    9 +
>  fs/xfs/libxfs/xfs_types.h          |    3 +-
>  fs/xfs/xfs_aops.c                  |  222 ++++-
>  fs/xfs/xfs_aops.h                  |    4 +-
>  fs/xfs/xfs_bmap_item.c             |  508 +++++++++++
>  fs/xfs/xfs_bmap_item.h             |   98 +++
>  fs/xfs/xfs_bmap_util.c             |  589 ++++++++++---
>  fs/xfs/xfs_dir2_readdir.c          |    3 +-
>  fs/xfs/xfs_error.h                 |   10 +-
>  fs/xfs/xfs_file.c                  |  221 ++++-
>  fs/xfs/xfs_fsops.c                 |  107 ++-
>  fs/xfs/xfs_fsops.h                 |    3 +
>  fs/xfs/xfs_globals.c               |    5 +-
>  fs/xfs/xfs_icache.c                |  243 +++++-
>  fs/xfs/xfs_icache.h                |    7 +
>  fs/xfs/xfs_inode.c                 |   51 ++
>  fs/xfs/xfs_inode.h                 |   19 +
>  fs/xfs/xfs_inode_item.c            |    2 +-
>  fs/xfs/xfs_ioctl.c                 |   75 +-
>  fs/xfs/xfs_iomap.c                 |   35 +-
>  fs/xfs/xfs_iomap.h                 |    3 +-
>  fs/xfs/xfs_iops.c                  |    1 +
>  fs/xfs/xfs_itable.c                |    8 +-
>  fs/xfs/xfs_linux.h                 |    1 +
>  fs/xfs/xfs_log_recover.c           |  357 ++++++++
>  fs/xfs/xfs_mount.c                 |   32 +
>  fs/xfs/xfs_mount.h                 |    8 +
>  fs/xfs/xfs_ondisk.h                |    3 +
>  fs/xfs/xfs_pnfs.c                  |    7 +
>  fs/xfs/xfs_refcount_item.c         |  539 ++++++++++++
>  fs/xfs/xfs_refcount_item.h         |  101 +++
>  fs/xfs/xfs_reflink.c               | 1688 +++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_reflink.h               |   58 ++
>  fs/xfs/xfs_rmap_item.c             |   12 +
>  fs/xfs/xfs_stats.c                 |    1 +
>  fs/xfs/xfs_stats.h                 |   18 +-
>  fs/xfs/xfs_super.c                 |   87 ++
>  fs/xfs/xfs_sysctl.c                |    9 +
>  fs/xfs/xfs_sysctl.h                |    1 +
>  fs/xfs/xfs_trace.h                 |  742 +++++++++++++++-
>  fs/xfs/xfs_trans.h                 |   29 +
>  fs/xfs/xfs_trans_bmap.c            |  249 ++++++
>  fs/xfs/xfs_trans_refcount.c        |  264 ++++++
>  fs/xfs/xfs_trans_rmap.c            |    9 +
>  include/linux/falloc.h             |    3 +-
>  include/uapi/linux/falloc.h        |   18 +
>  include/uapi/linux/fs.h            |    4 +-
>  76 files changed, 10683 insertions(+), 413 deletions(-)
>  create mode 100644 fs/xfs/libxfs/xfs_refcount.c
>  create mode 100644 fs/xfs/libxfs/xfs_refcount.h
>  create mode 100644 fs/xfs/libxfs/xfs_refcount_btree.c
>  create mode 100644 fs/xfs/libxfs/xfs_refcount_btree.h
>  create mode 100644 fs/xfs/xfs_bmap_item.c
>  create mode 100644 fs/xfs/xfs_bmap_item.h
>  create mode 100644 fs/xfs/xfs_refcount_item.c
>  create mode 100644 fs/xfs/xfs_refcount_item.h
>  create mode 100644 fs/xfs/xfs_reflink.c
>  create mode 100644 fs/xfs/xfs_reflink.h
>  create mode 100644 fs/xfs/xfs_trans_bmap.c
>  create mode 100644 fs/xfs/xfs_trans_refcount.c
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html