Hi all,

This series creates a new FIEXCHANGE_RANGE system call to exchange ranges of bytes between two files atomically. This new functionality enables data storage programs to stage and commit file updates such that reader programs will see either the old contents or the new contents in their entirety, with no chance of torn writes. Successful completion of the call guarantees that the new contents will be seen even if the system fails.

The ability to swap extent mappings between files in this manner is critical to supporting online filesystem repair, which is built upon the strategy of constructing a clean copy of a damaged structure and committing the new structure into the metadata file atomically.

User programs will be able to update files atomically by opening an O_TMPFILE, reflinking the source file to it, making whatever updates they want to make, and exchanging the relevant ranges of the temp file with the original file. If the updates are aligned with the file block size, a new (since v2) flag provides for exchanging only the written areas. Callers can arrange for the update to be rejected if the original file has been changed.

The intent behind this new userspace functionality is to enable atomic rewrites of arbitrary parts of individual files. For years, application programmers wanting to ensure the atomicity of a file update had to write the changes to a new file in the same directory, fsync the new file, rename the new file on top of the old filename, and then fsync the directory. People get it wrong all the time, and $fs hacks abound.

The reference implementation in XFS creates a new log incompat feature and log intent items to track high level progress of swapping ranges of two files and to finish interrupted work if the system goes down.

Sample code can be found in the corresponding changes to xfs_io to exercise the use case mentioned above.

Note that this function is /not/ the O_DIRECT atomic file writes concept that has also been floating around for years.
This RFC is constructed entirely in software, which means that there are no limitations other than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality is online repair of file-based metadata. The atomic file swap is implemented as an atomic inode fork swap, which means that we can implement online reconstruction of extended attributes and directories by building a new one in another inode and atomically swapping the contents.

Subsequent patchsets adapt the online filesystem repair code to use atomic extent swapping. This enables repair functions to construct a clean copy of a directory, xattr information, symbolic links, realtime bitmaps, and realtime summary information in a temporary inode. If this completes successfully, the new contents can be swapped atomically into the inode being repaired. This is essential to avoid making corruption problems worse if the system goes down in the middle of running repair.

This patchset also ports the old XFS extent swap ioctl interface to use the new extent swap code.

For userspace, this series also includes the userspace pieces needed to test the new functionality, and a sample implementation of atomic file updates.

Question: Should we really bother with fsdevel bikeshedding? Most filesystems cannot support this functionality, so we could keep it private to XFS for now.

If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything. Enjoy!
Comments and questions are, as always, welcome.
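
For comparison, the traditional rename-based update sequence described above (write a new file in the same directory, fsync it, rename it over the old name, fsync the directory) could look roughly like the sketch below. This is only an illustration of the existing practice, not code from this series; the helper names replace_file_via_rename() and write_all() and the ".tmpXXXXXX" suffix are made up for the example, and it replaces the whole file rather than a range.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write all of buf, retrying short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: replace the contents of "path" with buf using the
 * classic sequence.  Returns 0 on success, -1 on error.
 */
static int replace_file_via_rename(const char *path, const char *buf,
				   size_t len)
{
	char tmp[PATH_MAX], dirbuf[PATH_MAX];
	int fd, dirfd, ret = -1;

	if (snprintf(tmp, sizeof(tmp), "%s.tmpXXXXXX", path) >= (int)sizeof(tmp))
		return -1;
	if (snprintf(dirbuf, sizeof(dirbuf), "%s", path) >= (int)sizeof(dirbuf))
		return -1;

	/* 1. Create the new file in the same directory as the target. */
	fd = mkstemp(tmp);
	if (fd < 0)
		return -1;

	/* 2. Write the new contents and flush them to stable storage. */
	if (write_all(fd, buf, len) || fsync(fd))
		goto out_unlink;

	/* 3. Atomically replace the old name with the new file. */
	if (rename(tmp, path))
		goto out_unlink;

	/* 4. fsync the parent directory so the rename itself is persisted. */
	dirfd = open(dirname(dirbuf), O_RDONLY | O_DIRECTORY);
	if (dirfd >= 0) {
		ret = fsync(dirfd);
		close(dirfd);
	}
	close(fd);
	return ret;

out_unlink:
	unlink(tmp);
	close(fd);
	return -1;
}
```

A caller would invoke something like replace_file_via_rename("data.db", buf, len). Skipping either fsync is the usual way this goes wrong, and the approach cannot update only part of a file without copying the rest.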

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-updates

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-updates

xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/log/?h=atomic-file-updates
---
 configure.ac                    |    1
 fsr/xfs_fsr.c                   |  214 ++++---
 include/builddefs.in            |    1
 include/jdm.h                   |   24 +
 include/xfs_inode.h             |    5
 include/xfs_trace.h             |   13
 io/Makefile                     |    6
 io/atomicupdate.c               |  387 ++++++++++++
 io/init.c                       |    1
 io/inject.c                     |    1
 io/io.h                         |    5
 io/open.c                       |   27 +
 io/swapext.c                    |  195 +++++-
 libfrog/Makefile                |    6
 libfrog/fiexchange.h            |  105 +++
 libfrog/file_exchange.c         |  184 ++++++
 libfrog/file_exchange.h         |   16
 libfrog/fsgeom.c                |   45 +
 libfrog/fsgeom.h                |    7
 libhandle/jdm.c                 |  117 ++++
 libxfs/Makefile                 |    2
 libxfs/defer_item.c             |   79 ++
 libxfs/libxfs_priv.h            |   30 +
 libxfs/xfs_bmap.h               |    4
 libxfs/xfs_defer.c              |    7
 libxfs/xfs_defer.h              |    3
 libxfs/xfs_errortag.h           |    4
 libxfs/xfs_format.h             |   15
 libxfs/xfs_fs.h                 |    2
 libxfs/xfs_log_format.h         |   80 ++
 libxfs/xfs_sb.c                 |    3
 libxfs/xfs_swapext.c            | 1256 +++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_swapext.h            |  170 +++++
 libxfs/xfs_symlink_remote.c     |   47 +
 libxfs/xfs_symlink_remote.h     |    1
 libxfs/xfs_trans_space.h        |    4
 logprint/log_misc.c             |   11
 logprint/log_print_all.c        |   12
 logprint/log_redo.c             |  128 ++++
 logprint/logprint.h             |    6
 m4/package_libcdev.m4           |   20 +
 man/man2/ioctl_xfs_fsgeometry.2 |    3
 man/man8/xfs_io.8               |   87 +++
 43 files changed, 3183 insertions(+), 151 deletions(-)
 create mode 100644 io/atomicupdate.c
 create mode 100644 libfrog/fiexchange.h
 create mode 100644 libfrog/file_exchange.c
 create mode 100644 libfrog/file_exchange.h
 create mode 100644 libxfs/xfs_swapext.c
 create mode 100644 libxfs/xfs_swapext.h