On Mon, Jul 11, 2022 at 01:13:30PM -0600, James Yonan wrote:
> RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and
> stands alongside existing flags including RENAME_NOREPLACE,
> RENAME_EXCHANGE, and RENAME_WHITEOUT.
>
> RENAME_NEWER_MTIME is a conditional variation on RENAME_NOREPLACE, and
> indicates that if the target of the rename exists, the rename or exchange
> will only succeed if the source file is newer than the target (i.e.
> source mtime > target mtime).  Otherwise, the rename will fail with
> -EEXIST instead of replacing the target.  When the target doesn't exist,
> RENAME_NEWER_MTIME does a plain rename like RENAME_NOREPLACE.
>
> RENAME_NEWER_MTIME can also be combined with RENAME_EXCHANGE for
> conditional exchange, where the exchange only occurs if source mtime >
> target mtime.  Otherwise, the operation will fail with -EEXIST.
>
> Some of the use cases for RENAME_NEWER_MTIME include (a) using a
> directory as a key-value store, or (b) maintaining a near-real-time
> mirror of a remote data source.  A common design pattern for maintaining
> such a data store would be to create a file using a temporary pathname,
> setting the file mtime using utimensat(2) or futimens(2) based on the
> remote creation timestamp of the file content, then using
> RENAME_NEWER_MTIME to move the file into place in the target directory.
> If the operation returns an error with errno == EEXIST, then the source
> file is not up-to-date and can safely be deleted.  The goal is to
> facilitate distributed systems having many concurrent writers and
> readers, where update notifications are possibly delayed, duplicated, or
> reordered, yet where readers see a consistent view of the target
> directory with predictable semantics and atomic updates.
>
> Note that RENAME_NEWER_MTIME depends on accurate, high-resolution
> timestamps for mtime, preferably approaching nanosecond resolution.
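
For readers following along, the workflow quoted above (set the mtime on a
temporary file, then move it into place only if it is newer than the target)
can be approximated in userspace with existing syscalls. This is only a
minimal sketch, not the patch's implementation: the helper names
mtime_newer(), set_mtime() and rename_if_newer() are made up for
illustration, and the check-then-rename sequence is NOT atomic, which is
exactly the gap the proposed kernel flag is meant to close.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: 1 if a is strictly later than b (seconds, then
 * nanoseconds), 0 otherwise.  Mirrors "source mtime > target mtime". */
int mtime_newer(struct timespec a, struct timespec b)
{
    if (a.tv_sec != b.tv_sec)
        return a.tv_sec > b.tv_sec;
    return a.tv_nsec > b.tv_nsec;
}

/* Hypothetical helper: set only the mtime of path, leaving atime alone. */
int set_mtime(const char *path, struct timespec mtime)
{
    struct timespec times[2];

    times[0].tv_sec = 0;
    times[0].tv_nsec = UTIME_OMIT;   /* do not touch atime */
    times[1] = mtime;
    return utimensat(AT_FDCWD, path, times, 0);
}

/* Hypothetical, RACY userspace approximation of the proposed semantics:
 * - target missing: plain rename
 * - target exists and source is not newer: fail with EEXIST
 * - either side is a directory: fail with EISDIR
 * Another writer can replace dst between lstat() and rename(). */
int rename_if_newer(const char *src, const char *dst)
{
    struct stat s, d;

    assert(src && dst);

    if (lstat(src, &s) != 0)
        return -1;
    if (S_ISDIR(s.st_mode)) {
        errno = EISDIR;
        return -1;
    }

    if (lstat(dst, &d) == 0) {
        if (S_ISDIR(d.st_mode)) {
            errno = EISDIR;
            return -1;
        }
        if (!mtime_newer(s.st_mtim, d.st_mtim)) {
            errno = EEXIST;
            return -1;
        }
    } else if (errno != ENOENT) {
        return -1;
    }

    return rename(src, dst);
}
```

A writer would call set_mtime() on its temporary file with the remote
timestamp, then rename_if_newer(tmp, final), and unlink the temporary file
if that fails with errno == EEXIST.
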
>
> RENAME_NEWER_MTIME is implemented in vfs_rename(), and we lock and deny
> write access to both source and target inodes before comparing their
> mtimes, to stabilize the comparison.
>
> The use case for RENAME_NEWER_MTIME doesn't really align with
> directories, so we return -EISDIR if either source or target is a
> directory.  This makes the locking necessary to stabilize the mtime
> comparison (in vfs_rename()) much more straightforward.
>
> Like RENAME_NOREPLACE, the RENAME_NEWER_MTIME implementation lives in
> the VFS, however the individual fs implementations do strict flags
> checking and will return -EINVAL for any flag they don't recognize.
> At this time, I have enabled and tested RENAME_NEWER_MTIME on ext2, ext3,
> ext4, xfs, btrfs, and tmpfs.
>
> I did not notice a general self-test for renameat2() at the VFS
> layer (outside of fs-specific tests),

We have a whole bunch of renameat2() tests in fstests that cover all
the functionality of renameat2(), and fsstress will also exercise
it in stress workloads, too:

$ git grep -l renameat2
.gitignore
common/renameat2
configure.ac
ltp/fsstress.c
src/Makefile
src/renameat2.c
tests/btrfs/247
tests/generic/023
tests/generic/024
tests/generic/025
tests/generic/078
tests/generic/398
tests/generic/419
tests/generic/585
tests/generic/621
tests/generic/626

> so I created one, though
> at the moment it only exercises RENAME_NEWER_MTIME and RENAME_EXCHANGE.
> The self-test is written to be portable to the Linux Test Project,
> and the advantage of running it there is that it automatically runs
> tests on multiple filesystems.  See comments at the beginning of
> renameat2_tests.c for more info.

Ideally, new renameat2 correctness tests should be added to fstests
as per the existing tests (as this is the primary test suite a lot
of fs developers use) so that we don't end up with partial test
coverage fragmented across different test suites.
It does us no favors to have non-overlapping partial coverage in
different test suites - we are better off implementing complete
coverage in one test suite and focusing our efforts there...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx