On Mon, Jul 31, 2017 at 10:09 AM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote: > On Sat, Jul 29, 2017 at 12:43:40PM -0700, Dan Williams wrote: >> >From falloc.h: >> >> FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the >> file logical-to-physical extent offset mappings in the file. The >> purpose is to allow an application to assume that there are no holes >> or shared extents in the file and that the metadata needed to find >> all the physical extents of the file is stable and can never be >> dirtied. >> >> For now this patch only permits setting / clearing the in-memory state >> of S_IOMAP_IMMMUTABLE, persisting the state is saved for a later patch. >> >> The implementation is careful to not allow the immutable state to change >> while any process might have any established mappings. It reuses the >> existing xfs_reflink_unshare() and xfs_alloc_file_space() to unshare >> extents and fill all holes in the file, or otherwise extend the file >> size in the same operation that sets S_IOMAP_IMMUTABLE. >> >> Cc: Jan Kara <jack@xxxxxxx> >> Cc: Jeff Moyer <jmoyer@xxxxxxxxxx> >> Cc: Christoph Hellwig <hch@xxxxxx> >> Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx> >> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx> >> Suggested-by: Dave Chinner <david@xxxxxxxxxxxxx> >> Suggested-by: "Darrick J. Wong" <darrick.wong@xxxxxxxxxx> >> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx> >> --- >> fs/open.c | 26 ++++++++++++- >> fs/xfs/xfs_bmap_util.c | 86 +++++++++++++++++++++++++++++++++++++++++++ >> fs/xfs/xfs_bmap_util.h | 2 + >> fs/xfs/xfs_file.c | 14 +++++-- >> include/linux/falloc.h | 3 +- >> include/uapi/linux/falloc.h | 19 ++++++++++ >> 6 files changed, 142 insertions(+), 8 deletions(-) >> >> diff --git a/fs/open.c b/fs/open.c >> index 7395860d7164..df075484fad5 100644 >> --- a/fs/open.c >> +++ b/fs/open.c >> @@ -241,7 +241,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) >> struct inode *inode = file_inode(file); >> long ret; >> >> - if (offset < 0 || len <= 0) >> + if (offset < 0 || len < 0) >> + return -EINVAL; >> + >> + /* Allow zero len only for the unseal operation */ >> + if (!(mode & FALLOC_FL_SEAL_BLOCK_MAP) && len == 0) >> return -EINVAL; >> >> /* Return error if mode is not supported */ >> @@ -273,6 +277,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) >> (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE))) >> return -EINVAL; >> >> + /* >> + * Seal block map should only be used exclusively, and with >> + * the IMMUTABLE capability. >> + */ >> + if (mode & FALLOC_FL_SEAL_BLOCK_MAP) { >> + if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP) >> + return -EINVAL; >> + if (!capable(CAP_LINUX_IMMUTABLE)) >> + return -EPERM; >> + } >> + >> if (!(file->f_mode & FMODE_WRITE)) >> return -EBADF; >> >> @@ -292,9 +307,14 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) >> return -ETXTBSY; >> >> /* >> - * We cannot allow any allocation changes on an iomap immutable file >> + * We cannot allow any allocation changes on an iomap immutable >> + * file, however if the operation is FALLOC_FL_SEAL_BLOCK_MAP, >> + * call down to ->fallocate() to determine if the operations is >> + * allowed. ->fallocate() may either clear the flag when @len is >> + * zero, or validate that the requested operation is already the >> + * current state of the file. >> */ >> - if (IS_IOMAP_IMMUTABLE(inode)) >> + if (IS_IOMAP_IMMUTABLE(inode) && (!(mode & FALLOC_FL_SEAL_BLOCK_MAP))) >> return -ETXTBSY; >> >> /* >> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c >> index 93e955262d07..c4fc79a0704f 100644 >> --- a/fs/xfs/xfs_bmap_util.c >> +++ b/fs/xfs/xfs_bmap_util.c >> @@ -1387,6 +1387,92 @@ xfs_zero_file_space( >> >> } >> >> +int >> +xfs_seal_file_space( >> + struct xfs_inode *ip, >> + xfs_off_t offset, >> + xfs_off_t len) >> +{ >> + struct inode *inode = VFS_I(ip); >> + struct address_space *mapping = inode->i_mapping; >> + int error = 0; >> + >> + if (offset) >> + return -EINVAL; >> + >> + i_mmap_lock_read(mapping); > > (Are we allowed to take address_space->i_mmap_rwsem while holding > xfs_inode->i_mmaplock?) Empirically, yes. Lockdep complains when those locks are taken in the reverse order. That seems to be inconsistent with the "mmap_sem -> i_mmap_lock -> page_lock" note in the xfs_ilock comment. Am I confusing what i_mmap_lock means in that comment, is that the i_mmap_lock_read(), or the i_mmaplock in the xfs_inode? > >> + xfs_ilock(ip, XFS_ILOCK_EXCL); >> + if (len == 0) { >> + /* >> + * Clear the immutable flag provided there are no active >> + * mappings. The active mapping check prevents an >> + * application that is assuming a static block map, for >> + * DAX or peer-to-peer DMA, from having this state >> + * silently change behind its back. >> + */ >> + if (RB_EMPTY_ROOT(&mapping->i_mmap)) >> + inode->i_flags &= ~S_IOMAP_IMMUTABLE; >> + else >> + error = -EBUSY; >> + } else if (IS_IOMAP_IMMUTABLE(inode)) { >> + if (len == i_size_read(inode)) { >> + /* >> + * The file is already in the correct state, >> + * bail out without error below. >> + */ >> + len = 0; >> + } else { >> + /* too late to allocate more space */ >> + error = -ETXTBSY; >> + } >> + } else { >> + if (len < i_size_read(inode)) { >> + /* >> + * Since S_IOMAP_IMMUTABLE is inode global it >> + * does not make sense to fallocate(immutable) >> + * on a sub-range of the file. >> + */ >> + error = -EINVAL; >> + } else if (!RB_EMPTY_ROOT(&mapping->i_mmap)) { >> + /* >> + * It's not strictly required to prevent setting >> + * immutable while a file is already mapped, but >> + * we do it for simplicity and symmetry with the >> + * S_IOMAP_IMMUTABLE disable case. >> + */ >> + error = -EBUSY; >> + } else >> + inode->i_flags |= S_IOMAP_IMMUTABLE; >> + } >> + xfs_iunlock(ip, XFS_ILOCK_EXCL); >> + i_mmap_unlock_read(mapping); >> + >> + if (error || len == 0) >> + return error; >> + >> + /* >> + * From here, the immutable flag is already set, so new >> + * operations that would change the block map are prevented by >> + * upper layer code paths. Wwe can proceed to unshare and >> + * allocate zeroed / written extents. >> + */ >> + error = xfs_reflink_unshare(ip, offset, len); > > At this point we still hold the io and mmap locks and the vfs thinks the > inode is iomap_immutable, but we haven't actually fixed the block > mappings, which means that the flag is set but there could be holes and > shared extents aplenty? > > That seems strange to me -- wouldn't we want to try to unshare and > allocate, and only then take the ilock, check the mappings, and only set > the flag if nobody's messed with the extent map since the unshare & > allocated? IOWs, > > if (len == 0) > return xfs_unseal_file_space(); > > xfs_reflink_unshare(...); > xfs_alloc_file_space(...); > > xfs_ilock(...); > if (xfs_iomap_lacks_holes_and_shared_blocks(...)) { > VFS_I(ip)->i_flags |= S_IOMAP_IMMUTABLE; > ip->i_d.di_flags2 |= XFS_DIFLAG2_IOMAP_IMMUTABLE; > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > } else { > error = -EBUSY; > } > xfs_iunlock(...); > > (I guess we hold sufficient locks, but still...) Yes, that looks safer, especially if other vfs paths make assumptions about the block map upon seeing that flag set.