On Thu, Aug 03, 2017 at 07:28:17PM -0700, Dan Williams wrote: > >From falloc.h: > > FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the > file logical-to-physical extent offset mappings in the file. The > purpose is to allow an application to assume that there are no holes > or shared extents in the file and that the metadata needed to find > all the physical extents of the file is stable and can never be > dirtied. > > For now this patch only permits setting the in-memory state of > S_IOMAP_IMMMUTABLE. Support for clearing and persisting the state is > saved for later patches. > > The implementation is careful to not allow the immutable state to change > while any process might have any established mappings. It reuses the > existing xfs_reflink_unshare() and xfs_alloc_file_space() to unshare > extents and fill all holes in the file. It then holds XFS_ILOCK_EXCL > while it validates the file is in the proper state and sets > S_IOMAP_IMMUTABLE. > > Cc: Jan Kara <jack@xxxxxxx> > Cc: Jeff Moyer <jmoyer@xxxxxxxxxx> > Cc: Christoph Hellwig <hch@xxxxxx> > Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx> > Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx> > Suggested-by: Dave Chinner <david@xxxxxxxxxxxxx> > Suggested-by: "Darrick J. Wong" <darrick.wong@xxxxxxxxxx> > Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx> > --- > fs/open.c | 11 +++++ > fs/xfs/xfs_bmap_util.c | 101 +++++++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_bmap_util.h | 2 + > fs/xfs/xfs_file.c | 14 ++++-- > include/linux/falloc.h | 3 + > include/uapi/linux/falloc.h | 19 ++++++++ > 6 files changed, 145 insertions(+), 5 deletions(-) > > diff --git a/fs/open.c b/fs/open.c > index 7395860d7164..e3aae59785ae 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -273,6 +273,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE))) > return -EINVAL; > > + /* > + * Seal block map operation should only be used exclusively, and > + * with the IMMUTABLE capability. > + */ > + if (mode & FALLOC_FL_SEAL_BLOCK_MAP) { > + if (!capable(CAP_LINUX_IMMUTABLE)) > + return -EPERM; > + if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP) > + return -EINVAL; > + } > + > if (!(file->f_mode & FMODE_WRITE)) > return -EBADF; > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > index fe0f8f7f4bb7..46d8eb9e19fc 100644 > --- a/fs/xfs/xfs_bmap_util.c > +++ b/fs/xfs/xfs_bmap_util.c > @@ -1393,6 +1393,107 @@ xfs_zero_file_space( > > } > > +/* Return 1 if hole detected, 0 if not, and < 0 if fail to determine */ > +STATIC int > +xfs_file_has_holes( > + struct xfs_inode *ip) > +{ > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_bmbt_irec *map; > + const int map_size = 10; /* constrain memory overhead */ > + int i, nmaps; > + int error = 0; > + xfs_fileoff_t lblkno = 0; > + xfs_filblks_t maxlblkcnt; > + > + map = kmem_alloc(map_size * sizeof(*map), KM_SLEEP); Sleeping with an inode fully locked and (eventually) a running transaction? Yikes. Just allocate one xfs_bmbt_irec on the stack and pass in nmaps=1. This method might fit better in libxfs/xfs_bmap.c where it'll be able to scan the extent list more quickly with the iext helpers. > + > + maxlblkcnt = XFS_B_TO_FSB(mp, i_size_read(VFS_I(ip))); > + do { > + nmaps = map_size; > + error = xfs_bmapi_read(ip, lblkno, maxlblkcnt - lblkno, > + map, &nmaps, 0); > + if (error) > + break; > + > + ASSERT(nmaps <= map_size); > + for (i = 0; i < nmaps; i++) { > + lblkno += map[i].br_blockcount; > + if (map[i].br_startblock == HOLESTARTBLOCK) { I think we also need to check for unwritten extents here, because a write to an unwritten block requires some zeroing and a mapping metdata update. > + error = 1; > + break; > + } > + } > + } while (nmaps > 0 && error == 0); > + > + kmem_free(map); > + return error; > +} > + > +int > +xfs_seal_file_space( > + struct xfs_inode *ip, > + xfs_off_t offset, > + xfs_off_t len) > +{ > + struct inode *inode = VFS_I(ip); > + struct address_space *mapping = inode->i_mapping; > + int error; > + > + ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL)); The IOLOCK must be held here too. > + > + if (offset) > + return -EINVAL; > + > + error = xfs_reflink_unshare(ip, offset, len); > + if (error) > + return error; > + > + error = xfs_alloc_file_space(ip, offset, len, > + XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO); > + if (error) > + return error; > + > + xfs_ilock(ip, XFS_ILOCK_EXCL); > + /* > + * Either the size changed after we performed allocation / > + * unsharing, or the request was too small to begin with. > + */ > + error = -EINVAL; > + if (len < i_size_read(inode)) > + goto out_unlock; > + > + /* > + * Allow DAX path to assume that the state of S_IOMAP_IMMUTABLE > + * will never change while any mapping is established. > + */ > + error = -EBUSY; > + if (mapping_mapped(mapping)) > + goto out_unlock; > + > + /* Did we race someone attempting to share extents? */ > + if (xfs_is_reflink_inode(ip)) > + goto out_unlock; > + > + /* Did we race a hole punch? */ > + error = xfs_file_has_holes(ip); > + if (error == 1) { > + error = -EBUSY; > + goto out_unlock; > + } > + > + /* Abort on an error reading the block map */ > + if (error < 0) > + goto out_unlock; > + > + inode->i_flags |= S_IOMAP_IMMUTABLE; > + > +out_unlock: > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > + > + return error; > +} > + > /* > * @next_fsb will keep track of the extent currently undergoing shift. > * @stop_fsb will keep track of the extent at which we have to stop. > diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h > index 0cede1043571..5115a32a2483 100644 > --- a/fs/xfs/xfs_bmap_util.h > +++ b/fs/xfs/xfs_bmap_util.h > @@ -60,6 +60,8 @@ int xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset, > xfs_off_t len); > int xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset, > xfs_off_t len); > +int xfs_seal_file_space(struct xfs_inode *, xfs_off_t offset, > + xfs_off_t len); > > /* EOF block manipulation functions */ > bool xfs_can_free_eofblocks(struct xfs_inode *ip, bool force); > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index c4893e226fd8..e21121530a90 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -739,7 +739,8 @@ xfs_file_write_iter( > #define XFS_FALLOC_FL_SUPPORTED \ > (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \ > FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | \ > - FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE) > + FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE | \ > + FALLOC_FL_SEAL_BLOCK_MAP) > > STATIC long > xfs_file_fallocate( > @@ -834,9 +835,14 @@ xfs_file_fallocate( > error = xfs_reflink_unshare(ip, offset, len); > if (error) > goto out_unlock; > - } > - error = xfs_alloc_file_space(ip, offset, len, > - XFS_BMAPI_PREALLOC); > + > + error = xfs_alloc_file_space(ip, offset, len, > + XFS_BMAPI_PREALLOC); > + } else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) { > + error = xfs_seal_file_space(ip, offset, len); > + } else > + error = xfs_alloc_file_space(ip, offset, len, > + XFS_BMAPI_PREALLOC); > } > if (error) > goto out_unlock; > diff --git a/include/linux/falloc.h b/include/linux/falloc.h > index 7494dc67c66f..48546c6fbec7 100644 > --- a/include/linux/falloc.h > +++ b/include/linux/falloc.h > @@ -26,6 +26,7 @@ struct space_resv { > FALLOC_FL_COLLAPSE_RANGE | \ > FALLOC_FL_ZERO_RANGE | \ > FALLOC_FL_INSERT_RANGE | \ > - FALLOC_FL_UNSHARE_RANGE) > + FALLOC_FL_UNSHARE_RANGE | \ > + FALLOC_FL_SEAL_BLOCK_MAP) > > #endif /* _FALLOC_H_ */ > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h > index b075f601919b..39076975bf6f 100644 > --- a/include/uapi/linux/falloc.h > +++ b/include/uapi/linux/falloc.h > @@ -76,4 +76,23 @@ > */ > #define FALLOC_FL_UNSHARE_RANGE 0x40 > > +/* > + * FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the > + * file logical-to-physical extent offset mappings in the file. The > + * purpose is to allow an application to assume that there are no holes > + * or shared extents in the file and that the metadata needed to find > + * all the physical extents of the file is stable and can never be > + * dirtied. > + * > + * The immutable property is in effect for the entire inode, so the > + * range for this operation must start at offset 0 and len must be > + * greater than or equal to the current size of the file. If greater, > + * this operation allocates, unshares, hole fills, and seals in one 'allocates' is the same as 'hole fills', I think. This converts unwritten extents to zeroed written extents too, correct? > + * atomic step. If len is zero then the immutable state is cleared for > + * the inode. It's cleared if len == 0? I thought that was what FL_UNSEAL is for? > + * This flag implies FALLOC_FL_UNSHARE_RANGE and as such cannot be used > + * with the punch, zero, collapse, or insert range modes. > + */ > +#define FALLOC_FL_SEAL_BLOCK_MAP 0x080 > #endif /* _UAPI_FALLOC_H_ */ > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html