On Tue, Jun 28, 2022 at 11:07:55PM -0700, Wengang Wang wrote: > During a reflink operation, the IOLOCK and MMAPLOCK of the source file > are held in exclusive mode for the duration. This prevents reads on the > source file, which could be a very long time if the source file has > millions of extents. > > As the source of copy, besides some necessary modification (say dirty page > flushing), it plays readonly role. Locking source file exclusively through > out the full reflink copy is unreasonable. > > This patch downgrades exclusive locks on source file to shared modes after > page cache flushing and before cloning the extents. To avoid source file > change after lock downgradation, direct write paths take IOLOCK_EXCL on > seeing reflink copy happening to the files. > > Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx> > --- > V2 changes: > Commit message > Make direct write paths take IOLOCK_EXCL when reflink copy is happening > Tiny changes > --- > fs/xfs/xfs_file.c | 33 ++++++++++++++++++++++++++++++--- > fs/xfs/xfs_inode.c | 31 +++++++++++++++++++++++++++++++ > fs/xfs/xfs_inode.h | 11 +++++++++++ > 3 files changed, 72 insertions(+), 3 deletions(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 5a171c0b244b..6ca7118ee274 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -514,8 +514,10 @@ xfs_file_dio_write_aligned( > struct iov_iter *from) > { > unsigned int iolock = XFS_IOLOCK_SHARED; > + int remapping; > ssize_t ret; > > +relock: > ret = xfs_ilock_iocb(iocb, iolock); > if (ret) > return ret; > @@ -523,14 +525,25 @@ xfs_file_dio_write_aligned( > if (ret) > goto out_unlock; > > + remapping = xfs_iflags_test(ip, XFS_IREMAPPING); > + > /* > * We don't need to hold the IOLOCK exclusively across the IO, so demote > * the iolock back to shared if we had to take the exclusive lock in > * xfs_file_write_checks() for other reasons. > + * But take IOLOCK_EXCL when reflink copy is going on > */ > if (iolock == XFS_IOLOCK_EXCL) { > - xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); > - iolock = XFS_IOLOCK_SHARED; > + if (!remapping) { > + xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); > + iolock = XFS_IOLOCK_SHARED; > + } > + } else { /* iolock == XFS_ILOCK_SHARED */ > + if (remapping) { > + xfs_iunlock(ip, iolock); > + iolock = XFS_IOLOCK_EXCL; > + goto relock; > + } > } I'm not sure we can do relocking here. We've already run xfs_file_write_checks() once which means we've already called file_modified() and made changes to the inode. Indeed, if we know that remapping is going on, why are we even starting with XFS_IOLOCK_SHARED? i.e. shouldn't this just be: unsigned int iolock = XFS_IOLOCK_SHARED; ssize_t ret; if (!xfs_iflags_test(ip, XFS_IREMAPPING)) iolock = XFS_IOLOCK_EXCL; retry: ret = xfs_ilock_iocb(iocb, iolock); if (ret) return ret; if (xfs_iflags_test(ip, XFS_IREMAPPING) && iolock == XFS_IOLOCK_SHARED) { /* Raced with a remap operation starting! */ xfs_iunlock(ip, XFS_IOLOCK_SHARED); iolock = XFS_IOLOCK_EXCL; goto restart; } .... And no other changes need to be made? i.e. we can downgrade the XFS_IOLOCK_EXCL safely at any time in this path because the barrier to reflink starting and setting the XFS_IREMAPPING flag is that it has to be holding XFS_IOLOCK_EXCL and holding the IOLOCK in any mode will hold off any reflink starting. However, the key thing to note is that holding the IOLOCK in either EXCL or SHARED does not not actually guarantee that there are no DIO writes in progress. We only have to hold the lock over submission to ensure the inode_dio_begin() call is serialised correctly..... i.e. when we do AIO DIO writes, we hold the IOLOCK_SHARED over submission while the inode->i_dio_count is incremented, but then drop the IOLOCK before completion occurs. Hence the only way to guarantee that there is no DIO writes in progress is to take IOLOCK_EXCL to stop new DIO writes from starting, and then call inode_dio_wait() to wait for i_dio_count to fall to zero. Once inode_dio_wait() returns, we're guaranteed that there are no DIO writes in progress. IOWs, the demotion from EXCL to SHARED in xfs_file_dio_write_aligned() code is completely irrelevant from the perspective of serialising against DIO writes in progress, so we don't need to touch that code. However, if we are going to demote the IOLOCK in the reflink code, we now need a separate barrier to prevent dio writes from starting while the reflink is in progress. As you've implemented, the IREMAPPING flag provides this barrier, but your code didn't handle the TOCTOU races in gaining the IOLOCK in the correct mode in the DIO path.... > trace_xfs_file_direct_write(iocb, from); > ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops, > @@ -1125,6 +1138,19 @@ xfs_file_remap_range( > if (ret || len == 0) > return ret; > > + /* > + * Set XFS_IREMAPPING flag to source file before we downgrade > + * the locks, so that all direct writes know they have to take > + * IOLOCK_EXCL. > + */ > + xfs_iflags_set(src, XFS_IREMAPPING); > + > + /* > + * From now on, we read only from src, so downgrade locks to allow > + * read operations go. > + */ > + xfs_ilock_io_mmap_downgrade_src(src, dest); Oh, you've been lucky here. :) xfs_reflink_remap_prep() -> generic_remap_file_range_prep() calls inode_dio_wait() on both inodes while they are locked exclusively. However, I don't think you can downgrade the mmap lock here - that will allow page faults to dirty the pages in the source file. I'm also not sure that we can even take the mmap lock exclusively in the page fault path like we can the IOLOCK in the IO path.... Hence I think it is simplest just to consider it unsafe to downgrade the mmap lock here and so only the IO lock can be downgraded. For read IO, that's the only one that needs to be downgraded, anyway. > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index 7be6f8e705ab..c07d4b42cf9d 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -262,6 +262,13 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip) > */ > #define XFS_INACTIVATING (1 << 13) > > +/* > + * A flag indicating reflink copy / remapping is happening to the file as > + * source. When set, all direct IOs should take IOLOCK_EXCL to avoid > + * interphering the remapping. > + */ > +#define XFS_IREMAPPING (1 << 14) > + > /* All inode state flags related to inode reclaim. */ > #define XFS_ALL_IRECLAIM_FLAGS (XFS_IRECLAIMABLE | \ > XFS_IRECLAIM | \ > @@ -512,5 +519,9 @@ void xfs_end_io(struct work_struct *work); > > int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2); > void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2); > +void xfs_ilock_io_mmap_downgrade_src(struct xfs_inode *src, > + struct xfs_inode *dest); > +void xfs_iunlock2_io_mmap_src_shared(struct xfs_inode *src, > + struct xfs_inode *dest); I suspect we are at the point where we really need to move all the inode locking out of xfs_inode.[ch] and into separate xfs_inode_locking.[ch] files. Not necessary for this patchset, but if we keep adding more locking helpers.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx