Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 22 Sep 2023 09:18:48 +1000

On Thu, Sep 21, 2023 at 03:26:28PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 20, 2023 at 11:07:37AM +1000, Dave Chinner wrote:
> > On Tue, Sep 19, 2023 at 05:00:58PM -0700, Darrick J. Wong wrote:
> > > On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
> > > > On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> > > > The XFS clone implementation takes the IOLOCK_EXCL high up, and
> > > > then lower down it iterates one extent doing the sharing operation.
> > > > It holds the ILOCK_EXCL while it is modifying the extent in both the
> > > > source and destination files, then commits the transaction and drops
> > > > the ILOCKs.
> > > > 
> > > > OK, so we have fine-grained ILOCK serialisation during the clone for
> > > > access/modification to the extent list. Excellent, I think we can
> > > > make this work.
> > > > 
> > > > So:
> > > > 
> > > > 1. take IOLOCK_EXCL like we already do on the source and destination
> > > > files.
> > > > 
> > > > 2. Once all the pre work is done, set a "clone in progress" flag on
> > > > the in-memory source inode.
> > > > 
> > > > 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
> > > > 
> > > > 4. read IO and the clone serialise access to the extent list via the
> > > > ILOCK. We know this works fine, because that's how the extent list
> > > > access serialisation for concurrent read and write direct IO works.
> > > > 
> > > > 5. buffered writes take the IOLOCK_EXCL, so they block until the
> > > > clone completes. Same behaviour as right now, all good.
> > > 
> > > I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right?  So
> > > once reflink breaks the layouts, we're good there too?
> > 
> > I think so.
> > 
> > <looks to confirm>
> > 
> > The pnfs code in xfs_fs_map_blocks() will reject mappings on any
> > inode marked with shared extents, so I think the fact that we
> > set the inode as having shared extents before we finish
> > xfs_reflink_remap_prep() will cause pnfs mappings to kick out before
> > we even take the IOLOCK.
> > 
> > But, regardless of that, both new PNFS mappings and DAX writes use
> > IOLOCK_EXCL, and xfs_ilock2_io_mmap() breaks both PNFS and DAX
> > layouts which will force them to finish what they are doing and sync
> > data before the clone operation grabs the IOLOCK_EXCL. They'll block
> > on the clone holding the IOLOCK from that point onwards, so I think
> > we're good here.
> > 
> > hmmmmm.
> > 
> > <notes that xfs_ilock2_io_mmap() calls filemap_invalidate_lock_two()>
> > 
> > Sigh.
> > 
> > That will block buffered reads trying to instantiate new pages
> > in the page cache. However, this isn't why the invalidate lock is
> > held - that's being held to lock out lock page faults (i.e. mmap()
> > access) whilst the clone is running.
> > 
> > 
> > We really only need to lock out mmap writes, and the only way to do
> > that is to prevent write faults from making progress whilst the
> > clone is running.
> > 
> > __xfs_filemap_fault() currently takes XFS_MMAPLOCK_SHARED for write
> > faults - I think we need it to look at the "clone in progress" flag
> > for write faults, too, and use XFS_MMAPLOCK_EXCL in that case.
> > 
> > That would then allow us to demote the invalidate lock on the source
> > file the same way we do the IOLOCK, allowing buffered reads to
> > populate the page caceh but have write faults block until the clone
> > completes (as they do now, same as writes).
> > 
> > Is there anything else I missed?
> 
> I think that's it.  I'd wondered how much we really care about reflink
> stalling read faults, but yeah, let's fix both.

Well, it's not so much about mmap as the fact that holding
invalidate lock exclusive prevents adding or removing folios to the
page cache from any path. Hence the change as I originally proposed
would block the buffered read path trying to add pages to the page
cache the same as it will block the read fault path....

Cheers,

Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx