Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 20 Sep 2023 11:07:37 +1000

On Tue, Sep 19, 2023 at 05:00:58PM -0700, Darrick J. Wong wrote:
> On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
> > On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> > The XFS clone implementation takes the IOLOCK_EXCL high up, and
> > then lower down it iterates one extent doing the sharing operation.
> > It holds the ILOCK_EXCL while it is modifying the extent in both the
> > source and destination files, then commits the transaction and drops
> > the ILOCKs.
> > 
> > OK, so we have fine-grained ILOCK serialisation during the clone for
> > access/modification to the extent list. Excellent, I think we can
> > make this work.
> > 
> > So:
> > 
> > 1. take IOLOCK_EXCL like we already do on the source and destination
> > files.
> > 
> > 2. Once all the pre work is done, set a "clone in progress" flag on
> > the in-memory source inode.
> > 
> > 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
> > 
> > 4. read IO and the clone serialise access to the extent list via the
> > ILOCK. We know this works fine, because that's how the extent list
> > access serialisation for concurrent read and write direct IO works.
> > 
> > 5. buffered writes take the IOLOCK_EXCL, so they block until the
> > clone completes. Same behaviour as right now, all good.
> 
> I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right?  So
> once reflink breaks the layouts, we're good there too?

I think so.

<looks to confirm>

The pnfs code in xfs_fs_map_blocks() will reject mappings on any
inode marked with shared extents, so I think the fact that we
set the inode as having shared extents before we finish
xfs_reflink_remap_prep() will cause pnfs mappings to kick out before
we even take the IOLOCK.

But, regardless of that, both new PNFS mappings and DAX writes use
IOLOCK_EXCL, and xfs_ilock2_io_mmap() breaks both PNFS and DAX
layouts which will force them to finish what they are doing and sync
data before the clone operation grabs the IOLOCK_EXCL. They'll block
on the clone holding the IOLOCK from that point onwards, so I think
we're good here.

hmmmmm.

<notes that xfs_ilock2_io_mmap() calls filemap_invalidate_lock_two()>

Sigh.

That will block buffered reads trying to instantiate new pages
in the page cache. However, this isn't why the invalidate lock is
held - that's being held to lock out lock page faults (i.e. mmap()
access) whilst the clone is running.

We really only need to lock out mmap writes, and the only way to do
that is to prevent write faults from making progress whilst the
clone is running.

__xfs_filemap_fault() currently takes XFS_MMAPLOCK_SHARED for write
faults - I think we need it to look at the "clone in progress" flag
for write faults, too, and use XFS_MMAPLOCK_EXCL in that case.

That would then allow us to demote the invalidate lock on the source
file the same way we do the IOLOCK, allowing buffered reads to
populate the page caceh but have write faults block until the clone
completes (as they do now, same as writes).

Is there anything else I missed?

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx