Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Sep 20, 2023 at 11:07:37AM +1000, Dave Chinner wrote:
> On Tue, Sep 19, 2023 at 05:00:58PM -0700, Darrick J. Wong wrote:
> > On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
> > > On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
> > > The XFS clone implementation takes the IOLOCK_EXCL high up, and
> > > then lower down it iterates one extent doing the sharing operation.
> > > It holds the ILOCK_EXCL while it is modifying the extent in both the
> > > source and destination files, then commits the transaction and drops
> > > the ILOCKs.
> > > 
> > > OK, so we have fine-grained ILOCK serialisation during the clone for
> > > access/modification to the extent list. Excellent, I think we can
> > > make this work.
> > > 
> > > So:
> > > 
> > > 1. take IOLOCK_EXCL like we already do on the source and destination
> > > files.
> > > 
> > > 2. Once all the pre work is done, set a "clone in progress" flag on
> > > the in-memory source inode.
> > > 
> > > 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
> > > 
> > > 4. read IO and the clone serialise access to the extent list via the
> > > ILOCK. We know this works fine, because that's how the extent list
> > > access serialisation for concurrent read and write direct IO works.
> > > 
> > > 5. buffered writes take the IOLOCK_EXCL, so they block until the
> > > clone completes. Same behaviour as right now, all good.
> > 
> > I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right?  So
> > once reflink breaks the layouts, we're good there too?
> 
> I think so.
> 
> <looks to confirm>
> 
> The pnfs code in xfs_fs_map_blocks() will reject mappings on any
> inode marked with shared extents, so I think the fact that we
> set the inode as having shared extents before we finish
> xfs_reflink_remap_prep() will cause pnfs mappings to kick out before
> we even take the IOLOCK.
> 
> But, regardless of that, both new PNFS mappings and DAX writes use
> IOLOCK_EXCL, and xfs_ilock2_io_mmap() breaks both PNFS and DAX
> layouts which will force them to finish what they are doing and sync
> data before the clone operation grabs the IOLOCK_EXCL. They'll block
> on the clone holding the IOLOCK from that point onwards, so I think
> we're good here.
> 
> hmmmmm.
> 
> <notes that xfs_ilock2_io_mmap() calls filemap_invalidate_lock_two()>
> 
> Sigh.
> 
> That will block buffered reads trying to instantiate new pages
> in the page cache. However, this isn't why the invalidate lock is
> held - that's being held to lock out lock page faults (i.e. mmap()
> access) whilst the clone is running.
> 
> 
> We really only need to lock out mmap writes, and the only way to do
> that is to prevent write faults from making progress whilst the
> clone is running.
> 
> __xfs_filemap_fault() currently takes XFS_MMAPLOCK_SHARED for write
> faults - I think we need it to look at the "clone in progress" flag
> for write faults, too, and use XFS_MMAPLOCK_EXCL in that case.
> 
> That would then allow us to demote the invalidate lock on the source
> file the same way we do the IOLOCK, allowing buffered reads to
> populate the page caceh but have write faults block until the clone
> completes (as they do now, same as writes).
> 
> Is there anything else I missed?

I think that's it.  I'd wondered how much we really care about reflink
stalling read faults, but yeah, let's fix both.

--D

> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux