Re: proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange

Catherine Hoang <catherine.hoang@xxxxxxxxxx> · Tue, 26 Sep 2023 21:18:38 +0000

> On Sep 25, 2023, at 2:28 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> 
> On Fri, Sep 22, 2023 at 09:18:48AM +1000, Dave Chinner wrote:
>> On Thu, Sep 21, 2023 at 03:26:28PM -0700, Darrick J. Wong wrote:
>>> On Wed, Sep 20, 2023 at 11:07:37AM +1000, Dave Chinner wrote:
>>>> On Tue, Sep 19, 2023 at 05:00:58PM -0700, Darrick J. Wong wrote:
>>>>> On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote:
>>>>>> On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote:
>>>>>> The XFS clone implementation takes the IOLOCK_EXCL high up, and
>>>>>> then lower down it iterates one extent doing the sharing operation.
>>>>>> It holds the ILOCK_EXCL while it is modifying the extent in both the
>>>>>> source and destination files, then commits the transaction and drops
>>>>>> the ILOCKs.
>>>>>> 
>>>>>> OK, so we have fine-grained ILOCK serialisation during the clone for
>>>>>> access/modification to the extent list. Excellent, I think we can
>>>>>> make this work.
>>>>>> 
>>>>>> So:
>>>>>> 
>>>>>> 1. take IOLOCK_EXCL like we already do on the source and destination
>>>>>> files.
>>>>>> 
>>>>>> 2. Once all the pre work is done, set a "clone in progress" flag on
>>>>>> the in-memory source inode.
>>>>>> 
>>>>>> 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED.
>>>>>> 
>>>>>> 4. read IO and the clone serialise access to the extent list via the
>>>>>> ILOCK. We know this works fine, because that's how the extent list
>>>>>> access serialisation for concurrent read and write direct IO works.
>>>>>> 
>>>>>> 5. buffered writes take the IOLOCK_EXCL, so they block until the
>>>>>> clone completes. Same behaviour as right now, all good.
>>>>> 
>>>>> I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right?  So
>>>>> once reflink breaks the layouts, we're good there too?
>>>> 
>>>> I think so.
>>>> 
>>>> <looks to confirm>
>>>> 
>>>> The pnfs code in xfs_fs_map_blocks() will reject mappings on any
>>>> inode marked with shared extents, so I think the fact that we
>>>> set the inode as having shared extents before we finish
>>>> xfs_reflink_remap_prep() will cause pnfs mappings to kick out before
>>>> we even take the IOLOCK.
>>>> 
>>>> But, regardless of that, both new PNFS mappings and DAX writes use
>>>> IOLOCK_EXCL, and xfs_ilock2_io_mmap() breaks both PNFS and DAX
>>>> layouts which will force them to finish what they are doing and sync
>>>> data before the clone operation grabs the IOLOCK_EXCL. They'll block
>>>> on the clone holding the IOLOCK from that point onwards, so I think
>>>> we're good here.
>>>> 
>>>> hmmmmm.
>>>> 
>>>> <notes that xfs_ilock2_io_mmap() calls filemap_invalidate_lock_two()>
>>>> 
>>>> Sigh.
>>>> 
>>>> That will block buffered reads trying to instantiate new pages
>>>> in the page cache. However, this isn't why the invalidate lock is
>>>> held - that's being held to lock out lock page faults (i.e. mmap()
>>>> access) whilst the clone is running.
>>>> 
>>>> 
>>>> We really only need to lock out mmap writes, and the only way to do
>>>> that is to prevent write faults from making progress whilst the
>>>> clone is running.
>>>> 
>>>> __xfs_filemap_fault() currently takes XFS_MMAPLOCK_SHARED for write
>>>> faults - I think we need it to look at the "clone in progress" flag
>>>> for write faults, too, and use XFS_MMAPLOCK_EXCL in that case.
>>>> 
>>>> That would then allow us to demote the invalidate lock on the source
>>>> file the same way we do the IOLOCK, allowing buffered reads to
>>>> populate the page caceh but have write faults block until the clone
>>>> completes (as they do now, same as writes).
>>>> 
>>>> Is there anything else I missed?
>>> 
>>> I think that's it.  I'd wondered how much we really care about reflink
>>> stalling read faults, but yeah, let's fix both.
>> 
>> Well, it's not so much about mmap as the fact that holding
>> invalidate lock exclusive prevents adding or removing folios to the
>> page cache from any path. Hence the change as I originally proposed
>> would block the buffered read path trying to add pages to the page
>> cache the same as it will block the read fault path....
> 
> Ah, ok.
> 
> Catherine: Do you have enough information to get started on a proof of
> concept?

Yup, this solution makes sense to me.

> 
> --D
> 
>> Cheers,
>> 
>> Dave.
>> 
>> -- 
>> Dave Chinner
>> david@xxxxxxxxxxxxx