> On Sep 25, 2023, at 2:28 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > On Fri, Sep 22, 2023 at 09:18:48AM +1000, Dave Chinner wrote: >> On Thu, Sep 21, 2023 at 03:26:28PM -0700, Darrick J. Wong wrote: >>> On Wed, Sep 20, 2023 at 11:07:37AM +1000, Dave Chinner wrote: >>>> On Tue, Sep 19, 2023 at 05:00:58PM -0700, Darrick J. Wong wrote: >>>>> On Tue, Sep 19, 2023 at 03:51:24PM +1000, Dave Chinner wrote: >>>>>> On Tue, Sep 19, 2023 at 02:43:32AM +0000, Catherine Hoang wrote: >>>>>> The XFS clone implementation takes the IOLOCK_EXCL high up, and >>>>>> then lower down it iterates one extent doing the sharing operation. >>>>>> It holds the ILOCK_EXCL while it is modifying the extent in both the >>>>>> source and destination files, then commits the transaction and drops >>>>>> the ILOCKs. >>>>>> >>>>>> OK, so we have fine-grained ILOCK serialisation during the clone for >>>>>> access/modification to the extent list. Excellent, I think we can >>>>>> make this work. >>>>>> >>>>>> So: >>>>>> >>>>>> 1. take IOLOCK_EXCL like we already do on the source and destination >>>>>> files. >>>>>> >>>>>> 2. Once all the pre work is done, set a "clone in progress" flag on >>>>>> the in-memory source inode. >>>>>> >>>>>> 3. atomically demote the source inode IOLOCK_EXCL to IOLOCK_SHARED. >>>>>> >>>>>> 4. read IO and the clone serialise access to the extent list via the >>>>>> ILOCK. We know this works fine, because that's how the extent list >>>>>> access serialisation for concurrent read and write direct IO works. >>>>>> >>>>>> 5. buffered writes take the IOLOCK_EXCL, so they block until the >>>>>> clone completes. Same behaviour as right now, all good. >>>>> >>>>> I think pnfs layouts and DAX writes also take IOLOCK_EXCL, right? So >>>>> once reflink breaks the layouts, we're good there too? >>>> >>>> I think so. >>>> >>>> <looks to confirm> >>>> >>>> The pnfs code in xfs_fs_map_blocks() will reject mappings on any >>>> inode marked with shared extents, so I think the fact that we >>>> set the inode as having shared extents before we finish >>>> xfs_reflink_remap_prep() will cause pnfs mappings to kick out before >>>> we even take the IOLOCK. >>>> >>>> But, regardless of that, both new PNFS mappings and DAX writes use >>>> IOLOCK_EXCL, and xfs_ilock2_io_mmap() breaks both PNFS and DAX >>>> layouts which will force them to finish what they are doing and sync >>>> data before the clone operation grabs the IOLOCK_EXCL. They'll block >>>> on the clone holding the IOLOCK from that point onwards, so I think >>>> we're good here. >>>> >>>> hmmmmm. >>>> >>>> <notes that xfs_ilock2_io_mmap() calls filemap_invalidate_lock_two()> >>>> >>>> Sigh. >>>> >>>> That will block buffered reads trying to instantiate new pages >>>> in the page cache. However, this isn't why the invalidate lock is >>>> held - that's being held to lock out lock page faults (i.e. mmap() >>>> access) whilst the clone is running. >>>> >>>> >>>> We really only need to lock out mmap writes, and the only way to do >>>> that is to prevent write faults from making progress whilst the >>>> clone is running. >>>> >>>> __xfs_filemap_fault() currently takes XFS_MMAPLOCK_SHARED for write >>>> faults - I think we need it to look at the "clone in progress" flag >>>> for write faults, too, and use XFS_MMAPLOCK_EXCL in that case. >>>> >>>> That would then allow us to demote the invalidate lock on the source >>>> file the same way we do the IOLOCK, allowing buffered reads to >>>> populate the page caceh but have write faults block until the clone >>>> completes (as they do now, same as writes). >>>> >>>> Is there anything else I missed? >>> >>> I think that's it. I'd wondered how much we really care about reflink >>> stalling read faults, but yeah, let's fix both. >> >> Well, it's not so much about mmap as the fact that holding >> invalidate lock exclusive prevents adding or removing folios to the >> page cache from any path. Hence the change as I originally proposed >> would block the buffered read path trying to add pages to the page >> cache the same as it will block the read fault path.... > > Ah, ok. > > Catherine: Do you have enough information to get started on a proof of > concept? Yup, this solution makes sense to me. > > --D > >> Cheers, >> >> Dave. >> >> -- >> Dave Chinner >> david@xxxxxxxxxxxxx