On Mon, Sep 23, 2024 at 05:28:16PM +0200, Christoph Hellwig wrote:
> Currently iomap_file_buffered_write_punch_delalloc can be called from
> XFS either with the invalidate lock held or not.  To fix this while
> keeping the locking in the file system and not the iomap library
> code we'll need to lift the locking up into the file system.
> 
> To prepare for that, open code iomap_file_buffered_write_punch_delalloc
> in the only caller, and instead export iomap_write_delalloc_release.
> 
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> ---
>  .../filesystems/iomap/operations.rst |  2 +-
>  fs/iomap/buffered-io.c               | 85 ++++++-------
>  fs/xfs/xfs_iomap.c                   | 16 +++-
>  include/linux/iomap.h                |  6 +-
>  4 files changed, 46 insertions(+), 63 deletions(-)
> 
> diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
> index 8e6c721d233010..b93115ab8748ae 100644
> --- a/Documentation/filesystems/iomap/operations.rst
> +++ b/Documentation/filesystems/iomap/operations.rst
> @@ -208,7 +208,7 @@ The filesystem must arrange to `cancel
>  such `reservations
>  <https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@xxxxxxxxxxxxxxxxxxx/>`_
>  because writeback will not consume the reservation.
> -The ``iomap_file_buffered_write_punch_delalloc`` can be called from a
> +The ``iomap_write_delalloc_release`` can be called from a
>  ``->iomap_end`` function to find all the clean areas of the folios
>  caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
>  It takes the ``invalidate_lock``.
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 884891ac7a226c..237aeb883166df 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1149,6 +1149,32 @@ static void iomap_write_delalloc_scan(struct inode *inode,
>   * have dirty data still pending in the page cache - those are going to be
>   * written and so must still retain the delalloc backing for writeback.
>   *
> + * When a short write occurs, the filesystem may need to remove reserved space
> + * that was allocated in ->iomap_begin from it's ->iomap_end method. For

"When a short write occurs, the filesystem may need to remove space
reservations created in ->iomap_begin."

> + * filesystems that use delayed allocation, we need to punch out delalloc
> + * extents from the range that are not dirty in the page cache. As the write can
> + * race with page faults, there can be dirty pages over the delalloc extent
> + * outside the range of a short write but still within the delalloc extent
> + * allocated for this iomap.
> + *
> + * The punch() callback *must* only punch delalloc extents in the range passed
> + * to it. It must skip over all other types of extents in the range and leave
> + * them completely unchanged. It must do this punch atomically with respect to
> + * other extent modifications.

Can a failing buffered write race with a write fault to the same file
range?

write() thread:                 page_mkwrite thread:
---------------                 --------------------
take i_rwsem
->iomap_begin
create da reservation
lock folio
fail to write
unlock folio
                                take invalidation lock
                                lock folio
                                ->iomap_begin
                                sees da reservation
                                mark folio dirty
                                unlock folio
                                drop invalidation lock
->iomap_end
take invalidation lock
iomap_write_delalloc_release
drop invalidation lock

Can we end up in this situation, where the write fault thinks it has a
dirty page backed by a delalloc reservation, yet the delalloc
reservation gets removed by the delalloc punch logic?
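
(If I'm reading iomap_write_delalloc_scan correctly, the per-folio
logic boils down to something like this -- a heavily simplified sketch
using the release function's inode/start_byte/punch/iomap arguments,
not the verbatim kernel code, and ignoring the per-block dirty
tracking that the real scan does for large folios:)

        folio = filemap_lock_folio(inode->i_mapping,
                        start_byte >> PAGE_SHIFT);
        if (!IS_ERR(folio)) {
                /* dirty folios keep their delalloc backing for writeback */
                if (!folio_test_dirty(folio))
                        punch(inode, folio_pos(folio), folio_size(folio),
                                        iomap);
                folio_unlock(folio);
                folio_put(folio);
        }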
I think the answer to my question is that this sequence is impossible
because the write fault dirties the folio, so
iomap_write_delalloc_release does nothing, correct?

Unrelated question about iomap_write_begin: Can we get rid of the
!mapping_large_folio_support if-body just prior to __iomap_get_folio?
filemap_get_folio won't return large folios if
!mapping_large_folio_support, so I think the separate check in iomap
isn't needed anymore?
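
(For reference, the if-body I mean is this one, just before the
__iomap_get_folio() call -- quoting from memory, so the exact code may
differ slightly:)

        if (!mapping_large_folio_support(iter->inode->i_mapping))
                len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));

        folio = __iomap_get_folio(iter, pos, len);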
This push-down looks fine though, so
Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

--D

> + *
> + * The punch() callback may be called with a folio locked to prevent writeback
> + * extent allocation racing at the edge of the range we are currently punching.
> + * The locked folio may or may not cover the range being punched, so it is not
> + * safe for the punch() callback to lock folios itself.
> + *
> + * Lock order is:
> + *
> + * inode->i_rwsem (shared or exclusive)
> + *   inode->i_mapping->invalidate_lock (exclusive)
> + *     folio_lock()
> + *       ->punch
> + *         internal filesystem allocation lock
> + *
>   * As we are scanning the page cache for data, we don't need to reimplement the
>   * wheel - mapping_seek_hole_data() does exactly what we need to identify the
>   * start and end of data ranges correctly even for sub-folio block sizes. This
> @@ -1177,7 +1203,7 @@ static void iomap_write_delalloc_scan(struct inode *inode,
>   * require sprinkling this code with magic "+ 1" and "- 1" arithmetic and expose
>   * the code to subtle off-by-one bugs....
>   */
> -static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> +void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
>  		loff_t end_byte, unsigned flags, struct iomap *iomap,
>  		iomap_punch_t punch)
>  {
> @@ -1243,62 +1269,7 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
>  out_unlock:
>  	filemap_invalidate_unlock(inode->i_mapping);
>  }
> -
> -/*
> - * When a short write occurs, the filesystem may need to remove reserved space
> - * that was allocated in ->iomap_begin from it's ->iomap_end method. For
> - * filesystems that use delayed allocation, we need to punch out delalloc
> - * extents from the range that are not dirty in the page cache. As the write can
> - * race with page faults, there can be dirty pages over the delalloc extent
> - * outside the range of a short write but still within the delalloc extent
> - * allocated for this iomap.
> - *
> - * This function uses [start_byte, end_byte) intervals (i.e. open ended) to
> - * simplify range iterations.
> - *
> - * The punch() callback *must* only punch delalloc extents in the range passed
> - * to it. It must skip over all other types of extents in the range and leave
> - * them completely unchanged. It must do this punch atomically with respect to
> - * other extent modifications.
> - *
> - * The punch() callback may be called with a folio locked to prevent writeback
> - * extent allocation racing at the edge of the range we are currently punching.
> - * The locked folio may or may not cover the range being punched, so it is not
> - * safe for the punch() callback to lock folios itself.
> - *
> - * Lock order is:
> - *
> - * inode->i_rwsem (shared or exclusive)
> - *   inode->i_mapping->invalidate_lock (exclusive)
> - *     folio_lock()
> - *       ->punch
> - *         internal filesystem allocation lock
> - */
> -void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
> -		loff_t pos, loff_t length, ssize_t written, unsigned flags,
> -		struct iomap *iomap, iomap_punch_t punch)
> -{
> -	loff_t start_byte;
> -	loff_t end_byte;
> -
> -	if (iomap->type != IOMAP_DELALLOC)
> -		return;
> -
> -	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> -	if (!(iomap->flags & IOMAP_F_NEW))
> -		return;
> -
> -	start_byte = iomap_last_written_block(inode, pos, written);
> -	end_byte = round_up(pos + length, i_blocksize(inode));
> -
> -	/* Nothing to do if we've written the entire delalloc extent */
> -	if (start_byte >= end_byte)
> -		return;
> -
> -	iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
> -			punch);
> -}
> -EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
> +EXPORT_SYMBOL_GPL(iomap_write_delalloc_release);
>  
>  static loff_t iomap_unshare_iter(struct iomap_iter *iter)
>  {
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 1e11f48814c0d0..30f2530b6d5461 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1227,8 +1227,20 @@ xfs_buffered_write_iomap_end(
>  	unsigned		flags,
>  	struct iomap		*iomap)
>  {
> -	iomap_file_buffered_write_punch_delalloc(inode, offset, length, written,
> -			flags, iomap, &xfs_buffered_write_delalloc_punch);
> +	loff_t			start_byte, end_byte;
> +
> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> +		return 0;
> +
> +	/* Nothing to do if we've written the entire delalloc extent */
> +	start_byte = iomap_last_written_block(inode, offset, written);
> +	end_byte = round_up(offset + length, i_blocksize(inode));
> +	if (start_byte >= end_byte)
> +		return 0;
> +
> +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
> +			xfs_buffered_write_delalloc_punch);
>  	return 0;
>  }
>  
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e62df5d93f04de..137e0783faa224 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -290,9 +290,9 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
>  
>  typedef void (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
>  		struct iomap *iomap);
> -void iomap_file_buffered_write_punch_delalloc(struct inode *inode, loff_t pos,
> -		loff_t length, ssize_t written, unsigned flag,
> -		struct iomap *iomap, iomap_punch_t punch);
> +void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> +		loff_t end_byte, unsigned flags, struct iomap *iomap,
> +		iomap_punch_t punch);
>  
>  int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		u64 start, u64 len, const struct iomap_ops *ops);
> -- 
> 2.45.2
> 