iomap zero range performs a pagecache flush upon seeing unwritten extents
with dirty pagecache in order to determine the exact subranges that
require direct zeroing. This supports an optimization where clean,
unwritten ranges are skipped, as they are already zero on disk.

Certain zero range use cases are more sensitive to flush latency than
others. The kernel test robot recently reported a regression in the
following stress-ng workload on XFS:

  stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64

This workload involves a series of small, strided, size-extending
writes. On XFS, this produces a pattern of allocating post-eof
speculative preallocation, converting the preallocation to unwritten on
zero range calls, dirtying pagecache over the converted mapping, and
then repeating the sequence from the updated EOF. The net effect is a
pagecache flush on nearly every zero range call in the partial EOF
block zeroing use case.

To mitigate this problem, special case EOF block zeroing to prefer
zeroing over a pagecache flush when the EOF folio is already dirty.
This restores most of the lost performance by avoiding flushes on write
and truncate extension operations, while preserving the ability for
iomap to flush and properly process larger ranges.

Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
---

Hi iomap maintainers,

This is an incremental optimization for the regression reported by the
test robot here[1]. I'm not totally convinced this is necessary as an
immediate fix, but the discussion on that thread was enough to suggest
it could be. I don't love the factoring, but I had to play a bit of
whack-a-mole between fstests and stress-ng to restore performance while
still maintaining the behavior some of the tests expect.

On a positive note, exploring this gave me what I think is a better
idea for handling zero range overall, so I'm working on a followup that
reworks it by splitting zero range across block alignment boundaries
(similar to how something like truncate page range works, for example).
This simplifies things by isolating the dirty range check to a single
folio on an unaligned start offset, which lets the _iter() call do a
skip or zero (i.e. no more flush_and_stale()), and then unconditionally
flushing the aligned portion to end-of-range. The latter flush should
be a no-op for every use case I've seen so far, so this might entirely
avoid the need for anything more complex for zero range.
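For reference, a rough sketch of that idea (hypothetical helper and
structure, not the actual rework; it assumes the current
iomap_zero_range() signature and that the reworked iter can skip or
zero the single-folio head without flushing):

	static int iomap_zero_range_split(struct inode *inode, loff_t pos,
			loff_t len, bool *did_zero,
			const struct iomap_ops *ops)
	{
		loff_t aligned = round_up(pos, i_blocksize(inode));
		int ret;

		/*
		 * Unaligned head: spans at most one folio, so the dirty
		 * check can skip or zero in place without a flush.
		 */
		if (pos < aligned) {
			loff_t head = min(len, aligned - pos);

			ret = iomap_zero_range(inode, pos, head, did_zero,
					ops);
			if (ret)
				return ret;
			pos = aligned;
			len -= head;
		}
		if (!len)
			return 0;

		/*
		 * Aligned remainder: flush unconditionally so the mappings
		 * seen by the iter are stable, then zero. This is expected
		 * to be a no-op flush in the common use cases.
		 */
		ret = filemap_write_and_wait_range(inode->i_mapping, pos,
						   pos + len - 1);
		if (ret)
			return ret;
		return iomap_zero_range(inode, pos, len, did_zero, ops);
	}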
In summary, I'm posting this as an optional, more "stable-worthy" patch
for reference and for the maintainers to consider as they see fit. I
think it's reasonable to include if we are concerned about this
particular stress-ng test and are OK with it as a transient solution.
But if it were up to me, I'd probably sit on it for a bit to see
whether a more practical user/workload is affected, particularly since
I'm trying to rework this code anyway. This could always be applied as
a stable fix if really needed, but I just don't think the slightly more
invasive rework is appropriate for -rc.

Thoughts, reviews, flames appreciated.

Brian

[1] https://lore.kernel.org/linux-xfs/202410141536.1167190b-oliver.sang@xxxxxxxxx/

 fs/iomap/buffered-io.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index aa587b2142e2..8fd25b14d120 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1372,6 +1372,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
+	bool eof_zero = false;
 
 	/*
 	 * We must zero subranges of unwritten mappings that might be dirty in
@@ -1391,12 +1392,23 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 	 * triggers writeback time post-eof zeroing.
 	 */
 	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
-		if (*range_dirty) {
+		/* range is clean and already zeroed, nothing to do */
+		if (!*range_dirty)
+			return length;
+
+		/* flush for anything other than partial eof zeroing */
+		if (pos != i_size_read(iter->inode) ||
+		    (pos % i_blocksize(iter->inode)) == 0) {
 			*range_dirty = false;
 			return iomap_zero_iter_flush_and_stale(iter);
 		}
-		/* range is clean and already zeroed, nothing to do */
-		return length;
+		/*
+		 * Special case partial EOF zeroing. Since we know the EOF
+		 * folio is dirty, prefer in-memory zeroing for it. This avoids
+		 * excessive flush latency on frequent file size extending
+		 * operations.
+		 */
+		eof_zero = true;
 	}
 
 	do {
@@ -1415,6 +1427,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 		offset = offset_in_folio(folio, pos);
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
+		if (eof_zero && length > bytes)
+			length = bytes;
 
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
-- 
2.46.2