On Wed, Oct 23, 2024 at 10:30:29AM -0400, Brian Foster wrote:
> iomap zero range performs a pagecache flush upon seeing unwritten
> extents with dirty pagecache in order to determine accurate subranges
> that require direct zeroing. This is to support an optimization where
> clean, unwritten ranges are skipped as they are already zero on-disk.
> 
> Certain use cases for zero range are more sensitive to flush latency
> than others. The kernel test robot recently reported a regression in
> the following stress-ng workload on XFS:
> 
> stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64
> 
> This workload involves a series of small, strided, extending writes.
> On XFS, this produces a pattern of allocating post-eof speculative
> preallocation, converting preallocation to unwritten on zero range
> calls, dirtying pagecache over the converted mapping, and then
> repeating the sequence again from the updated EOF. This basically
> produces a sequence of pagecache flushes on the partial EOF block
> zeroing use case of zero range.
> 
> To mitigate this problem, special case the EOF block zeroing use case
> to prefer zeroing over a pagecache flush when the EOF folio is
> already dirty. This brings most of the performance back by avoiding
> flushes on write and truncate extension operations, while preserving
> the ability for iomap to flush and properly process larger ranges.
> 
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
> 
> Hi iomap maintainers,
> 
> This is an incremental optimization for the regression reported by
> the test robot here[1]. I'm not totally convinced this is necessary
> as an immediate fix, but the discussion on that thread was enough to
> suggest it could be. I don't really love the factoring, but I had to
> play a bit of whack-a-mole between fstests and stress-ng to restore
> performance and still maintain behavior expectations for some of the
> tests.
> 
> On a positive note, exploring this gave me what I think is a better
> idea for dealing with zero range overall, so I'm working on a
> followup to this that reworks it by splitting zero range across block
> alignment boundaries (similar to how something like truncate page
> range works, for example). This simplifies things by isolating the
> dirty range check to a single folio on an unaligned start offset,
> which lets the _iter() call do a skip or zero (i.e. no more
> flush_and_stale()), and then unconditionally flush the aligned
> portion to end-of-range. The latter flush should be a no-op for every
> use case I've seen so far, so this might entirely avoid the need for
> anything more complex for zero range.
> 
> In summary, I'm posting this as an optional and more "stable-worthy"
> patch for reference and for the maintainers to consider as they like.
> I think it's reasonable to include if we are concerned about this
> particular stress-ng test and are OK with it as a transient solution.
> But if it were up to me, I'd probably sit on it for a bit to
> determine whether a more practical user/workload is affected by this,
> particularly knowing that I'm trying to rework it. This could always
> be applied as a stable fix if really needed, but I just don't think
> the slightly more invasive rework is appropriate for -rc.
> 
> Thoughts, reviews, flames appreciated.
> 
> Brian
> 
> [1] https://lore.kernel.org/linux-xfs/202410141536.1167190b-oliver.sang@xxxxxxxxx/
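
As a rough illustration of the split described in the cover letter
above, the rework might take a shape along these lines. This is a
sketch only: iomap_zero_unaligned_start() and iomap_zero_aligned_range()
are hypothetical helper names invented here to show the structure, not
functions that exist in the tree.

static int iomap_zero_range_split_sketch(struct inode *inode, loff_t pos,
		loff_t len, bool *did_zero, const struct iomap_ops *ops)
{
	loff_t aligned = round_up(pos, i_blocksize(inode));
	int ret;

	/*
	 * Sub-block lead-in: the dirty range check is confined to the
	 * single folio backing @pos, so the _iter() call can zero it
	 * in-memory if dirty or skip it if clean, with no
	 * flush_and_stale() involved.
	 */
	if (pos < aligned) {
		loff_t bytes = min_t(loff_t, aligned - pos, len);

		ret = iomap_zero_unaligned_start(inode, pos, bytes,
				did_zero, ops);
		if (ret)
			return ret;
		pos += bytes;
		len -= bytes;
	}

	/*
	 * Block-aligned remainder: flush unconditionally before zeroing,
	 * which should be a no-op for the common size-extension cases.
	 */
	if (!len)
		return 0;
	return iomap_zero_aligned_range(inode, pos, len, did_zero, ops);
}
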
> 
>  fs/iomap/buffered-io.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index aa587b2142e2..8fd25b14d120 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1372,6 +1372,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
>  	loff_t pos = iter->pos;
>  	loff_t length = iomap_length(iter);
>  	loff_t written = 0;
> +	bool eof_zero = false;
>  
>  	/*
>  	 * We must zero subranges of unwritten mappings that might be dirty in
> @@ -1391,12 +1392,23 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
>  	 * triggers writeback time post-eof zeroing.
>  	 */
>  	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> -		if (*range_dirty) {
> +		/* range is clean and already zeroed, nothing to do */
> +		if (!*range_dirty)
> +			return length;
> +
> +		/* flush for anything other than partial eof zeroing */
> +		if (pos != i_size_read(iter->inode) ||
> +		    (pos % i_blocksize(iter->inode)) == 0) {
>  			*range_dirty = false;
>  			return iomap_zero_iter_flush_and_stale(iter);
>  		}
> -		/* range is clean and already zeroed, nothing to do */
> -		return length;
> +		/*
> +		 * Special case partial EOF zeroing. Since we know the EOF
> +		 * folio is dirty, prefer in-memory zeroing for it. This avoids
> +		 * excessive flush latency on frequent file size extending
> +		 * operations.
> +		 */
> +		eof_zero = true;
>  	}
>  
>  	do {
> @@ -1415,6 +1427,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
>  		offset = offset_in_folio(folio, pos);
>  		if (bytes > folio_size(folio) - offset)
>  			bytes = folio_size(folio) - offset;
> +		if (eof_zero && length > bytes)
> +			length = bytes;

What does this do?  I think this causes the loop to break after putting
the folio that caches @pos?  And then I guess we go around the loop in
iomap_zero_range again if there were more bytes to zero after this
folio?

--D

> 
>  		folio_zero_range(folio, offset, bytes);
>  		folio_mark_accessed(folio);
> -- 
> 2.46.2
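
For readers following along, that reading does match the patch: the
clamp makes @length reach zero once the first (EOF) folio has been
zeroed, the do/while terminates, and iomap_zero_iter() returns the byte
count it processed so the iomap_iter() loop in iomap_zero_range() can
remap and re-enter the iter for any remaining bytes. A condensed
paraphrase of the folio loop (not the verbatim kernel source; locking
and error handling elided):

	do {
		/* iomap_write_begin() returns a locked folio for @pos */
		offset = offset_in_folio(folio, pos);
		if (bytes > folio_size(folio) - offset)
			bytes = folio_size(folio) - offset;
		if (eof_zero && length > bytes)
			length = bytes;	/* new: cap the pass at this folio */

		folio_zero_range(folio, offset, bytes);
		folio_mark_accessed(folio);
		/* iomap_write_end() dirties and releases the folio */

		pos += bytes;
		length -= bytes;	/* hits zero, so the loop exits */
		written += bytes;
	} while (length > 0);

	return written;	/* iomap_iter() advances the range by this much */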