On Thu, Oct 24, 2024 at 01:41:14PM -0400, Brian Foster wrote:
> On Thu, Oct 24, 2024 at 10:08:17AM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 23, 2024 at 10:30:29AM -0400, Brian Foster wrote:
> > > iomap zero range performs a pagecache flush upon seeing unwritten
> > > extents with dirty pagecache in order to determine accurate
> > > subranges that require direct zeroing. This is to support an
> > > optimization where clean, unwritten ranges are skipped as they are
> > > already zero on-disk.
> > >
> > > Certain use cases for zero range are more sensitive to flush latency
> > > than others. The kernel test robot recently reported a regression in
> > > the following stress-ng workload on XFS:
> > >
> > > stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64
> > >
> > > This workload involves a series of small, strided, write extending
> > > writes. On XFS, this produces a pattern of allocating post-eof
> > > speculative preallocation, converting preallocation to unwritten on
> > > zero range calls, dirtying pagecache over the converted mapping, and
> > > then repeating the sequence again from the updated EOF. This
> > > basically produces a sequence of pagecache flushes on the partial
> > > EOF block zeroing use case of zero range.
> > >
> > > To mitigate this problem, special case the EOF block zeroing use
> > > case to prefer zeroing over a pagecache flush when the EOF folio is
> > > already dirty. This brings most of the performance back by avoiding
> > > flushes on write and truncate extension operations, while preserving
> > > the ability for iomap to flush and properly process larger ranges.
> > >
> > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > ---
> > >
> > > Hi iomap maintainers,
> > >
> > > This is an incremental optimization for the regression reported by the
> > > test robot here[1]. I'm not totally convinced this is necessary as an
> > > immediate fix, but the discussion on that thread was enough to suggest
> > > it could be. I don't really love the factoring, but I had to play a bit
> > > of whack-a-mole between fstests and stress-ng to restore performance and
> > > still maintain behavior expectations for some of the tests.
> > >
> > > On a positive note, exploring this gave me what I think is a better idea
> > > for dealing with zero range overall, so I'm working on a followup to
> > > this that reworks it by splitting zero range across block alignment
> > > boundaries (similar to how something like truncate page range works, for
> > > example). This simplifies things by isolating the dirty range check to a
> > > single folio on an unaligned start offset, which lets the _iter() call
> > > do a skip or zero (i.e. no more flush_and_stale()), and then
> > > unconditionally flush the aligned portion to end-of-range. The latter
> > > flush should be a no-op for every use case I've seen so far, so this
> > > might entirely avoid the need for anything more complex for zero range.
> > >
> > > In summary, I'm posting this as an optional and more "stable-worthy"
> > > patch for reference and for the maintainers to consider as they like. I
> > > think it's reasonable to include if we are concerned about this
> > > particular stress-ng test and are Ok with it as a transient solution.
> > > But if it were up to me, I'd probably sit on it for a bit to determine
> > > if a more practical user/workload is affected by this, particularly
> > > knowing that I'm trying to rework it.
> > > This could always be applied as a
> > > stable fix if really needed, but I just don't think the slightly more
> > > invasive rework is appropriate for -rc..
> > >
> > > Thoughts, reviews, flames appreciated.
> > >
> > > Brian
> > >
> > > [1] https://lore.kernel.org/linux-xfs/202410141536.1167190b-oliver.sang@xxxxxxxxx/
> > >
> > >  fs/iomap/buffered-io.c | 20 +++++++++++++++++---
> > >  1 file changed, 17 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index aa587b2142e2..8fd25b14d120 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1372,6 +1372,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > >  	loff_t pos = iter->pos;
> > >  	loff_t length = iomap_length(iter);
> > >  	loff_t written = 0;
> > > +	bool eof_zero = false;
> > >
> > >  	/*
> > >  	 * We must zero subranges of unwritten mappings that might be dirty in
> > > @@ -1391,12 +1392,23 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > >  	 * triggers writeback time post-eof zeroing.
> > >  	 */
> > >  	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> > > -		if (*range_dirty) {
> > > +		/* range is clean and already zeroed, nothing to do */
> > > +		if (!*range_dirty)
> > > +			return length;
> > > +
> > > +		/* flush for anything other than partial eof zeroing */
> > > +		if (pos != i_size_read(iter->inode) ||
> > > +		    (pos % i_blocksize(iter->inode)) == 0) {
> > >  			*range_dirty = false;
> > >  			return iomap_zero_iter_flush_and_stale(iter);
> > >  		}
> > > -		/* range is clean and already zeroed, nothing to do */
> > > -		return length;
> > > +		/*
> > > +		 * Special case partial EOF zeroing. Since we know the EOF
> > > +		 * folio is dirty, prefer in-memory zeroing for it. This avoids
> > > +		 * excessive flush latency on frequent file size extending
> > > +		 * operations.
> > > +		 */
> > > +		eof_zero = true;
> > >  	}
> > >
> > >  	do {
> > > @@ -1415,6 +1427,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > >  		offset = offset_in_folio(folio, pos);
> > >  		if (bytes > folio_size(folio) - offset)
> > >  			bytes = folio_size(folio) - offset;
> > > +		if (eof_zero && length > bytes)
> > > +			length = bytes;
> >
> > What does this do? I think this causes the loop to break after putting
> > the folio that caches @pos? And then I guess we go around the loop in
> > iomap_zero_range again if there were more bytes to zero after this
> > folio?
> >
>
> Yeah.. it's basically just saying that if we fell into folio zeroing due
> to the special case logic above, only process through the end of this
> particular folio and jump back out to process the rest of the range as
> normal. The idea was just to prevent going off and doing a bunch of
> unexpected zeroing across an unwritten mapping just because we had an
> unaligned range that starts with a dirty folio.
>
> FWIW, the reworked variant I have of this currently looks like the
> appended diff. The caveat is this can still flush if a large folio
> happens to overlap the two subranges, but as is seems to placate the
> stress-ng test. In theory, I think having something like an
> iomap_zero_folio(folio, start_pos, end_pos) that zeroed up through
> min(end_pos, folio_end_pos) for the unaligned part would mitigate that,
> but I'm not quite sure of a clean way to do that; particularly if we
> have a large folio made up of multiple mappings.
> I'm also still
> undecided on whether to unconditionally flush the rest or try to
> preserve the flush_and_stale() approach as well.
>
> Brian
>
> --- 8< ---
> ...

And here's another variant that preserves the flush_and_stale() behavior.
This one is compile tested only:

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index aa587b2142e2..37a27c344078 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1365,40 +1365,12 @@ static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
 	return filemap_write_and_wait_range(mapping, i->pos, end);
 }
 
-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
-		bool *range_dirty)
+static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 {
-	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
 
-	/*
-	 * We must zero subranges of unwritten mappings that might be dirty in
-	 * pagecache from previous writes. We only know whether the entire range
-	 * was clean or not, however, and dirty folios may have been written
-	 * back or reclaimed at any point after mapping lookup.
-	 *
-	 * The easiest way to deal with this is to flush pagecache to trigger
-	 * any pending unwritten conversions and then grab the updated extents
-	 * from the fs. The flush may change the current mapping, so mark it
-	 * stale for the iterator to remap it for the next pass to handle
-	 * properly.
-	 *
-	 * Note that holes are treated the same as unwritten because zero range
-	 * is (ab)used for partial folio zeroing in some cases. Hole backed
-	 * post-eof ranges can be dirtied via mapped write and the flush
-	 * triggers writeback time post-eof zeroing.
-	 */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
-		if (*range_dirty) {
-			*range_dirty = false;
-			return iomap_zero_iter_flush_and_stale(iter);
-		}
-		/* range is clean and already zeroed, nothing to do */
-		return length;
-	}
-
 	do {
 		struct folio *folio;
 		int status;
@@ -1434,38 +1406,69 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 	return written;
 }
 
+static inline void
+iomap_iter_init(struct iomap_iter *iter, struct inode *inode, loff_t pos,
+		loff_t len)
+{
+	memset(iter, 0, sizeof(*iter));
+	iter->inode = inode;
+	iter->pos = pos;
+	iter->len = len;
+	iter->flags = IOMAP_ZERO;
+}
+
 int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		const struct iomap_ops *ops)
 {
-	struct iomap_iter iter = {
-		.inode = inode,
-		.pos = pos,
-		.len = len,
-		.flags = IOMAP_ZERO,
-	};
+	struct iomap_iter iter;
+	struct address_space *mapping = inode->i_mapping;
+	unsigned int blocksize = i_blocksize(inode);
+	unsigned int off = pos & (blocksize - 1);
+	loff_t plen = min_t(loff_t, len, blocksize - off);
+	bool dirty;
 	int ret;
-	bool range_dirty;
+
+	iomap_iter_init(&iter, inode, pos, len);
 
 	/*
-	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
-	 * pagecache must be flushed to ensure stale data from previous
-	 * buffered writes is not exposed. A flush is only required for certain
-	 * types of mappings, but checking pagecache after mapping lookup is
-	 * racy with writeback and reclaim.
+	 * Zero range wants to skip mappings that are already zero on disk, but
+	 * the only way to handle unwritten mappings covered by dirty pagecache
+	 * is to flush and reprocess the converted mappings after I/O
+	 * completion.
 	 *
-	 * Therefore, check the entire range first and pass along whether any
-	 * part of it is dirty. If so and an underlying mapping warrants it,
-	 * flush the cache at that point. This trades off the occasional false
-	 * positive (and spurious flush, if the dirty data and mapping don't
-	 * happen to overlap) for simplicity in handling a relatively uncommon
-	 * situation.
+	 * The partial EOF zeroing use case is performance sensitive, so split
+	 * and handle an unaligned start of the range separately. The dirty
+	 * check tells the iter function whether it can skip or zero the folio
+	 * without needing to flush. Larger ranges tend to have already been
+	 * flushed by the filesystem, so flush the rest here as a safety measure
+	 * and process as normal.
 	 */
-	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
-					pos, pos + len - 1);
+	if (off &&
+	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
+		iter.len = plen;
+		while ((ret = iomap_iter(&iter, ops)) > 0)
+			iter.processed = iomap_zero_iter(&iter, did_zero);
+		iomap_iter_init(&iter, inode, iter.pos, len - (iter.pos - pos));
+		if (ret || !iter.len)
+			return ret;
+	}
+
+	dirty = filemap_range_needs_writeback(mapping, iter.pos,
+					iter.pos + iter.len - 1);
+	while ((ret = iomap_iter(&iter, ops)) > 0) {
+		const struct iomap *s = iomap_iter_srcmap(&iter);
+		if (!(s->type == IOMAP_HOLE || s->type == IOMAP_UNWRITTEN)) {
+			iter.processed = iomap_zero_iter(&iter, did_zero);
+			continue;
+		}
+		iter.processed = iomap_length(&iter);
+		if (dirty) {
+			dirty = false;
+			iter.processed = iomap_zero_iter_flush_and_stale(&iter);
+		}
+	}
 
-	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
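
For illustration, a very rough sketch of the iomap_zero_folio() idea
mentioned above might look something like the below. The name and
signature are hypothetical (no such helper exists in iomap today), and
locking, folio lookup/release and folios that span multiple mappings are
all ignored; it only shows the "zero up through min(end_pos, end of
folio)" part:

/* hypothetical helper sketch, not part of either diff above */
static loff_t iomap_zero_folio(struct folio *folio, loff_t start_pos,
		loff_t end_pos)
{
	loff_t folio_end = folio_pos(folio) + folio_size(folio);
	size_t offset = offset_in_folio(folio, start_pos);
	size_t bytes = min_t(loff_t, end_pos, folio_end) - start_pos;

	/* zero the unaligned head in pagecache and redirty the folio */
	folio_zero_range(folio, offset, bytes);
	folio_mark_dirty(folio);
	return bytes;
}

With something along those lines, the unaligned head of a zero range
could be handled entirely in pagecache, so any flush (or
flush_and_stale()) would only ever apply to the block-aligned remainder
of the range.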