Re: [PATCH] iomap: elide zero range flush from partial eof zeroing

On Thu, Oct 24, 2024 at 01:41:14PM -0400, Brian Foster wrote:
> On Thu, Oct 24, 2024 at 10:08:17AM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 23, 2024 at 10:30:29AM -0400, Brian Foster wrote:
> > > iomap zero range performs a pagecache flush upon seeing unwritten
> > > extents with dirty pagecache in order to determine accurate
> > > subranges that require direct zeroing. This is to support an
> > > optimization where clean, unwritten ranges are skipped as they are
> > > already zero on-disk.
> > > 
> > > Certain use cases for zero range are more sensitive to flush latency
> > > than others. The kernel test robot recently reported a regression in
> > > the following stress-ng workload on XFS:
> > > 
> > >   stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64
> > > 
> > > This workload involves a series of small, strided, extending
> > > writes. On XFS, this produces a pattern of allocating post-eof
> > > speculative preallocation, converting preallocation to unwritten on
> > > zero range calls, dirtying pagecache over the converted mapping, and
> > > then repeating the sequence again from the updated EOF. This
> > > effectively produces a pagecache flush per zero range call for the
> > > partial EOF block zeroing use case.
> > > 
> > > To mitigate this problem, special case the EOF block zeroing use
> > > case to prefer zeroing over a pagecache flush when the EOF folio is
> > > already dirty. This brings most of the performance back by avoiding
> > > flushes on write and truncate extension operations, while preserving
> > > the ability for iomap to flush and properly process larger ranges.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > ---
> > > 
> > > Hi iomap maintainers,
> > > 
> > > This is an incremental optimization for the regression reported by the
> > > test robot here[1]. I'm not totally convinced this is necessary as an
> > > immediate fix, but the discussion on that thread was enough to suggest
> > > it could be. I don't really love the factoring, but I had to play a bit
> > > of whack-a-mole between fstests and stress-ng to restore performance and
> > > still maintain behavior expectations for some of the tests.
> > > 
> > > On a positive note, exploring this gave me what I think is a better idea
> > > for dealing with zero range overall, so I'm working on a followup to
> > > this that reworks it by splitting zero range across block alignment
> > > boundaries (similar to how something like truncate page range works, for
> > > example). This simplifies things by isolating the dirty range check to a
> > > single folio on an unaligned start offset, which lets the _iter() call
> > > do a skip or zero (i.e. no more flush_and_stale()), and then
> > > unconditionally flush the aligned portion to end-of-range. The latter
> > > flush should be a no-op for every use case I've seen so far, so this
> > > might entirely avoid the need for anything more complex for zero range.
> > > 
> > > In summary, I'm posting this as an optional and more "stable-worthy"
> > > patch for reference and for the maintainers to consider as they like. I
> > > think it's reasonable to include if we are concerned about this
> > > particular stress-ng test and are OK with it as a transient solution.
> > > But if it were up to me, I'd probably sit on it for a bit to determine
> > > if a more practical user/workload is affected by this, particularly
> > > knowing that I'm trying to rework it. This could always be applied as a
> > > stable fix if really needed, but I just don't think the slightly more
> > > invasive rework is appropriate for -rc.
> > > 
> > > Thoughts, reviews, flames appreciated.
> > > 
> > > Brian
> > > 
> > > [1] https://lore.kernel.org/linux-xfs/202410141536.1167190b-oliver.sang@xxxxxxxxx/
> > > 
> > >  fs/iomap/buffered-io.c | 20 +++++++++++++++++---
> > >  1 file changed, 17 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index aa587b2142e2..8fd25b14d120 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1372,6 +1372,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > >  	loff_t pos = iter->pos;
> > >  	loff_t length = iomap_length(iter);
> > >  	loff_t written = 0;
> > > +	bool eof_zero = false;
> > >  
> > >  	/*
> > >  	 * We must zero subranges of unwritten mappings that might be dirty in
> > > @@ -1391,12 +1392,23 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > >  	 * triggers writeback time post-eof zeroing.
> > >  	 */
> > >  	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> > > -		if (*range_dirty) {
> > > +		/* range is clean and already zeroed, nothing to do */
> > > +		if (!*range_dirty)
> > > +			return length;
> > > +
> > > +		/* flush for anything other than partial eof zeroing */
> > > +		if (pos != i_size_read(iter->inode) ||
> > > +		   (pos % i_blocksize(iter->inode)) == 0) {
> > >  			*range_dirty = false;
> > >  			return iomap_zero_iter_flush_and_stale(iter);
> > >  		}
> > > -		/* range is clean and already zeroed, nothing to do */
> > > -		return length;
> > > +		/*
> > > +		 * Special case partial EOF zeroing. Since we know the EOF
> > > +		 * folio is dirty, prefer in-memory zeroing for it. This avoids
> > > +		 * excessive flush latency on frequent file size extending
> > > +		 * operations.
> > > +		 */
> > > +		eof_zero = true;
> > >  	}
> > >  
> > >  	do {
> > > @@ -1415,6 +1427,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > >  		offset = offset_in_folio(folio, pos);
> > >  		if (bytes > folio_size(folio) - offset)
> > >  			bytes = folio_size(folio) - offset;
> > > +		if (eof_zero && length > bytes)
> > > +			length = bytes;
> > 
> > What does this do?  I think this causes the loop to break after putting
> > the folio that caches @pos?  And then I guess we go around the loop in
> > iomap_zero_range again if there were more bytes to zero after this
> > folio?
> > 
> 
> Yeah.. it's basically just saying that if we fell into folio zeroing due
> to the special case logic above, only process through the end of this
> particular folio and jump back out to process the rest of the range as
> normal. The idea was just to prevent going off and doing a bunch of
> unexpected zeroing across an unwritten mapping just because we had an
> unaligned range that starts with a dirty folio.
> 
> FWIW, the reworked variant I have of this currently looks like the
> appended diff. The caveat is that this can still flush if a large folio
> happens to overlap the two subranges, but as is it seems to placate the
> stress-ng test. In theory, I think having something like an
> iomap_zero_folio(folio, start_pos, end_pos) that zeroed up through
> min(end_pos, folio_end_pos) for the unaligned part would mitigate that,
> but I'm not quite sure of a clean way to do that, particularly if we
> have a large folio made up of multiple mappings. I'm also still
> undecided on whether to unconditionally flush the rest or try to
> preserve the flush_and_stale() approach as well.
> 
> Brian
> 
> --- 8< ---
> 
...

And here's another variant that preserves the flush_and_stale()
behavior. This one is compile tested only:

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index aa587b2142e2..37a27c344078 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1365,40 +1365,12 @@ static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
 	return filemap_write_and_wait_range(mapping, i->pos, end);
 }
 
-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
-		bool *range_dirty)
+static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 {
-	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
 
-	/*
-	 * We must zero subranges of unwritten mappings that might be dirty in
-	 * pagecache from previous writes. We only know whether the entire range
-	 * was clean or not, however, and dirty folios may have been written
-	 * back or reclaimed at any point after mapping lookup.
-	 *
-	 * The easiest way to deal with this is to flush pagecache to trigger
-	 * any pending unwritten conversions and then grab the updated extents
-	 * from the fs. The flush may change the current mapping, so mark it
-	 * stale for the iterator to remap it for the next pass to handle
-	 * properly.
-	 *
-	 * Note that holes are treated the same as unwritten because zero range
-	 * is (ab)used for partial folio zeroing in some cases. Hole backed
-	 * post-eof ranges can be dirtied via mapped write and the flush
-	 * triggers writeback time post-eof zeroing.
-	 */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
-		if (*range_dirty) {
-			*range_dirty = false;
-			return iomap_zero_iter_flush_and_stale(iter);
-		}
-		/* range is clean and already zeroed, nothing to do */
-		return length;
-	}
-
 	do {
 		struct folio *folio;
 		int status;
@@ -1434,38 +1406,69 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 	return written;
 }
 
+static inline void
+iomap_iter_init(struct iomap_iter *iter, struct inode *inode, loff_t pos,
+		loff_t len)
+{
+	memset(iter, 0, sizeof(*iter));
+	iter->inode = inode;
+	iter->pos = pos;
+	iter->len = len;
+	iter->flags = IOMAP_ZERO;
+}
+
 int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		const struct iomap_ops *ops)
 {
-	struct iomap_iter iter = {
-		.inode		= inode,
-		.pos		= pos,
-		.len		= len,
-		.flags		= IOMAP_ZERO,
-	};
+	struct iomap_iter iter;
+	struct address_space *mapping = inode->i_mapping;
+	unsigned int blocksize = i_blocksize(inode);
+	unsigned int off = pos & (blocksize - 1);
+	loff_t plen = min_t(loff_t, len, blocksize - off);
+	bool dirty;
 	int ret;
-	bool range_dirty;
+
+	iomap_iter_init(&iter, inode, pos, len);
 
 	/*
-	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
-	 * pagecache must be flushed to ensure stale data from previous
-	 * buffered writes is not exposed. A flush is only required for certain
-	 * types of mappings, but checking pagecache after mapping lookup is
-	 * racy with writeback and reclaim.
+	 * Zero range wants to skip mappings that are already zero on disk, but
+	 * the only way to handle unwritten mappings covered by dirty pagecache
+	 * is to flush and reprocess the converted mappings after I/O
+	 * completion.
 	 *
-	 * Therefore, check the entire range first and pass along whether any
-	 * part of it is dirty. If so and an underlying mapping warrants it,
-	 * flush the cache at that point. This trades off the occasional false
-	 * positive (and spurious flush, if the dirty data and mapping don't
-	 * happen to overlap) for simplicity in handling a relatively uncommon
-	 * situation.
+	 * The partial EOF zeroing use case is performance sensitive, so split
+	 * and handle an unaligned start of the range separately. The dirty
+	 * check tells the iter function whether it can skip or zero the folio
+	 * without needing to flush. Larger ranges tend to have already been
+	 * flushed by the filesystem, so flush the rest here as a safety measure
+	 * and process as normal.
 	 */
-	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
-					pos, pos + len - 1);
+	if (off &&
+	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
+		iter.len = plen;
+		while ((ret = iomap_iter(&iter, ops)) > 0)
+			iter.processed = iomap_zero_iter(&iter, did_zero);
+		iomap_iter_init(&iter, inode, iter.pos, len - (iter.pos - pos));
+		if (ret || !iter.len)
+			return ret;
+	}
+
+	dirty = filemap_range_needs_writeback(mapping, iter.pos,
+					iter.pos + iter.len - 1);
+	while ((ret = iomap_iter(&iter, ops)) > 0) {
+		const struct iomap *s = iomap_iter_srcmap(&iter);
+		if (!(s->type == IOMAP_HOLE || s->type == IOMAP_UNWRITTEN)) {
+			iter.processed = iomap_zero_iter(&iter, did_zero);
+			continue;
+		}
+		iter.processed = iomap_length(&iter);
+		if (dirty) {
+			dirty = false;
+			iter.processed = iomap_zero_iter_flush_and_stale(&iter);
+		}
+	}
 
-	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
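
For reference, the alignment split in the iomap_zero_range() rework above
boils down to the arithmetic below. This is a minimal userspace sketch, not
kernel code: the function name and printf()s are illustrative stand-ins for
the two processing passes, and a power-of-two block size is assumed (as is
always the case for filesystem block sizes):

#include <stdio.h>

/*
 * Split a zero range [pos, pos + len) at the first block boundary, the
 * same way the reworked iomap_zero_range() computes off and plen above.
 */
static void split_zero_range(long long pos, long long len,
			     unsigned int blocksize)
{
	unsigned int off = pos & (blocksize - 1);  /* offset into first block */
	long long plen = len;

	if (plen > blocksize - off)
		plen = blocksize - off;

	if (off) {
		/*
		 * Unaligned head: at most the remainder of the first block,
		 * handled with the per-folio dirty check (zero or skip).
		 */
		printf("head: pos=%lld len=%lld\n", pos, plen);
		pos += plen;
		len -= plen;
	}
	if (len)
		/* block-aligned remainder: flush_and_stale() path */
		printf("tail: pos=%lld len=%lld\n", pos, len);
}

int main(void)
{
	split_zero_range(4100, 10000, 4096);	/* unaligned start */
	split_zero_range(8192, 4096, 4096);	/* already aligned */
	return 0;
}

With a 4k block size, the first call splits into a 4092 byte head at 4100
and a block-aligned tail at 8192; the second call has no head at all. Since
the head is at most one block, the flush-avoiding dirty check only ever has
to consider a single folio.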
