On Wed, Aug 28, 2024 at 02:19:11PM -0400, Brian Foster wrote:
> iomap_zero_range() flushes pagecache to mitigate consistency
> problems with dirty pagecache and unwritten mappings. The flush is
> unconditional over the entire range because checking pagecache state
> after mapping lookup is racy with writeback and reclaim. There are
> ways around this using iomap's mapping revalidation mechanism, but
> this is not supported by all iomap based filesystems and so is not a
> generic solution.
> 
> There is another way around this limitation that is good enough to
> filter the flush for most cases in practice. If we check for dirty
> pagecache over the target range (instead of flushing unconditionally),
> we can keep track of whether the range was dirty before lookup and
> defer the flush until/unless we see a combination of dirty cache
> backed by an unwritten mapping. We don't necessarily know whether
> the dirty cache was backed by the unwritten mapping or some other
> (written) part of the range, but the implication of a false
> positive here is a spurious flush and thus relatively harmless.
> 
> Note that we also flush for hole mappings because iomap_zero_range()
> is used for partial folio zeroing in some cases. For example, if a
> folio straddles EOF on a sub-page FSB size fs, the post-eof portion
> is hole-backed and dirtied/written via mapped write, and then i_size
> increases before writeback can occur (which otherwise zeroes the
> post-eof portion of the EOF folio), then the folio becomes
> inconsistent with disk until reclaimed. A flush in this case
> executes partial zeroing from writeback, and iomap knows that there
> is otherwise no I/O to submit for hole-backed mappings.
> 
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
>  fs/iomap/buffered-io.c | 57 +++++++++++++++++++++++++++++++++++-------
>  1 file changed, 48 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 3e846f43ff48..a6e897e6e303 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1393,16 +1393,47 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_unshare);
>  
> -static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> +/*
> + * Flush the remaining range of the iter and mark the current mapping stale.
> + * This is used when zero range sees an unwritten mapping that may have had
> + * dirty pagecache over it.
> + */
> +static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
> +{
> +        struct address_space *mapping = i->inode->i_mapping;
> +        loff_t end = i->pos + i->len - 1;
> +
> +        i->iomap.flags |= IOMAP_F_STALE;
> +        return filemap_write_and_wait_range(mapping, i->pos, end);
> +}
> +
> +static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> +                bool *range_dirty)
>  {
>          const struct iomap *srcmap = iomap_iter_srcmap(iter);
>          loff_t pos = iter->pos;
>          loff_t length = iomap_length(iter);
>          loff_t written = 0;
>  
> -        /* already zeroed? we're done. */
> -        if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +        /*
> +         * We can skip pre-zeroed mappings so long as either the mapping was
> +         * clean before we started or we've flushed at least once since.
> +         * Otherwise we don't know whether the current mapping had dirty
> +         * pagecache, so flush it now, stale the current mapping, and proceed
> +         * from there.
> +         *
> +         * The hole case is intentionally included because this is (ab)used to
> +         * handle partial folio zeroing in some cases. Hole backed post-eof
> +         * ranges can be dirtied via mapped write and the flush triggers
> +         * writeback time post-eof zeroing.
> +         */
> +        if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> +                if (*range_dirty) {
> +                        *range_dirty = false;
> +                        return iomap_zero_iter_flush_and_stale(iter);
> +                }
>                  return length;
> +        }
>  
>          do {
>                  struct folio *folio;
> @@ -1450,19 +1481,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>                  .flags          = IOMAP_ZERO,
>          };
>          int ret;
> +        bool range_dirty;
>  
>          /*
>           * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
>           * pagecache must be flushed to ensure stale data from previous
> -         * buffered writes is not exposed.
> +         * buffered writes is not exposed. A flush is only required for certain
> +         * types of mappings, but checking pagecache after mapping lookup is
> +         * racy with writeback and reclaim.
> +         *
> +         * Therefore, check the entire range first and pass along whether any
> +         * part of it is dirty. If so and an underlying mapping warrants it,
> +         * flush the cache at that point. This trades off the occasional false
> +         * positive (and spurious flush, if the dirty data and mapping don't
> +         * happen to overlap) for simplicity in handling a relatively uncommon
> +         * situation.
>           */
> -        ret = filemap_write_and_wait_range(inode->i_mapping,
> -                        pos, pos + len - 1);
> -        if (ret)
> -                return ret;
> +        range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> +                        pos, pos + len - 1);
>  
>          while ((ret = iomap_iter(&iter, ops)) > 0)
> -                iter.processed = iomap_zero_iter(&iter, did_zero);
> +                iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);

Style nit: Could we do this flush-and-stale from the loop body instead of
passing pointers around?  e.g.

static inline bool iomap_zero_need_flush(const struct iomap_iter *i)
{
        const struct iomap *srcmap = iomap_iter_srcmap(i);

        return srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN;
}

static inline int iomap_zero_iter_flush(struct iomap_iter *i)
{
        struct address_space *mapping = i->inode->i_mapping;
        loff_t end = i->pos + i->len - 1;

        i->iomap.flags |= IOMAP_F_STALE;
        return filemap_write_and_wait_range(mapping, i->pos, end);
}

and then:

        range_dirty = filemap_range_needs_writeback(...);

        while ((ret = iomap_iter(&iter, ops)) > 0) {
                if (range_dirty && iomap_zero_need_flush(&iter)) {
                        /*
                         * Zero range wants to skip pre-zeroed (i.e.
                         * unwritten) mappings, but...
                         */
                        range_dirty = false;
                        iter.processed = iomap_zero_iter_flush(&iter);
                } else {
                        iter.processed = iomap_zero_iter(&iter, did_zero);
                }
        }

The logic looks correct and sensible. :)

--D

>          return ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_zero_range);
> -- 
> 2.45.0
> 
> 
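
To make the EOF scenario from the commit message above concrete, here is a
rough userspace sketch of the sequence it describes. This is illustrative
only and not part of the patch: it assumes a filesystem block size smaller
than the page/folio size (e.g. 1k blocks with 4k pages), and the file name
and offsets are made up.

/*
 * Userspace sketch of the "folio straddles EOF on a sub-page FSB fs" case.
 * Assumes 1k filesystem blocks and 4k pages; purely illustrative.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);
        char *p;

        if (fd < 0)
                return 1;

        /* i_size = 1536: the first 4k folio straddles EOF; with 1k blocks
         * the tail of that folio is hole backed (no preallocation assumed). */
        if (ftruncate(fd, 1536) < 0)
                return 1;

        /* dirty part of the post-EOF tail of the EOF folio via a mapped write */
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        memset(p + 2048, 0xab, 512);

        /*
         * Grow i_size before writeback gets a chance to run. This is a path
         * where iomap_zero_range() can be invoked for the old EOF folio; the
         * commit message argues the flush must still happen for the dirty,
         * hole-backed case so that writeback-time zeroing of the post-EOF
         * bytes occurs before they end up inside the new i_size.
         */
        if (ftruncate(fd, 3072) < 0)
                return 1;

        munmap(p, 4096);
        close(fd);
        return 0;
}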