On Fri, May 31, 2024 at 07:03:58AM -0700, Darrick J. Wong wrote: > On Fri, May 31, 2024 at 06:11:25AM -0700, Christoph Hellwig wrote: > > On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote: > > > XXX: how do we detect a iomap containing a cow mapping over a hole > > > in iomap_zero_iter()? The XFS code implies this case also needs to > > > zero the page cache if there is data present, so trigger for page > > > cache lookup only in iomap_zero_iter() needs to handle this case as > > > well. > > > > If there is no data in the page cache and either a whole or unwritten > > extent it really should not matter what is in the COW fork, a there > > obviously isn't any data we could zero. > > > > If there is data in the page cache for something that is marked as > > a hole in the srcmap, but we have data in the COW fork due to > > COW extsize preallocation we'd need to zero it, but as the > > xfs iomap ops don't return a separate srcmap for that case we > > should be fine. Or am I missing something? > > It might be useful to skip the scan for dirty pagecache if both forks > have holes, since (in theory) that's never possible on xfs. > > OTOH maybe there are filesystems that allow dirty pagecache over a hole? > IIRC there was a case where dirty cache can exist over what is reported as a hole to zero range. I want to say it was something like a COW prealloc over a data fork hole followed by a buffered write and then a zero range, but I don't recall the details. That is all something that should be fixed on the lookup side anyways. Brian > > > + * Note: when zeroing unwritten extents, we might have data in the page cache > > > + * over an unwritten extent. In this case, we want to do a pure lookup on the > > > + * page cache and not create a new folio as we don't need to perform zeroing on > > > + * unwritten extents if there is no cached data over the given range. > > > */ > > > struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len) > > > { > > > fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS; > > > > > > + if (iter->flags & IOMAP_ZERO) { > > > + const struct iomap *srcmap = iomap_iter_srcmap(iter); > > > + > > > + if (srcmap->type == IOMAP_UNWRITTEN) > > > + fgp &= ~FGP_CREAT; > > > + } > > > > Nit: The comment would probably stand out a little better if it was > > right next to the IOMAP_ZERO conditional instead of above the > > function. > > Agreed. > > > > + if (status) { > > > + if (status == -ENOENT) { > > > + /* > > > + * Unwritten extents need to have page cache > > > + * lookups done to determine if they have data > > > + * over them that needs zeroing. If there is no > > > + * data, we'll get -ENOENT returned here, so we > > > + * can just skip over this index. > > > + */ > > > + WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN); > > > > I'd return -EIO if the WARN_ON triggers. > > > > > +loop_continue: > > > > While I'm no strange to gotos for loop control something trips me > > up about jumping to the end of the loop. Here is what I could come > > up with instead. Not arguing it's objectively better, but I somehow > > like it a little better: > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > index 700b22d6807783..81378f7cd8d7ff 100644 > > --- a/fs/iomap/buffered-io.c > > +++ b/fs/iomap/buffered-io.c > > @@ -1412,49 +1412,56 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > > bool ret; > > > > status = iomap_write_begin(iter, pos, bytes, &folio); > > - if (status) { > > - if (status == -ENOENT) { > > - /* > > - * Unwritten extents need to have page cache > > - * lookups done to determine if they have data > > - * over them that needs zeroing. If there is no > > - * data, we'll get -ENOENT returned here, so we > > - * can just skip over this index. > > - */ > > - WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN); > > - if (bytes > PAGE_SIZE - offset_in_page(pos)) > > - bytes = PAGE_SIZE - offset_in_page(pos); > > - goto loop_continue; > > - } > > + if (status && status != -ENOENT) > > return status; > > - } > > - if (iter->iomap.flags & IOMAP_F_STALE) > > - break; > > > > - offset = offset_in_folio(folio, pos); > > - if (bytes > folio_size(folio) - offset) > > - bytes = folio_size(folio) - offset; > > + if (status == -ENOENT) { > > + /* > > + * If we end up here, we did not find a folio in the > > + * page cache for an unwritten extent and thus can > > + * skip over the range. > > + */ > > + if (WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN)) > > + return -EIO; > > > > - /* > > - * If the folio over an unwritten extent is clean (i.e. because > > - * it has been read from), then it already contains zeros. Hence > > - * we can just skip it. > > - */ > > - if (srcmap->type == IOMAP_UNWRITTEN && > > - !folio_test_dirty(folio)) { > > - folio_unlock(folio); > > - goto loop_continue; > > + /* > > + * XXX: It would be nice if we could get the offset of > > + * the next entry in the pagecache so that we don't have > > + * to iterate one page at a time here. > > + */ > > + offset = offset_in_page(pos); > > + if (bytes > PAGE_SIZE - offset) > > + bytes = PAGE_SIZE - offset; > > Why is it PAGE_SIZE here and not folio_size() like below? > > (I know you're just copying the existing code; I'm merely wondering if > this is some minor bug.) > > --D > > > + } else { > > + if (iter->iomap.flags & IOMAP_F_STALE) > > + break; > > + > > + offset = offset_in_folio(folio, pos); > > + if (bytes > folio_size(folio) - offset) > > + bytes = folio_size(folio) - offset; > > + > > + /* > > + * If the folio over an unwritten extent is clean (i.e. > > + * because it has only been read from), then it already > > + * contains zeros. Hence we can just skip it. > > + */ > > + if (srcmap->type == IOMAP_UNWRITTEN && > > + !folio_test_dirty(folio)) { > > + folio_unlock(folio); > > + status = -ENOENT; > > + } > > } > > > > - folio_zero_range(folio, offset, bytes); > > - folio_mark_accessed(folio); > > + if (status != -ENOENT) { > > + folio_zero_range(folio, offset, bytes); > > + folio_mark_accessed(folio); > > > > - ret = iomap_write_end(iter, pos, bytes, bytes, folio); > > - __iomap_put_folio(iter, pos, bytes, folio); > > - if (WARN_ON_ONCE(!ret)) > > - return -EIO; > > + ret = iomap_write_end(iter, pos, bytes, bytes, folio); > > + __iomap_put_folio(iter, pos, bytes, folio); > > + if (WARN_ON_ONCE(!ret)) > > + return -EIO; > > + } > > > > -loop_continue: > > pos += bytes; > > length -= bytes; > > written += bytes; > > >