This does look sane to me. How much testing did this get? Especially for the block size < page sie case? Also adding Dave as he has spent a lot of time on this code. On Tue, May 31, 2022 at 06:11:17PM -0700, Chris Mason wrote: > iomap_do_writepage() sends pages past i_size through > folio_redirty_for_writepage(), which normally isn't a problem because > truncate and friends clean them very quickly. > > When the system a variety of cgroups, ^^^ This sentence does not parse ^^^ > we can end up in situations where > one cgroup has almost no dirty pages at all. This is especially common > in our XFS workloads in production because they tend to use O_DIRECT for > everything. > > We've hit storms where the redirty path hits millions of times in a few > seconds, on all a single file that's only ~40 pages long. This ends up > leading to long tail latencies for file writes because the page reclaim > workers are hogging the CPU from some kworkers bound to the same CPU. > > That's the theory anyway. We know the storms exist, but the tail > latencies are so sporadic that it's hard to have any certainty about the > cause without patching a large number of boxes. > > There are a few different problems here. First is just that I don't > understand how invalidating the page instead of redirtying might upset > the rest of the iomap/xfs world. Btrfs invalidates in this case, which > seems like the right thing to me, but we all have slightly different > sharp corners in the truncate path so I thought I'd ask for comments. > > Second is the VM should take wbc->pages_skipped into account, or use > some other way to avoid looping over and over. I think we actually want > both but I wanted to understand the page invalidation path first. > > Signed-off-by: Chris Mason <clm@xxxxxx> > Reported-by: Domas Mituzas <domas@xxxxxx> > --- > fs/iomap/buffered-io.c | 13 +++++++------ > 1 file changed, 7 insertions(+), 6 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 8ce8720093b9..4a687a2a9ed9 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -1482,10 +1482,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) > pgoff_t end_index = isize >> PAGE_SHIFT; > > /* > - * Skip the page if it's fully outside i_size, e.g. due to a > - * truncate operation that's in progress. We must redirty the > - * page so that reclaim stops reclaiming it. Otherwise > - * iomap_vm_releasepage() is called on it and gets confused. > + * invalidate the page if it's fully outside i_size, e.g. > + * due to a truncate operation that's in progress. > * > * Note that the end_index is unsigned long. If the given > * offset is greater than 16TB on a 32-bit system then if we > @@ -1499,8 +1497,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) > * offset is just equal to the EOF. > */ > if (folio->index > end_index || > - (folio->index == end_index && poff == 0)) > - goto redirty; > + (folio->index == end_index && poff == 0)) { > + folio_invalidate(folio, 0, folio_size(folio)); > + goto unlock; > + } > > /* > * The page straddles i_size. It must be zeroed out on each > @@ -1518,6 +1518,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) > > redirty: > folio_redirty_for_writepage(wbc, folio); > +unlock: > folio_unlock(folio); > return 0; > } > -- > 2.30.2 > ---end quoted text---