iomap_do_writepage() sends pages past i_size through folio_redirty_for_writepage(), which normally isn't a problem because truncate and friends clean them very quickly. When the system a variety of cgroups, we can end up in situations where one cgroup has almost no dirty pages at all. This is especially common in our XFS workloads in production because they tend to use O_DIRECT for everything. We've hit storms where the redirty path hits millions of times in a few seconds, on all a single file that's only ~40 pages long. This ends up leading to long tail latencies for file writes because the page reclaim workers are hogging the CPU from some kworkers bound to the same CPU. That's the theory anyway. We know the storms exist, but the tail latencies are so sporadic that it's hard to have any certainty about the cause without patching a large number of boxes. There are a few different problems here. First is just that I don't understand how invalidating the page instead of redirtying might upset the rest of the iomap/xfs world. Btrfs invalidates in this case, which seems like the right thing to me, but we all have slightly different sharp corners in the truncate path so I thought I'd ask for comments. Second is the VM should take wbc->pages_skipped into account, or use some other way to avoid looping over and over. I think we actually want both but I wanted to understand the page invalidation path first. Signed-off-by: Chris Mason <clm@xxxxxx> Reported-by: Domas Mituzas <domas@xxxxxx> --- fs/iomap/buffered-io.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 8ce8720093b9..4a687a2a9ed9 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1482,10 +1482,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) pgoff_t end_index = isize >> PAGE_SHIFT; /* - * Skip the page if it's fully outside i_size, e.g. due to a - * truncate operation that's in progress. We must redirty the - * page so that reclaim stops reclaiming it. Otherwise - * iomap_vm_releasepage() is called on it and gets confused. + * invalidate the page if it's fully outside i_size, e.g. + * due to a truncate operation that's in progress. * * Note that the end_index is unsigned long. If the given * offset is greater than 16TB on a 32-bit system then if we @@ -1499,8 +1497,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) * offset is just equal to the EOF. */ if (folio->index > end_index || - (folio->index == end_index && poff == 0)) - goto redirty; + (folio->index == end_index && poff == 0)) { + folio_invalidate(folio, 0, folio_size(folio)); + goto unlock; + } /* * The page straddles i_size. It must be zeroed out on each @@ -1518,6 +1518,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) redirty: folio_redirty_for_writepage(wbc, folio); +unlock: folio_unlock(folio); return 0; } -- 2.30.2