On Tue, Jun 07, 2022 at 05:42:29PM -0700, Chris Mason wrote: > iomap_do_writepage() sends pages past i_size through > folio_redirty_for_writepage(), which normally isn't a problem because > truncate and friends clean them very quickly. > > When the system has cgroups configured, we can end up in situations > where one cgroup has almost no dirty pages at all, and other cgroups > consume the entire background dirty limit. This is especially common in > our XFS workloads in production because they have cgroups using O_DIRECT > for almost all of the IO mixed in with cgroups that do more traditional > buffered IO work. > > We've hit storms where the redirty path hits millions of times in a few > seconds, on all a single file that's only ~40 pages long. This leads to > long tail latencies for file writes because the pdflush workers are > hogging the CPU from some kworkers bound to the same CPU. > > Reproducing this on 5.18 was tricky because 869ae85dae ("xfs: flush new > eof page on truncate...") ends up writing/waiting most of these dirty pages > before truncate gets a chance to wait on them. That commit went into 5.10, so this would mean it's not easily reproducable on kernels released since then? > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 8ce8720093b9..64d1476c457d 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -1482,10 +1482,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) > pgoff_t end_index = isize >> PAGE_SHIFT; > > /* > - * Skip the page if it's fully outside i_size, e.g. due to a > - * truncate operation that's in progress. We must redirty the > - * page so that reclaim stops reclaiming it. Otherwise > - * iomap_vm_releasepage() is called on it and gets confused. > + * Skip the page if it's fully outside i_size, e.g. > + * due to a truncate operation that's in progress. We've > + * cleaned this page and truncate will finish things off for > + * us. > * > * Note that the end_index is unsigned long. If the given > * offset is greater than 16TB on a 32-bit system then if we > @@ -1500,7 +1500,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) > */ > if (folio->index > end_index || > (folio->index == end_index && poff == 0)) > - goto redirty; > + goto unlock; > > /* > * The page straddles i_size. It must be zeroed out on each > @@ -1518,6 +1518,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data) > > redirty: > folio_redirty_for_writepage(wbc, folio); > +unlock: > folio_unlock(folio); > return 0; > } Regardless, the change looks fine. Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx> Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx