Re: [PATCH RFC] iomap: invalidate pages past eof in iomap_do_writepage()

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Wed, 1 Jun 2022 05:18:49 -0700

This does look sane to me.  How much testing did this get?  Especially
for the block size < page sie case?  Also adding Dave as he has spent
a lot of time on this code.

On Tue, May 31, 2022 at 06:11:17PM -0700, Chris Mason wrote:
> iomap_do_writepage() sends pages past i_size through
> folio_redirty_for_writepage(), which normally isn't a problem because
> truncate and friends clean them very quickly.
> 
> When the system a variety of cgroups,

^^^ This sentence does not parse ^^^

> we can end up in situations where
> one cgroup has almost no dirty pages at all.  This is especially common
> in our XFS workloads in production because they tend to use O_DIRECT for
> everything.
> 
> We've hit storms where the redirty path hits millions of times in a few
> seconds, on all a single file that's only ~40 pages long.  This ends up
> leading to long tail latencies for file writes because the page reclaim
> workers are hogging the CPU from some kworkers bound to the same CPU.
> 
> That's the theory anyway.  We know the storms exist, but the tail
> latencies are so sporadic that it's hard to have any certainty about the
> cause without patching a large number of boxes.
> 
> There are a few different problems here.  First is just that I don't
> understand how invalidating the page instead of redirtying might upset
> the rest of the iomap/xfs world.  Btrfs invalidates in this case, which
> seems like the right thing to me, but we all have slightly different
> sharp corners in the truncate path so I thought I'd ask for comments.
> 
> Second is the VM should take wbc->pages_skipped into account, or use
> some other way to avoid looping over and over.  I think we actually want
> both but I wanted to understand the page invalidation path first.
> 
> Signed-off-by: Chris Mason <clm@xxxxxx>
> Reported-by: Domas Mituzas <domas@xxxxxx>
> ---
>  fs/iomap/buffered-io.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8ce8720093b9..4a687a2a9ed9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1482,10 +1482,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  		pgoff_t end_index = isize >> PAGE_SHIFT;
>  
>  		/*
> -		 * Skip the page if it's fully outside i_size, e.g. due to a
> -		 * truncate operation that's in progress. We must redirty the
> -		 * page so that reclaim stops reclaiming it. Otherwise
> -		 * iomap_vm_releasepage() is called on it and gets confused.
> +		 * invalidate the page if it's fully outside i_size, e.g.
> +		 * due to a truncate operation that's in progress.
>  		 *
>  		 * Note that the end_index is unsigned long.  If the given
>  		 * offset is greater than 16TB on a 32-bit system then if we
> @@ -1499,8 +1497,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  		 * offset is just equal to the EOF.
>  		 */
>  		if (folio->index > end_index ||
> -		    (folio->index == end_index && poff == 0))
> -			goto redirty;
> +		    (folio->index == end_index && poff == 0)) {
> +			folio_invalidate(folio, 0, folio_size(folio));
> +			goto unlock;
> +		}
>  
>  		/*
>  		 * The page straddles i_size.  It must be zeroed out on each
> @@ -1518,6 +1518,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  
>  redirty:
>  	folio_redirty_for_writepage(wbc, folio);
> +unlock:
>  	folio_unlock(folio);
>  	return 0;
>  }
> -- 
> 2.30.2
> 
---end quoted text---