Re: [PATCH v2] iomap: skip pages past eof in iomap_do_writepage()

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 9 Jun 2022 10:53:13 +1000

On Tue, Jun 07, 2022 at 05:42:29PM -0700, Chris Mason wrote:
> iomap_do_writepage() sends pages past i_size through
> folio_redirty_for_writepage(), which normally isn't a problem because
> truncate and friends clean them very quickly.
> 
> When the system has cgroups configured, we can end up in situations
> where one cgroup has almost no dirty pages at all, and other cgroups
> consume the entire background dirty limit.  This is especially common in
> our XFS workloads in production because they have cgroups using O_DIRECT
> for almost all of the IO mixed in with cgroups that do more traditional
> buffered IO work.
> 
> We've hit storms where the redirty path hits millions of times in a few
> seconds, on all a single file that's only ~40 pages long.  This leads to
> long tail latencies for file writes because the pdflush workers are
> hogging the CPU from some kworkers bound to the same CPU.
> 
> Reproducing this on 5.18 was tricky because 869ae85dae ("xfs: flush new
> eof page on truncate...") ends up writing/waiting most of these dirty pages
> before truncate gets a chance to wait on them.

That commit went into 5.10, so this would mean it's not easily
reproducable on kernels released since then?

> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8ce8720093b9..64d1476c457d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1482,10 +1482,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  		pgoff_t end_index = isize >> PAGE_SHIFT;
>  
>  		/*
> -		 * Skip the page if it's fully outside i_size, e.g. due to a
> -		 * truncate operation that's in progress. We must redirty the
> -		 * page so that reclaim stops reclaiming it. Otherwise
> -		 * iomap_vm_releasepage() is called on it and gets confused.
> +		 * Skip the page if it's fully outside i_size, e.g.
> +		 * due to a truncate operation that's in progress.  We've
> +		 * cleaned this page and truncate will finish things off for
> +		 * us.
>  		 *
>  		 * Note that the end_index is unsigned long.  If the given
>  		 * offset is greater than 16TB on a 32-bit system then if we
> @@ -1500,7 +1500,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  		 */
>  		if (folio->index > end_index ||
>  		    (folio->index == end_index && poff == 0))
> -			goto redirty;
> +			goto unlock;
>  
>  		/*
>  		 * The page straddles i_size.  It must be zeroed out on each
> @@ -1518,6 +1518,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  
>  redirty:
>  	folio_redirty_for_writepage(wbc, folio);
> +unlock:
>  	folio_unlock(folio);
>  	return 0;
>  }

Regardless, the change looks fine.

Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx