Re: [PATCH 6.1 036/197] readahead: avoid multiple marked readahead pages

On Tue, Feb 20, 2024 at 09:49:55PM +0100, Greg Kroah-Hartman wrote:
> 6.1-stable review patch.  If anyone has any objections, please let me know.

Maybe hold off on this one?

kernel test robot noticed a -21.4% regression of vm-scalability.throughput on:
commit: ab4443fe3ca6298663a55c4a70efc6c3ce913ca6 ("readahead: avoid multiple marked readahead pages")

https://lore.kernel.org/linux-fsdevel/202402201642.c8d6bbc3-oliver.sang@xxxxxxxxx/

Not a definite no just yet; nobody has dug into the regression, but
some caution seems warranted.

> ------------------
> 
> From: Jan Kara <jack@xxxxxxx>
> 
> commit ab4443fe3ca6298663a55c4a70efc6c3ce913ca6 upstream.
> 
> ra_alloc_folio() marks the page that should trigger the next round of
> async readahead.  However, it rounds the computed index up to the order
> of the page being allocated, which can lead to multiple consecutive
> pages being marked with the readahead flag.  Consider the situation
> with index == 1, mark == 1, order == 0.  We insert an order-0 page at
> index 1 and mark it.  Then we bump order to 1 and index to 2; mark
> (still == 1) is rounded up to 2, so the page at index 2 is marked as
> well.  Then we bump order to 2 and index to 4; mark is rounded up to 4,
> so the page at index 4 is marked as well.  Multiple pages marked within
> a single readahead window confuse the readahead logic and result in the
> readahead window being trimmed back to 1.  This is triggered in
> particular when the maximum readahead window size is not a power of two
> (in the observed case it was 768 KB), and sequential read throughput
> suffers as a result.
> 
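As a sanity check on the walk-through above, here is a minimal
userspace sketch (my own simplified round_up()/round_down() macros,
not the kernel's) replaying the index/order/mark sequence with both
roundings:

    /* Which folio indices end up with the readahead mark? */
    #include <stdio.h>

    #define round_up(x, y)   ((((x) + (y) - 1) / (y)) * (y))
    #define round_down(x, y) (((x) / (y)) * (y))

    int main(void)
    {
        unsigned long mark = 1;     /* index that should carry the flag */
        unsigned long index, order;

        /* order-0 folio at index 1, order-1 at 2, order-2 at 4 */
        for (index = 1, order = 0; index < 8; index += 1UL << order, order++) {
            if (round_up(mark, 1UL << order) == index)
                printf("round_up:   marked index %lu\n", index);
            if (round_down(mark, 1UL << order) == index)
                printf("round_down: marked index %lu\n", index);
        }
        return 0;
    }

round_up() marks indices 1, 2 and 4 (three folios in one window);
round_down() marks only index 1.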
> Fix the problem by rounding 'mark' down instead of up.  Because the
> index is naturally aligned to 'order', we are guaranteed that the
> rounded mark == index iff 'mark' falls within the page we are
> allocating at 'index'.  Thus exactly one page is marked with the
> readahead flag, as the readahead code requires, and sequential read
> performance is restored.
> 
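And the alignment argument can be checked exhaustively for small
orders (same caveat: a userspace sketch with a simplified macro):

    /* For index aligned to (1 << order):
     *   round_down(mark, 1 << order) == index
     * holds iff index <= mark < index + (1 << order).  */
    #include <assert.h>

    #define round_down(x, y) (((x) / (y)) * (y))

    int main(void)
    {
        unsigned long order, index, mark;

        for (order = 0; order <= 4; order++)
            for (index = 0; index < 64; index += 1UL << order)
                for (mark = 0; mark < 64; mark++)
                    assert((round_down(mark, 1UL << order) == index) ==
                           (mark >= index && mark < index + (1UL << order)));
        return 0;
    }

i.e. with round_down() exactly one folio in the window can match
'mark'.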
> This effectively reverts part of commit b9ff43dd2743 ("mm/readahead: Fix
> readahead with large folios").  The commit changed the rounding with the
> rationale:
> 
> "...  we were setting the readahead flag on the folio which contains the
> last byte read from the block.  This is wrong because we will trigger
> readahead at the end of the read without waiting to see if a subsequent
> read is going to use the pages we just read."
> 
> Although this is true, it was always the case for read sizes not
> aligned to folio boundaries; large folios in the page cache just make
> the situation more obvious (and frequent).  Also, for sequential read
> workloads it is better to trigger readahead earlier rather than later.
> It is true that the difference in rounding, and thus the earlier
> triggering of readahead, can result in reading more for semi-random
> workloads.  However, workloads that really suffer from this seem to be
> rare.  In particular, I have verified that the workload described in
> commit b9ff43dd2743 ("mm/readahead: Fix readahead with large folios"),
> reading random 100k blocks from a file like:
> 
> [reader]
> bs=100k
> rw=randread
> numjobs=1
> size=64g
> runtime=60s
> 
> is not impacted by the rounding change and achieves ~70 MB/s in both cases.
> 
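(If anyone wants to reproduce that: the job file above can be fed to
fio as-is, e.g. saved as reader.fio and run with "fio reader.fio".
With no filename= option fio lays out its own 64g test file in the
current directory, so run it on the filesystem you care about; the
absolute numbers will of course depend on the hardware.)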
> [jack@xxxxxxx: fix one more place where mark rounding was done as well]
>   Link: https://lkml.kernel.org/r/20240123153254.5206-1-jack@xxxxxxx
> Link: https://lkml.kernel.org/r/20240104085839.21029-1-jack@xxxxxxx
> Fixes: b9ff43dd2743 ("mm/readahead: Fix readahead with large folios")
> Signed-off-by: Jan Kara <jack@xxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: Guo Xuenan <guoxuenan@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> ---
>  mm/readahead.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -483,7 +483,7 @@ static inline int ra_alloc_folio(struct
>  
>  	if (!folio)
>  		return -ENOMEM;
> -	mark = round_up(mark, 1UL << order);
> +	mark = round_down(mark, 1UL << order);
>  	if (index == mark)
>  		folio_set_readahead(folio);
>  	err = filemap_add_folio(ractl->mapping, folio, index, gfp);
> @@ -591,7 +591,7 @@ static void ondemand_readahead(struct re
>  	 * It's the expected callback index, assume sequential access.
>  	 * Ramp up sizes, and push forward the readahead window.
>  	 */
> -	expected = round_up(ra->start + ra->size - ra->async_size,
> +	expected = round_down(ra->start + ra->size - ra->async_size,
>  			1UL << order);
>  	if (index == expected || index == (ra->start + ra->size)) {
>  		ra->start += ra->size;
> 
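For what it's worth, the second hunk matters for the same reason; a
quick illustration with made-up numbers (not taken from the robot's
report): suppose ra->start + ra->size - ra->async_size == 193 and
order == 2.  The order-2 folio carrying the readahead mark starts at
index 192, so

    round_up(193, 4)   == 196   /* can never equal index 192 */
    round_down(193, 4) == 192   /* matches the marked folio   */

and only the round_down() variant keeps us on the "expected
sequential" path when the marker is not already folio-aligned.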