[merged mm-stable] readahead-make-sure-sync-readahead-reads-needed-page.patch removed from -mm tree

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Fri, 28 Jun 2024 19:31:33 -0700

The quilt patch titled
     Subject: readahead: make sure sync readahead reads needed page
has been removed from the -mm tree.  Its filename was
     readahead-make-sure-sync-readahead-reads-needed-page.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Jan Kara <jack@xxxxxxx>
Subject: readahead: make sure sync readahead reads needed page
Date: Tue, 25 Jun 2024 12:18:51 +0200

Patch series "mm: Fix various readahead quirks".

When we were internally testing performance of recent kernels, we have
noticed quite variable performance of readahead arising from various
quirks in readahead code.  So I went on a cleaning spree there.  This is a
batch of patches resulting out of that.  A quick testing in my test VM
with the following fio job file:

[global]
direct=0
ioengine=sync
invalidate=1
blocksize=4k
size=10g
readwrite=read

[reader]
numjobs=1

shows that this patch series improves the throughput from variable one in
310-340 MB/s range to rather stable one at 350 MB/s.  As a side effect
these cleanups also address the issue noticed by Bruz Zhang [1].

[1] https://lore.kernel.org/all/20240618114941.5935-1-zhangpengpeng0808@xxxxxxxxx/

Zhang Peng reported:

: I test this batch of patch with fio, it indeed has a huge sppedup
: in sequential read when block size is 4KiB. The result as follow,
: for async read, iodepth is set to 128, and other settings
: are self-evident.
: 
: casenameÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â  upstreamÂ Â  withFix speedup
: ----------------Â Â Â Â Â Â Â  --------Â Â  -------- -------
: randread-4k-syncÂ Â Â Â Â Â Â  48991Â Â Â Â Â  47
: seqread-4k-syncÂ Â Â Â Â Â Â Â  1162758Â Â Â  14229
: seqread-1024k-syncÂ Â Â Â Â  1460208Â Â Â  1452522 
: randread-4k-libaioÂ Â Â Â Â  47467Â Â Â Â Â  4730
: randread-4k-posixaioÂ Â Â  49190Â Â Â Â Â  49512
: seqread-4k-libaioÂ Â Â Â Â Â  1085932Â Â Â  1234635
: seqread-1024k-libaioÂ Â Â  1423341Â Â Â  1402214 -1
: seqread-4k-posixaioÂ Â Â Â  1165084Â Â Â  1369613 1
: seqread-1024k-posixaioÂ  1435422Â Â Â  1408808 -1.8


This patch (of 10):

page_cache_sync_ra() is called when a folio we want to read is not in the
page cache.  It is expected that it creates the folio (and perhaps the
following folios as well) and submits reads for them unless some error
happens.  However if index == ra->start + ra->size, ondemand_readahead()
will treat the call as another async readahead hit.  Thus ra->start will
be advanced and we create pages and queue reads from ra->start + ra->size
further.  Consequentially the page at 'index' is not created and
filemap_get_pages() has to always go through filemap_create_folio() path.

This behavior has particularly unfortunate consequences when we have two
IO threads sequentially reading from a shared file (as is the case when
NFS serves sequential reads).  In that case what can happen is:

suppose ra->size == ra->async_size == 128, ra->start = 512

T1					T2
reads 128 pages at index 512
  - hits async readahead mark
    filemap_readahead()
      ondemand_readahead()
        if (index == expected ...)
	  ra->start = 512 + 128 = 640
          ra->size = 128
	  ra->async_size = 128
	page_cache_ra_order()
	  blocks in ra_alloc_folio()
					reads 128 pages at index 640
					  - no page found
					  page_cache_sync_readahead()
					    ondemand_readahead()
					      if (index == expected ...)
						ra->start = 640 + 128 = 768
						ra->size = 128
						ra->async_size = 128
					    page_cache_ra_order()
					      submits reads from 768
					  - still no page found at index 640
					    filemap_create_folio()
					  - goes on to index 641
					  page_cache_sync_readahead()
					    ondemand_readahead()
					      - founds ra is confused,
					        trims is to small size
  	  finds pages were already inserted

And as a result read performance suffers.

Fix the problem by triggering async readahead case in ondemand_readahead()
only if we are calling the function because we hit the readahead marker. 
In any other case we need to read the folio at 'index' and thus we cannot
really use the current ra state.

Note that the above situation could be viewed as a special case of
file->f_ra state corruption.  In fact two thread reading using the shared
file can also seemingly corrupt file->f_ra in interesting ways due to
concurrent access.  I never saw that in practice and the fix is going to
be much more complex so for now at least fix this practical problem while
we ponder about the theoretically correct solution.

Link: https://lkml.kernel.org/r/20240625100859.15507-1-jack@xxxxxxx
Link: https://lkml.kernel.org/r/20240625101909.12234-1-jack@xxxxxxx
Signed-off-by: Jan Kara <jack@xxxxxxx>
Reviewed-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
Tested-by: Zhang Peng <zhangpengpeng0808@xxxxxxxxx>
Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/readahead.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/readahead.c~readahead-make-sure-sync-readahead-reads-needed-page
+++ a/mm/readahead.c
@@ -580,7 +580,7 @@ static void ondemand_readahead(struct re
 	 */
 	expected = round_down(ra->start + ra->size - ra->async_size,
 			1UL << order);
-	if (index == expected || index == (ra->start + ra->size)) {
+	if (folio && index == expected) {
 		ra->start += ra->size;
 		ra->size = get_next_ra_size(ra, max_pages);
 		ra->async_size = ra->size;
_

Patches currently in -mm which might be from jack@xxxxxxx are