The patch titled
     Subject: readahead: make sure sync readahead reads needed page
has been added to the -mm mm-unstable branch.  Its filename is
     readahead-make-sure-sync-readahead-reads-needed-page.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/readahead-make-sure-sync-readahead-reads-needed-page.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Jan Kara <jack@xxxxxxx>
Subject: readahead: make sure sync readahead reads needed page
Date: Tue, 25 Jun 2024 12:18:51 +0200

Patch series "mm: Fix various readahead quirks".

When we were internally testing performance of recent kernels, we noticed
quite variable readahead performance arising from various quirks in the
readahead code.  So I went on a cleaning spree there.  This is a batch of
patches resulting from that.  A quick test in my test VM with the
following fio job file:

[global]
direct=0
ioengine=sync
invalidate=1
blocksize=4k
size=10g
readwrite=read

[reader]
numjobs=1

shows that this patch series improves the throughput from a variable one
in the 310-340 MB/s range to a rather stable one at 350 MB/s.  As a side
effect these cleanups also address the issue noticed by Bruz Zhang [1].

[1] https://lore.kernel.org/all/20240618114941.5935-1-zhangpengpeng0808@xxxxxxxxx/


This patch (of 10):

page_cache_sync_ra() is called when a folio we want to read is not in the
page cache.  It is expected that it creates the folio (and perhaps the
following folios as well) and submits reads for them unless some error
happens.  However if index == ra->start + ra->size, ondemand_readahead()
will treat the call as another async readahead hit.  Thus ra->start will
be advanced and we create pages and queue reads from ra->start + ra->size
onwards.  Consequently the page at 'index' is not created and
filemap_get_pages() always has to go through the filemap_create_folio()
path.

This behavior has particularly unfortunate consequences when we have two
IO threads sequentially reading from a shared file (as is the case when
NFS serves sequential reads).  In that case what can happen is:

suppose ra->size == ra->async_size == 128, ra->start = 512

T1                                      T2
reads 128 pages at index 512
  - hits async readahead mark
    filemap_readahead()
      ondemand_readahead()
        if (index == expected ...)
          ra->start = 512 + 128 = 640
          ra->size = 128
          ra->async_size = 128
          page_cache_ra_order()
            blocks in ra_alloc_folio()
                                        reads 128 pages at index 640
                                          - no page found
                                            page_cache_sync_readahead()
                                              ondemand_readahead()
                                                if (index == expected ...)
                                                  ra->start = 640 + 128 = 768
                                                  ra->size = 128
                                                  ra->async_size = 128
                                                  page_cache_ra_order()
                                                    submits reads from 768
                                          - still no page found at index 640
                                            filemap_create_folio()
                                          - goes on to index 641
                                            page_cache_sync_readahead()
                                              ondemand_readahead()
                                                - finds ra is confused,
                                                  trims it to small size
                                            finds pages were already inserted

And as a result read performance suffers.
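For context, the pre-patch check in ondemand_readahead() looks roughly
like the following condensed sketch (simplified from mm/readahead.c as
quoted in the diff below; comments are mine, not verbatim kernel source):

	/*
	 * Condensed sketch of the pre-patch logic in ondemand_readahead();
	 * see the diff below for the exact lines that change.
	 */
	expected = round_down(ra->start + ra->size - ra->async_size,
			      1UL << order);
	if (index == expected || index == (ra->start + ra->size)) {
		/*
		 * Taken for a sequential continuation: the window is
		 * advanced past its old end.  On the sync path the second
		 * condition can match even though no read has been queued
		 * for the folio at 'index', so that folio is never
		 * created here.
		 */
		ra->start += ra->size;
		ra->size = get_next_ra_size(ra, max_pages);
		ra->async_size = ra->size;
		goto readit;
	}
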
Fix the problem by triggering the async readahead case in
ondemand_readahead() only if we are calling the function because we hit
the readahead marker.  In any other case we need to read the folio at
'index' and thus we cannot really use the current ra state.

Note that the above situation could be viewed as a special case of
file->f_ra state corruption.  In fact two threads reading through the
shared file can also seemingly corrupt file->f_ra in interesting ways due
to concurrent access.  I never saw that in practice and the fix is going
to be much more complex so for now at least fix this practical problem
while we ponder the theoretically correct solution.

Link: https://lkml.kernel.org/r/20240625100859.15507-1-jack@xxxxxxx
Link: https://lkml.kernel.org/r/20240625101909.12234-1-jack@xxxxxxx
Signed-off-by: Jan Kara <jack@xxxxxxx>
Reviewed-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/readahead.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/readahead.c~readahead-make-sure-sync-readahead-reads-needed-page
+++ a/mm/readahead.c
@@ -580,7 +580,7 @@ static void ondemand_readahead(struct re
 	 */
 	expected = round_down(ra->start + ra->size - ra->async_size,
 			      1UL << order);
-	if (index == expected || index == (ra->start + ra->size)) {
+	if (folio && index == expected) {
 		ra->start += ra->size;
 		ra->size = get_next_ra_size(ra, max_pages);
 		ra->async_size = ra->size;
_

Patches currently in -mm which might be from jack@xxxxxxx are

revert-mm-writeback-fix-possible-divide-by-zero-in-wb_dirty_limits-again.patch
mm-avoid-overflows-in-dirty-throttling-logic.patch
readahead-make-sure-sync-readahead-reads-needed-page.patch
filemap-fix-page_cache_next_miss-when-no-hole-found.patch
readahead-properly-shorten-readahead-when-falling-back-to-do_page_cache_ra.patch
readahead-drop-pointless-index-from-force_page_cache_ra.patch
readahead-drop-index-argument-of-page_cache_async_readahead.patch
readahead-drop-dead-code-in-page_cache_ra_order.patch
readahead-drop-dead-code-in-ondemand_readahead.patch
readahead-disentangle-async-and-sync-readahead.patch
readahead-fold-try_context_readahead-into-its-single-caller.patch
readahead-simplify-gotos-in-page_cache_sync_ra.patch