On Tue, Nov 5, 2024 at 1:00 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Mon, Nov 04, 2024 at 10:30:15PM +0800, Yafang Shao wrote:
> > When testing large folio support with XFS on our servers, we observed that
> > only a few large folios are mapped when reading large files via mmap. This
> > behavior occurs because large folio support is currently implemented only
> > for sync readahead, not for async readahead. Consequently, while the
> > first filemap fault may map to a large folio, subsequent filemap faults are
> > mapped to regular folios. This can be verified with a simple test case, as
> > shown below:
>
> I swear I tested this and it worked at the time. Here's what's supposed
> to happen:
>
> You touch the first byte of the mapping, and there's no page in the page
> cache. So we go into the sync readahead path, and force it to read two
> PMDs. From your email, I assume this is succeeding. The readahead
> flag gets set on the second PMD so that when we get to 2MB, we go into
> the 'time to do more readahead' path, ie the 'do_async_mmap_readahead'
> function you patch below.
>
> Now, page_cache_async_ra() does this:
>
>         unsigned int order = folio_order(folio);
> ...
>         if (index == expected) {
>                 ra->start += ra->size;
>                 ra->size = get_next_ra_size(ra, max_pages);
>                 ra->async_size = ra->size;
>                 goto readit;
>         }
> readit:
>         ractl->_index = ra->start;
>         page_cache_ra_order(ractl, ra, order);
>
> So it should already be doing readahead of PMD size. Can you dig into
> what's going wrong, now that you understand that this is supposed to
> work already?

Upon further analysis, it appears that the behavior depends on the
/sys/block/*/queue/read_ahead_kb setting. On my test server, this
parameter is set to 128KB. For the current approach to work without
modification, it would need to be 2MB-aligned. However, I believe
MADV_HUGEPAGE behavior should not depend on the value of
read_ahead_kb.
It would be more robust if the kernel automatically aligned it to 2MB
when MADV_HUGEPAGE is enabled. The following change ensures
compatibility with non-2MB-aligned read_ahead_kb values:

diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..c024245497cc 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -641,9 +641,9 @@ void page_cache_async_ra(struct readahead_control *ractl,
 	expected = round_down(ra->start + ra->size - ra->async_size,
 			1UL << order);
 	if (index == expected) {
-		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max_pages);
-		ra->async_size = ra->size;
+		ra->start += ALIGN(ra->size, HPAGE_PMD_NR);
+		ra->size = ALIGN(get_next_ra_size(ra, max_pages), HPAGE_PMD_NR);
+		ra->async_size = ALIGN(ra->size, HPAGE_PMD_NR);
 		goto readit;
 	}

In addition, adjustments may be needed in the sync readahead path to
allow MADV_HUGEPAGE readahead sizes to be tuned based on
read_ahead_kb. I will continue to analyze this and submit a revised
version accordingly.

--
Regards
Yafang