On Mon, Nov 11, 2024 at 6:33 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 08.11.24 15:17, Yafang Shao wrote:
> > When testing large folio support with XFS on our servers, we observed that
> > only a few large folios are mapped when reading large files via mmap.
> > After a thorough analysis, I identified it was caused by the
> > `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
> > parameter is set to 128KB. After I tune it to 2MB, the large folio can
> > work as expected. However, I believe the large folio behavior should not be
> > dependent on the value of read_ahead_kb. It would be more robust if the
> > kernel can automatically adapt to it.
>
> Now I am extremely confused.
>
> Documentation/ABI/stable/sysfs-block:
>
> "[RW] Maximum number of kilobytes to read-ahead for filesystems on this
> block device."
>
> So, with your patch, will we also be changing the readahead size to
> exceed that, or simply allocate larger folios and not exceeding the
> readahead size (e.g., leaving them partially non-filled)?

Exceeding the readahead size for the MADV_HUGEPAGE case is
straightforward; this is what the current patch accomplishes.

> If you're also changing the readahead behavior to exceed the
> configuration parameter it would sound to me like "I am pushing the
> brake pedal and my car brakes; fix the brakes to adopt whether to brake
> automatically" :)
>
> Likely I am missing something here, and how the read_ahead_kb parameter
> is used after your patch.

The read_ahead_kb parameter continues to function for non-MADV_HUGEPAGE
scenarios, whereas special handling is required for the MADV_HUGEPAGE
case.

It appears that we ought to update Documentation/ABI/stable/sysfs-block
to reflect the changes related to large folios, correct?
> > With /sys/block/*/queue/read_ahead_kb set to 128KB and performing a
> > sequential read on a 1GB file using MADV_HUGEPAGE, the differences in
> > /proc/meminfo are as follows:
> >
> > - before this patch
> > FileHugePages:     18432 kB
> > FilePmdMapped:      4096 kB
> >
> > - after this patch
> > FileHugePages:   1067008 kB
> > FilePmdMapped:   1048576 kB
> >
> > This shows that after applying the patch, the entire 1GB file is mapped to
> > huge pages. The stable list is CCed, as without this patch, large folios
> > don't function optimally in the readahead path.
> >
> > It's worth noting that if read_ahead_kb is set to a larger value that isn't
> > aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail to map
> > to hugepages.
> >
> > Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> > Suggested-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > Cc: stable@xxxxxxxxxxxxxxx
> >
> > ---
> >  mm/readahead.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > Changes:
> > v1->v2:
> > - Drop the align (Matthew)
> > - Improve commit log (Andrew)
> >
> > RFC->v1: https://lore.kernel.org/linux-mm/20241106092114.8408-1-laoar.shao@xxxxxxxxx/
> > - Simplify the code as suggested by Matthew
> >
> > RFC: https://lore.kernel.org/linux-mm/20241104143015.34684-1-laoar.shao@xxxxxxxxx/
> >
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 3dc6c7a128dd..9b8a48e736c6 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -385,6 +385,8 @@ static unsigned long get_next_ra_size(struct file_ra_state *ra,
> >  		return 4 * cur;
> >  	if (cur <= max / 2)
> >  		return 2 * cur;
> > +	if (cur > max)
> > +		return cur;
> >  	return max;
>
> Maybe something like
>
> 	return max_t(unsigned long, cur, max);
>
> might be more readable (likely "max()" cannot be used because of the
> local variable name "max" ...).
>
> ...
> but it's rather weird having a "max" and then returning something
> larger than the "max" ... especially with code like

Indeed, that could lead to confusion ;)

> "ra->size = get_next_ra_size(ra, max_pages);"
>
> Maybe we can improve that by renaming "max_pages" / "max" to what it
> actually is supposed to be (which I haven't quite understood yet).

Perhaps a more straightforward solution would be to implement it
directly at the call site, as demonstrated below?

diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..187efae95b02 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -642,7 +642,11 @@ void page_cache_async_ra(struct readahead_control *ractl,
 					  1UL << order);
 	if (index == expected) {
 		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max_pages);
+		/*
+		 * Allow the actual size to exceed the readahead window for a
+		 * large folio.
+		 */
+		ra->size = get_next_ra_size(ra, max(max_pages, ra->size));
 		ra->async_size = ra->size;
 		goto readit;
 	}

--
Regards
Yafang