On Thu, Nov 7, 2024 at 12:52 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Wed, Nov 06, 2024 at 05:21:14PM +0800, Yafang Shao wrote:
> > When testing large folio support with XFS on our servers, we observed that
> > only a few large folios are mapped when reading large files via mmap.
> > After a thorough analysis, I identified that it was caused by the
> > `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
> > parameter is set to 128KB. After I tuned it to 2MB, large folios worked
> > as expected. However, I believe the large folio behavior should not
> > depend on the value of read_ahead_kb. It would be more robust if the
> > kernel could automatically adapt to it.
> >
> > With `/sys/block/*/queue/read_ahead_kb` set to a non-2MB-aligned size,
> > this issue can be verified with a simple test case, as shown below:
>
> I don't like to see these programs in commit messages.  If it's a
> valuable program, it should go into tools/testing.  If not, it shouldn't
> be in the commit message.

I just want to make it easy for others to verify this change. I'm okay
with dropping this program.

>
> > When large folio support is enabled and read_ahead_kb is set to a smaller
> > value, ra->size (4MB) may exceed the maximum allowed size (e.g., 128KB). To
> > address this, we need to add a conditional check for such cases. However,
> > this alone is insufficient, as users might set read_ahead_kb to a larger,
> > non-hugepage-aligned value (e.g., 4MB + 128KB). In these instances, it is
> > essential to explicitly align ra->size with the hugepage size.
>
> I wish you'd discussed this in the earlier thread instead of just
> smashing it into this patch.  Because your solution is wrong.
>
> > @@ -642,7 +644,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> >  				1UL << order);
> >  	if (index == expected) {
> >  		ra->start += ra->size;
> > -		ra->size = get_next_ra_size(ra, max_pages);
> > +		ra->size = ALIGN(get_next_ra_size(ra, max_pages), 1 << order);
>
> Let's suppose that someone sets read_ahead_kb to 192kB.  If the previous
> readahead did 128kB, we now try to align that to 128kB, so we'll readahead
> 256kB, which is larger than max.  We were only intending to breach the
> 'max' for the MADV_HUGE case, not for all cases.

In the non-MADV_HUGE case, the order can only be 0, correct?

>
> Honestly, I don't know if we should try to defend a stupid sysadmin
> against the consequences of their misconfiguration like this.  I'd be in
> favour of getting rid of the configuration knob entirely (or just ignoring
> what the sysadmin set it to), but if we do that, we need to replace it
> with something that can automatically figure out what the correct setting
> for the readahead_max_kb should be (which is probably a function of the
> bandwidth, latency and seek time of the underlying device).
>
> But that's obviously not part of this patch.  I'd be in favour of just
> dropping this ALIGN and leaving just the first hunk of the patch.

I'm okay with removing this ALIGN, since we won't be setting a large
read_ahead_kb value in our production environment. ;)

--
Regards
Yafang