Re: [PATCH v2] mm/readahead: Fix large folio support in async readahead

David Hildenbrand <david@xxxxxxxxxx> · Wed, 13 Nov 2024 11:24:01 +0100

FWIW, I looked at "read_ahead_kb" values on my Fedora40 notebook and
they are all set to 128KiB. I'm not so sure if they really should be
that small ...

It depends on the use case. For our hardop servers, we set it to 4MB,
as they prioritize throughput over latency. However, for our
Kubernetes servers, we keep it at 128KB since those services are more
latency-sensitive, and increasing it could lead to more frequent
latency spikes.

Thanks for sharing.

or if large folio readahead code should just be able to
exceed it.

"mm/filemap: Support VM_HUGEPAGE for file mappings" talks about "even if
we have no history of readahead being successful".

So not about exceeding the configured limit, but exceeding the
"readahead history".

So I consider VM_HUGEPAGE the sign here to "ignore readahead history"
and not to "violate the config".

MADV_HUGEPAGE is definitely a new addition to readahead, and its
behavior isn’t yet defined in the documentation. All we need to do is
clarify its behavior there. The documentation isn’t set in stone—we
can update it as long as it doesn’t disrupt existing applications.

If Willy thinks this is the way to go, then we should document that
MADV_HUGEPAGE may ignore the parameter, agreed.

I'll submit an additional patch to update the documentation for MADV_HUGEPAGE.

I still don't understand your one comment:

"It's worth noting that if read_ahead_kb is set to a larger value that
isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still
fail to map to hugepages."

Do you mean that MADV_HUGEPAGE+read_ahead_kb<=4M will give you 2M pages,
but MADV_HUGEPAGE+read_ahead_kb>4M won't? Or is this the case without
MADV_HUGEPAGE?

Typically, users set read_ahead_kb to aligned sizes, such as 128KB,
256KB, 512KB, 1MB, 2MB, 4MB, or 8MB. With this patch, MADV_HUGEPAGE
functions well for all these settings. However, if read_ahead_kb is
set to a non-hugepage-aligned size (e.g., 4MB + 128KB), MADV_HUGEPAGE
won’t work. This is because the initial readahead size for
MADV_HUGEPAGE is set to 4MB, as established in commit 4687fdbb805a:

    ra->size = HPAGE_PMD_NR;
    if (!(vmf->vma->vm_flags & VM_RAND_READ))
        ra->size *= 2;

However, as Willy noted, non-aligned settings are quite stupid, so we
should disregard them.

Right. What I've been wondering, to make this code easier to understand, 
if there should be some kind of ra->size_fixed=true parameter that tells 
readahead code to simply not mess with the ra->size until something 
changes. (below)

[...]

A quick tip for you: the readahead size already exceeds readahead_kb
even without MADV_HUGEPAGE. You might want to spend some time tracing
that behavior.

Care to save me some time and point me at what you mean?

I reached this conclusion by tracing ra->size in each
page_cache_ra_order() call, but I’m not fully equipped to provide all
the details ;）

I've been staring at the readhead code for 30 minutes and I am still 
lost. Either I'm too stupid for the code or the code is too complicated.

If we'd have something like

ra->start += ra->size;
/*
 * If someone like VM_HUGEPAGE fixed the size, just don't mess with it.
 */
if (!ra->size_fixed)
	ra->size = get_next_ra_size(ra, max_pages);
ra->async_size = ra->size;

It would be a lot clearer at least to me -- and we'd likely be able to 
get rid of the "unaligned readhahead" oddity.

If we'd grep for who sets "size_fixed", we could even figure out how all 
of this belongs together.

--
Cheers,

David / dhildenb