Re: [LSF/MM/BPF TOPIC] Mapping text with large folios

On 19/03/2025 20:38, Dave Chinner wrote:
> On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
>> On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>
>>> Hi All,
>>>
>>> I know this is very last minute, but I was hoping that it might be possible to
>>> squeeze in a session to discuss the following?
> 
> I'm not going to be at LSFMM, so I'd prefer this sort of thing get
> discussed on the dev lists...

I'd be happy to do it that way, except it was you who raised the objections to
the original patch and then didn't engage with my responses [1]. So I was trying
to force the issue :)

[1] https://lore.kernel.org/all/bdde4008-60db-4717-a6b5-53d77ab76bdb@xxxxxxx/

> 
>>> Summary/Background:
>>>
>>> On arm64, physically contiguous and naturally aligned regions can take advantage
>>> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
>>> regions containing text, current readahead behaviour often yields small,
>>> misaligned folios, preventing this optimization. This proposal introduces a
>>> special-case path for executable mappings, performing synchronous reads of an
>>> architecture-chosen size into large folios (64 KB on arm64). Early performance
>>> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
>>> gains.
>>
>> AFAIK, MySQL is quite sensitive to iTLB pressure. It should be worth
>> adding to the tests.
>>
>>>
>>> I’ve previously posted attempts to enable this performance improvement ([1],
>>> [2]), but there were objections and conversation fizzled out. Now that I have
>>> more compelling performance data, I’m hoping there is now stronger
>>> justification, and we can find a path forwards.
>>>
>>> What I’d Like to Cover:
>>>
>>>  - Describe how text memory should ideally be mapped and why it benefits
>>>    performance.
> 
> I think the main people involved already understand this...
> 
>>>  - Brief review of performance data.
> 
> You don't need to convince me - there's 3 decades of evidence
> proving that larger, fewer page table mappings for executables
> results in better performance.

Sure, I was just trying to set the scene; I think it's worth 1 slide...

> 
>>>  - Discuss options for the best way to encourage text into large folios:
>>>      - Let the architecture request a preferred size
>>>      - Extend VMA attributes to include preferred THP size hint
>>>      - Provide a sysfs knob
>>>      - Plug into the “mapping min folio order” infrastructure
>>>      - Other approaches?
> 
> Implement generic large folio/sequential PTE mapping optimisations
> for each platform, then control it by letting the filesystem decide
> what the desired mapping order and alignment should be for any given
> inode mapping tree.

I don't really understand what this has to do with the filesystem. The
filesystem provides a hard floor for the *permitted* folio size (to satisfy
block size > page size constraints), but it's readahead that decides the actual
folio sizes, subject to meeting that constraint.

An ELF has multiple sections, so setting a particular minimum folio size for the
whole file doesn't seem appropriate. Additionally, for my use case there is no
hard requirement for a minimum folio size; it's just a preference. We can safely
fall back to the file's minimum folio size if allocation of the preferred folio
size fails, if it would run off the end of the file, etc.
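
For concreteness, this is roughly how that hard floor is plumbed today (a
minimal sketch: example_set_min_order() is a made-up caller, while
mapping_set_folio_min_order() is the existing helper added by the LBS work):

  #include <linux/fs.h>
  #include <linux/pagemap.h>

  static void example_set_min_order(struct inode *inode,
                                    unsigned int min_order)
  {
          /* A floor, not a preference: folios smaller than
           * 1 << min_order pages are not permitted for this
           * mapping. Readahead then chooses the actual folio
           * sizes at or above this floor. */
          mapping_set_folio_min_order(inode->i_mapping, min_order);
  }

Note there is no "preferred order" equivalent in that plumbing; that's exactly
the gap I'm talking about.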

I suspect I've misunderstood your proposal, because the way I've interpreted
it, it makes no sense to me...

> 
>> Did you try LBS? You can have 64K block size with LBS, it should
>> create large folios for page cache so text should get large folios
>> automatically (IIRC arm64 linker script has 64K alignment by default).
> 
> We really don't want people using 64kB block size filesystems for
> root filesystems - there are plenty of downsides to using huge block
> sizes for filesystems that generally hold many tiny files.
> 
> However, I agree with the general principle that the fs should be
> directing the inode mapping tree folio order behaviour.  i.e. the
> filesystem already sets both the floor and the desired behaviour for
> folio instantiation for any given inode mapping tree.
> 
> It also needs to be able to instantiate large folios -before- the
> executable is mapped into VMAs via mmap() because files can be read
> into cache before they are run (e.g. boot time readahead hacks).
> i.e. a mmap() time directive is too late to apply to the inode
> mapping tree to guarantee optimal layout for PTE optimisation. It
> also may not be possible to apply mmap() time directives due to
> other filesystem constraints, so mmap() time directives may well end
> up being unpredictable and unreliable....

Agreed on this issue. A common manifestation is when user space read()s the ELF
header to figure out how to mmap the file. The read() causes readahead of
multiple pages, which end up in the page cache as (commonly) 16K folios that
often overlap into the text section; so after the text section gets mmap()ed,
the folios already in the page cache are faulted into the process as-is.
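
To spell out the sequence (a hypothetical loader fragment, error handling
omitted; the 0x100000 length and zero file offset are just placeholders):

  #include <elf.h>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  void map_text(const char *path)
  {
          Elf64_Ehdr ehdr;
          int fd = open(path, O_RDONLY);

          /* This read() triggers readahead; the surrounding pages
           * land in the page cache as small (commonly 16K) folios,
           * often spilling into the text section. */
          read(fd, &ehdr, sizeof(ehdr));

          /* By the time text is mapped, those small folios already
           * exist and are faulted into the process as-is. */
          mmap(NULL, 0x100000, PROT_READ | PROT_EXEC,
               MAP_PRIVATE, fd, 0);
  }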

We have separately been exploring the possibility of modifying the readahead()
syscall behaviour: if user space asks to read ahead a large chunk, it makes
sense to use that as a hint that the region should be treated as a single object
and read into the largest possible folios. Today, if some of the requested
region is already in the page cache, readahead will only read the bits that are
not present. But it might be preferable to just drop the bits that are present
and re-read into large folios.

Of course you wouldn't want user space to issue readahead() calls for the
entirety of the text section. But if the binary were post-linked with BOLT or
some PGO solution that puts the hot code at the front of the section, the linker
could detect this and request readahead for just the hot part.
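
From the user-space side that would be trivial; something like this sketch,
where the new semantics I described above (large request => largest possible
folios) are of course hypothetical, and hint_hot_text() is a made-up helper:

  #define _GNU_SOURCE
  #include <fcntl.h>

  static void hint_hot_text(int fd, off_t text_off, size_t hot_len)
  {
          /* hot_len covers only the BOLT/PGO-ordered hot prefix of
           * .text. Hypothetical kernel semantics: a single large
           * request means "treat this region as one object and use
           * the largest possible folios". */
          readahead(fd, text_off, hot_len);
  }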

But independent of the readahead() stuff, VM_EXEC is a good enough indicator to
control this best-effort feature in my view; it is sufficient most of the time.
And indeed there is already precedent, because readahead consumes MADV_HUGEPAGE
in exactly the same way.
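
What I have in mind is along these lines (sketch only:
arch_wants_exec_folio_order() is a hypothetical per-arch hook, which arm64
would define to return the order of 64K; page_cache_ra_order() is the existing
helper in mm/readahead.c):

  static void sync_ra_exec(struct readahead_control *ractl,
                           unsigned long vm_flags)
  {
          int order = arch_wants_exec_folio_order();

          if ((vm_flags & VM_EXEC) && order > 0) {
                  /* Align the start so the resulting folio can be
                   * contpte-mapped, then do a best-effort
                   * large-folio read, mirroring how MADV_HUGEPAGE
                   * is already consumed via vm_flags. */
                  ractl->_index &= ~((unsigned long)(1 << order) - 1);
                  page_cache_ra_order(ractl, ractl->ra, order);
          }
  }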

> 
> There's also an obvious filesystem level trigger for enabling this
> behaviour in a generic manner.  e.g. The filesystem can look at the
> X perm bits on the inode at instantiation time and if they are set,
> set a "desired order" value+flag on the mapping at inode cache
> instantiation in addition to "min order".

My understanding is that the X permission only controls whether the kernel will
permit exec()ing the file; it doesn't prevent the file from being mapped
executable in a process. And shared libraries usually don't set the X perm. So
I'm not sure this works.

> 
> If a desired order is configured, the page cache read code can then
> pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
> value to folio allocation. If that can't be allocated then it can
> fall back to single page folios instead of failing.

I don't see FGP_TRY_ORDER in the source; is that new, or are you proposing it
as an addition? I guess this would mainly just disable reclaim? I agree that
this wants to be a best-effort allocation. I'm just disagreeing that we want to
direct the policy from the filesystem; why would we want to have to implement
the policy for every filesystem?
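
FWIW, my reading of the FGP_TRY_ORDER proposal is a fallback along these lines
(sketch only; FGP_TRY_ORDER doesn't exist today and alloc_folio_try_order() is
a made-up helper, but filemap_alloc_folio() and mapping_min_folio_order() are
the real APIs):

  #include <linux/pagemap.h>

  struct folio *alloc_folio_try_order(struct address_space *mapping,
                                      unsigned int desired_order,
                                      gfp_t gfp)
  {
          struct folio *folio;

          /* Best effort: don't trigger reclaim/compaction for the
           * preferred size. */
          folio = filemap_alloc_folio(gfp | __GFP_NORETRY | __GFP_NOWARN,
                                      desired_order);
          if (folio)
                  return folio;

          /* Fall back to the smallest permitted folio rather than
           * failing the read. */
          return filemap_alloc_folio(gfp,
                                     mapping_min_folio_order(mapping));
  }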

> 
> At this point, we will always optimistically try to allocate larger
> folios for executables on all architectures. Architectures that
> can optimise sequential PTE mappings can then simply add generic
> support for large folio optimisation, and more efficient executable
> mappings simply fall out of the generic support for efficient
> mapping of large folios and filesystems preferring large folios for
> executable inode mappings....

arm64 already has the large folio mapping optimizations. It's called "contpte":
it opportunistically sets the contiguous bit across a block of PTEs when the
folio's size and alignment allow.
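
The condition it effectively checks is this (illustrative only, not the literal
code; see arch/arm64/mm/contpte.c, where CONT_PTES is 16 with 4K pages, so a
naturally aligned 64K folio qualifies):

  static bool contpte_suitable(unsigned long addr, unsigned long pfn,
                               unsigned int nr_pages)
  {
          /* Enough pages, VA naturally aligned to the contpte block,
           * and PA alignment matching. */
          return nr_pages >= CONT_PTES &&
                 IS_ALIGNED(addr, CONT_PTE_SIZE) &&
                 IS_ALIGNED(pfn, CONT_PTES);
  }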

It sounds to me like we agree on most of this, but disagree on where the policy
should be directed and on which heuristic: filesystem + X perm bit, or
readahead + VM_EXEC.

Thanks,
Ryan

> 
> -Dave.




