Re: [LSF/MM/BPF TOPIC] Mapping text with large folios

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 20 Mar 2025 07:38:05 +1100

On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
> On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >
> > Hi All,
> >
> > I know this is very last minute, but I was hoping that it might be possible to
> > squeeze in a session to discuss the following?

I'm not going to be at LSFMM, so I'd prefer this sort of thing get
discussed on the dev lists...

> > Summary/Background:
> >
> > On arm64, physically contiguous and naturally aligned regions can take advantage
> > of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
> > regions containing text, current readahead behaviour often yields small,
> > misaligned folios, preventing this optimization. This proposal introduces a
> > special-case path for executable mappings, performing synchronous reads of an
> > architecture-chosen size into large folios (64 KB on arm64). Early performance
> > tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
> > gains.
> 
> AFAIK, MySQL is quite sensitive to iTLB pressure. It should be worth
> adding to the tests.
> 
> >
> > I’ve previously posted attempts to enable this performance improvement ([1],
> > [2]), but there were objections and conversation fizzled out. Now that I have
> > more compelling performance data, I’m hoping there is now stronger
> > justification, and we can find a path forwards.
> >
> > What I’d Like to Cover:
> >
> >  - Describe how text memory should ideally be mapped and why it benefits
> >    performance.

I think the main people involved already understand this...

> >  - Brief review of performance data.

You don't need to convince me - there are three decades of evidence
that fewer, larger page table mappings for executables result in
better performance.

> >  - Discuss options for the best way to encourage text into large folios:
> >      - Let the architecture request a preferred size
> >      - Extend VMA attributes to include preferred THP size hint
> >      - Provide a sysfs knob
> >      - Plug into the “mapping min folio order” infrastructure
> >      - Other approaches?

Implement generic large folio/sequential PTE mapping optimisations
for each platform, then control it by letting the filesystem decide
what the desired mapping order and alignment should be for any given
inode mapping tree.

> Did you try LBS? You can have 64K block size with LBS, it should
> create large folios for page cache so text should get large folios
> automatically (IIRC arm64 linker script has 64K alignment by default).

We really don't want people using 64kB block size filesystems for
root filesystems - there are plenty of downsides to using huge block
sizes for filesystems that generally hold many tiny files.

However, I agree with the general principle that the fs should be
directing the inode mapping tree folio order behaviour.  i.e. the
filesystem already sets both the floor and the desired behaviour for
folio instantiation for any given inode mapping tree.

The filesystem also needs to be able to instantiate large folios
-before- the executable is mapped into VMAs via mmap() because files
can be read into cache before they are run (e.g. boot time readahead
hacks).
i.e. a mmap() time directive is too late to apply to the inode
mapping tree to guarantee optimal layout for PTE optimisation. It
also may not be possible to apply mmap() time directives due to
other filesystem constraints, so mmap() time directives may well end
up being unpredictable and unreliable....

There's also an obvious filesystem level trigger for enabling this
behaviour in a generic manner.  e.g. The filesystem can look at the
X perm bits on the inode at instantiation time and if they are set,
set a "desired order" value+flag on the mapping at inode cache
instantiation in addition to "min order".

If a desired order is configured, the page cache read code can then
pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
value to folio allocation. If that can't be allocated then it can
fall back to single page folios instead of failing.
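
The try-then-fall-back behaviour could look something like the
userspace sketch below. FGP_TRY_ORDER is only a proposed name, so
nothing here is an existing kernel API: alloc_folio_try_order() and
the alloc_fn callback are hypothetical, and falling back to the
mapping's min order (order 0, i.e. a single page, when no floor is
set) is my assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical allocator: true if a folio of 'order' can be allocated. */
typedef bool (*alloc_fn)(unsigned int order, void *ctx);

/*
 * Try the desired order once, then fall back to the floor instead of
 * failing. Returns the order obtained, or -1 if the floor fails too.
 */
static int alloc_folio_try_order(unsigned int desired_order,
				 unsigned int min_order,
				 alloc_fn try_alloc, void *ctx)
{
	assert(try_alloc != NULL);

	if (desired_order > min_order && try_alloc(desired_order, ctx))
		return (int)desired_order;

	return try_alloc(min_order, ctx) ? (int)min_order : -1;
}

/* Toy allocator for illustration: succeeds for orders <= *ctx. */
static bool alloc_up_to(unsigned int order, void *ctx)
{
	return order <= *(unsigned int *)ctx;
}
```

The point of the single optimistic attempt is that a fragmented
system degrades to today's behaviour rather than returning an error.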

At this point, we will always optimistically try to allocate larger
folios for executables on all architectures. Architectures that
can optimise sequential PTE mappings can then simply add generic
support for large folio optimisation, and more efficient executable
mappings simply fall out of the generic support for efficient
mapping of large folios and filesystems preferring large folios for
executable inode mappings....
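
For the architecture side, the eligibility test for a contiguous
mapping boils down to natural alignment plus size. A minimal sketch,
assuming a power-of-two block size (64kB for arm64 contpte with 4kB
pages, per the summary above) and a physically contiguous range;
can_map_contig() is a hypothetical helper, not an existing function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if [vaddr, vaddr + len) backed by physically contiguous memory
 * starting at paddr can begin with one 'block'-sized contiguous
 * mapping. 'block' must be a power of two.
 */
static bool can_map_contig(uint64_t vaddr, uint64_t paddr,
			   uint64_t len, uint64_t block)
{
	assert(block != 0 && (block & (block - 1)) == 0);

	return len >= block &&
	       (vaddr & (block - 1)) == 0 &&
	       (paddr & (block - 1)) == 0;
}
```

A small or misaligned folio fails this check, which is why the folio
order and alignment chosen at page cache instantiation matter.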

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx