Re: [LSF/MM/BPF TOPIC] Mapping text with large folios

Ryan Roberts <ryan.roberts@xxxxxxx> · Thu, 20 Mar 2025 12:16:04 +0000

Apologies, I just sent a response to Dave that raises most of the same points
that Barry raises here. I'll read the full thread before replying further :)

On 19/03/2025 22:13, Barry Song wrote:
> On Thu, Mar 20, 2025 at 9:38 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>
>> On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
>>> On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I know this is very last minute, but I was hoping that it might be possible to
>>>> squeeze in a session to discuss the following?
>>
>> I'm not going to be at LSFMM, so I'd prefer this sort of thing get
>> discussed on the dev lists...
>>
>>>> Summary/Background:
>>>>
>>>> On arm64, physically contiguous and naturally aligned regions can take advantage
>>>> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
>>>> regions containing text, current readahead behaviour often yields small,
>>>> misaligned folios, preventing this optimization. This proposal introduces a
>>>> special-case path for executable mappings, performing synchronous reads of an
>>>> architecture-chosen size into large folios (64 KB on arm64). Early performance
>>>> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
>>>> gains.
>>>
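
To make the "naturally aligned" requirement concrete, here is a rough sketch
that counts how many 64 KB blocks of a physically contiguous region could be
contpte-mapped. The 4 KB base page size, the 16-entry block and the function
name are assumptions for illustration only; this is not kernel code.

```python
PAGE_SIZE = 4096                     # assumed base page size
CONT_PTES = 16                       # assumed PTEs per contiguous block
CONT_SIZE = PAGE_SIZE * CONT_PTES    # 64 KB under the assumptions above


def contpte_eligible_blocks(vaddr, paddr, length):
    """Count 64 KB blocks in [vaddr, vaddr + length) aligned in both VA and PA.

    Assumes the whole range is physically contiguous starting at paddr.
    """
    # VA and PA must share the same offset within a 64 KB block, otherwise
    # no block can be naturally aligned in both address spaces at once.
    if (vaddr - paddr) % CONT_SIZE != 0:
        return 0
    start = (vaddr + CONT_SIZE - 1) // CONT_SIZE * CONT_SIZE  # round up
    end = (vaddr + length) // CONT_SIZE * CONT_SIZE           # round down
    return max(0, (end - start) // CONT_SIZE)
```

A small, misaligned folio fails the first check or leaves no whole block
between start and end, which is the situation the summary attributes to
current readahead behaviour.
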
>>> AFAIK, MySQL is quite sensitive to iTLB pressure. It should be worth
>>> adding to the tests.
>>>
>>>>
>>>> I’ve previously posted attempts to enable this performance improvement ([1],
>>>> [2]), but there were objections and conversation fizzled out. Now that I have
>>>> more compelling performance data, I’m hoping there is now stronger
>>>> justification, and we can find a path forwards.
>>>>
>>>> What I’d Like to Cover:
>>>>
>>>>  - Describe how text memory should ideally be mapped and why it benefits
>>>>    performance.
>>
>> I think the main people involved already understand this...
>>
>>>>  - Brief review of performance data.
>>
>> You don't need to convince me - there's 3 decades of evidence
>> proving that larger, fewer page table mappings for executables
>> results in better performance.
>>
>>>>  - Discuss options for the best way to encourage text into large folios:
>>>>      - Let the architecture request a preferred size
>>>>      - Extend VMA attributes to include preferred THP size hint
>>>>      - Provide a sysfs knob
>>>>      - Plug into the “mapping min folio order” infrastructure
>>>>      - Other approaches?
>>
>> Implement generic large folio/sequential PTE mapping optimisations
>> for each platform, then control it by letting the filesystem decide
>> what the desired mapping order and alignment should be for any given
>> inode mapping tree.
>>
>>> Did you try LBS? You can have 64K block size with LBS, it should
>>> create large folios for page cache so text should get large folios
>>> automatically (IIRC arm64 linker script has 64K alignment by default).
>>
>> We really don't want people using 64kB block size filesystems for
>> root filesystems - there are plenty of downsides to using huge block
>> sizes for filesystems that generally hold many tiny files.
> 
> Agreed. Large folios will be compatible with existing file systems and
> applications, which don’t always require userspace to adopt them.
> 
>>
>> However, I agree with the general principle that the fs should be
>> directing the inode mapping tree folio order behaviour.  i.e. the
>> filesystem already sets both the floor and the desired behaviour for
>> folio instantiation for any given inode mapping tree.
>>
>> It also needs to be able to instantiate large folios -before- the
>> executable is mapped into VMAs via mmap() because files can be read
>> into cache before they are run (e.g. boot time readahead hacks).
>> i.e. a mmap() time directive is too late to apply to the inode
>> mapping tree to guarantee optimal layout for PTE optimisation. It
>> also may not be possible to apply mmap() time directives due to
>> other filesystem constraints, so mmap() time directives may well end
>> up being unpredictable and unreliable....
>>
> 
> ELF loading and the linker may lead to readaheading a small portion
> of the code text before mmap(). However, once the executable files
> are large, the minor loss of large folios due to limited read-ahead of
> the text may not be substantial enough to justify consideration.
> 
> But "boot time readahead hacks" seem like something that can read
> ahead significantly. Unless we can modify these "boot time readahead
> hacks" to use mmap() with EXEC mapping, it seems we would need
> something at the sys_read() to apply the preferred size.
> 
>> There's also an obvious filesystem level trigger for enabling this
>> behaviour in a generic manner.  e.g. The filesystem can look at the
>> X perm bits on the inode at instantiation time and if they are set,
>> set a "desired order" value+flag on the mapping at inode cache
>> instantiation in addition to "min order".
>>
> 
> Not sure what proportion of an executable file is the text section. If it's
> less than 30% or 50%, it seems we might be allocating "preferred size"
> large folios to many other sections that may not benefit from them?
> 
> Also, a Bash shell script with executable permissions might get a
> preferred large folio size. This seems weird?
> 
> By the way, are .so files executable files, even though they may contain
> a lot of code? As I check my filesystems, it seems not:
> 
> /usr/lib/aarch64-linux-gnu # ls -l libz.so.1.2.13
> -rw-r--r-- 1 root root 133280 Jan 11  2023 libz.so.1.2.13
> 
> 
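
One way to put a number on the "what proportion is text" question is to sum
the file bytes covered by executable PT_LOAD segments. The sketch below is a
userspace approximation only: it handles little-endian ELF64 images, assumes
segments do not overlap in the file, and the function name is made up for
illustration.

```python
import struct

PT_LOAD = 1   # loadable segment
PF_X = 0x1    # segment is mapped executable


def exec_fraction(data: bytes) -> float:
    """Fraction of file bytes covered by executable PT_LOAD segments."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 image")
    # Elf64_Ehdr: e_phoff at 0x20, e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    exec_bytes = 0
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        # Elf64_Phdr: p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz
        p_type, p_flags, _offset, _vaddr, _paddr, p_filesz = struct.unpack_from(
            "<IIQQQQ", data, off
        )
        if p_type == PT_LOAD and (p_flags & PF_X):
            exec_bytes += p_filesz
    return exec_bytes / len(data)
```

Feeding it the contents of a library such as the libz.so above, read with
open(path, "rb").read(), would give the fraction for that file. Note that it
looks at the ELF segment flags rather than the inode permission bits, so it
is unaffected by the .so file lacking X permissions.
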
>> If a desired order is configured, the page cache read code can then
>> pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
>> value to folio allocation. If that can't be allocated then it can
>> fall back to single page folios instead of failing.
>>
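
As a toy model of the policy described above (set a desired order when the
inode has X permission bits, try that order first, fall back rather than
fail), something like the following. It does not use any kernel API:
DESIRED_EXEC_ORDER, desired_order(), alloc_folio() and the try_alloc callback
are all hypothetical names, and order 4 assumes 4 KB pages and a 64 KB target.

```python
import stat

DESIRED_EXEC_ORDER = 4   # hypothetical: 2**4 pages = 64 KB with 4 KB pages


def desired_order(mode, min_order=0):
    """Pick the order to try first, based only on the inode's X bits."""
    if mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        return max(min_order, DESIRED_EXEC_ORDER)
    return min_order


def alloc_folio(try_alloc, mode, min_order=0):
    """Try the desired order; on failure retry at min_order instead of failing.

    try_alloc(order) is a caller-supplied allocator returning None on failure.
    """
    order = desired_order(mode, min_order)
    folio = try_alloc(order)
    if folio is None and order > min_order:
        folio = try_alloc(min_order)
    return folio
```

With min_order left at 0 the fallback is a single page folio. The model also
shows the mode-bit concern: a file with mode 0644 gets no hint here, while
any file with an X bit set does, whatever it contains.
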
>> At this point, we will always optimistically try to allocate larger
>> folios for executables on all architectures. Architectures that
>> can optimise sequential PTE mappings can then simply add generic
>> support for large folio optimisation, and more efficient executable
>> mappings simply fall out of the generic support for efficient
>> mapping of large folios and filesystems preferring large folios for
>> executable inode mappings....
> 
> I feel this falls more within the scope of architecture and memory
> management rather than the filesystem. If possible, we should try
> to avoid modifying the filesystem code?
> 
>>
>> -Dave.
>> --
>> Dave Chinner
>> david@xxxxxxxxxxxxx
> 
> Thanks
> Barry