Apologies, I just sent a response to Dave that raises most of the same points
that Barry raises here. I'll read the full thread before replying further :)

On 19/03/2025 22:13, Barry Song wrote:
> On Thu, Mar 20, 2025 at 9:38 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>
>> On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
>>> On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I know this is very last minute, but I was hoping that it might be possible to
>>>> squeeze in a session to discuss the following?
>>
>> I'm not going to be at LSFMM, so I'd prefer this sort of thing get
>> discussed on the dev lists...
>>
>>>> Summary/Background:
>>>>
>>>> On arm64, physically contiguous and naturally aligned regions can take advantage
>>>> of contpte mappings (e.g. 64 KB) to reduce iTLB pressure. However, for file
>>>> regions containing text, current readahead behaviour often yields small,
>>>> misaligned folios, preventing this optimization. This proposal introduces a
>>>> special-case path for executable mappings, performing synchronous reads of an
>>>> architecture-chosen size into large folios (64 KB on arm64). Early performance
>>>> tests on real-world workloads (e.g. nginx, redis, kernel compilation) show ~2-9%
>>>> gains.
>>>
>>> AFAIK, MySQL is quite sensitive to iTLB pressure. It should be worth
>>> adding to the tests.
>>>
>>>>
>>>> I’ve previously posted attempts to enable this performance improvement ([1],
>>>> [2]), but there were objections and conversation fizzled out. Now that I have
>>>> more compelling performance data, I’m hoping there is now stronger
>>>> justification, and we can find a path forwards.
>>>>
>>>> What I’d Like to Cover:
>>>>
>>>> - Describe how text memory should ideally be mapped and why it benefits
>>>> performance.
>>
>> I think the main people involved already understand this...
>>
>>>> - Brief review of performance data.
>>
>> You don't need to convince me - there's 3 decades of evidence
>> proving that larger, fewer page table mappings for executables
>> results in better performance.
>>
>>>> - Discuss options for the best way to encourage text into large folios:
>>>>   - Let the architecture request a preferred size
>>>>   - Extend VMA attributes to include preferred THP size hint
>>>>   - Provide a sysfs knob
>>>>   - Plug into the “mapping min folio order” infrastructure
>>>>   - Other approaches?
>>
>> Implement generic large folio/sequential PTE mapping optimisations
>> for each platform, then control it by letting the filesystem decide
>> what the desired mapping order and alignment should be for any given
>> inode mapping tree.
>>
>>> Did you try LBS? You can have 64K block size with LBS, it should
>>> create large folios for page cache so text should get large folios
>>> automatically (IIRC arm64 linker script has 64K alignment by default).
>>
>> We really don't want people using 64kB block size filesystems for
>> root filesystems - there are plenty of downsides to using huge block
>> sizes for filesystems that generally hold many tiny files.
>
> Agreed. Large folios will be compatible with existing file systems and
> applications, which don’t always require userspace to adopt them.
>
>>
>> However, I agree with the general principle that the fs should be
>> directing the inode mapping tree folio order behaviour. i.e. the
>> filesystem already sets both the floor and the desired behaviour for
>> folio instantiation for any given inode mapping tree.
>>
>> It also needs to be able to instantiate large folios -before- the
>> executable is mapped into VMAs via mmap() because files can be read
>> into cache before they are run (e.g. boot time readahead hacks).
>> i.e. a mmap() time directive is too late to apply to the inode
>> mapping tree to guarantee optimal layout for PTE optimisation.
>> It also may not be possible to apply mmap() time directives due to
>> other filesystem constraints, so mmap() time directives may well end
>> up being unpredictable and unreliable....
>>
>
> ELF loading and the linker may lead to readaheading a small portion
> of the code text before mmap(). However, once the executable files
> are large, the minor loss of large folios due to limited read-ahead of
> the text may not be substantial enough to justify consideration.
>
> But "boot time readahead hacks" seem like something that can read
> ahead significantly. Unless we can modify these "boot time readahead
> hacks" to use mmap() with EXEC mapping, it seems we would need
> something at the sys_read() to apply the preferred size.
>
>> There's also an obvious filesystem level trigger for enabling this
>> behaviour in a generic manner. e.g. The filesystem can look at the
>> X perm bits on the inode at instantiation time and if they are set,
>> set a "desired order" value+flag on the mapping at inode cache
>> instantiation in addition to "min order".
>>
>
> Not sure what proportion of an executable file is the text section. If it's
> less than 30% or 50%, it seems we might be allocating "preferred size"
> large folios to many other sections that may not benefit from them?
>
> Also, a Bash shell script with executable permissions might get a
> preferred large folio size. This seems weird?
>
> By the way, are .so files executable files, even though they may contain
> a lot of code? As I check my filesystems, it seems not:
>
> /usr/lib/aarch64-linux-gnu # ls -l libz.so.1.2.13
> -rw-r--r-- 1 root root 133280 Jan 11 2023 libz.so.1.2.13
>
>
>> If a desired order is configured, the page cache read code can then
>> pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
>> value to folio allocation. If that can't be allocated then it can
>> fall back to single page folios instead of failing.
>>
>> At this point, we will always optimistically try to allocate larger
>> folios for executables on all architectures. Architectures that
>> can optimise sequential PTE mappings can then simply add generic
>> support for large folio optimisation, and more efficient executable
>> mappings simply fall out of the generic support for efficient
>> mapping of large folios and filesystems preferring large folios for
>> executable inode mappings....
>
> I feel this falls more within the scope of architecture and memory
> management rather than the filesystem. If possible, we should try
> to avoid modifying the filesystem code?
>
>>
>> -Dave.
>> --
>> Dave Chinner
>> david@xxxxxxxxxxxxx
>
> Thanks
> Barry
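
On the question above of what proportion of an executable file is text: one
rough way to measure it is to sum the file bytes covered by executable PT_LOAD
segments. The sketch below is illustrative only and is not part of the
proposal. `exec_fraction` is a hypothetical helper name, it assumes a
little-endian ELF64 file, and an executable segment is only an approximation
of the text (depending on the linker it may also cover the ELF headers or
read-only data).

```python
import struct

PT_LOAD = 1   # loadable segment type
PF_X = 0x1    # segment "execute" permission flag


def exec_fraction(data: bytes) -> float:
    """Fraction of the file's bytes covered by executable PT_LOAD segments.

    Hypothetical helper for illustration; handles little-endian ELF64 only.
    """
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("only little-endian ELF64 is handled in this sketch")

    # Elf64_Ehdr: e_phoff at offset 0x20, e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)

    exec_bytes = 0
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        # Elf64_Phdr starts: p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz.
        p_type, p_flags, _offset, _vaddr, _paddr, p_filesz = struct.unpack_from(
            "<IIQQQQ", data, off
        )
        if p_type == PT_LOAD and (p_flags & PF_X):
            exec_bytes += p_filesz

    return exec_bytes / len(data)
```

For example, `exec_fraction(open("/usr/lib/aarch64-linux-gnu/libz.so.1.2.13", "rb").read())`
would report the fraction for the library listed above. A shell script with
the X bit set raises ValueError, since it is not an ELF file at all.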