Re: [LSF/MM/BPF TOPIC] Mapping text with large folios

On 20/03/2025 00:53, Dave Chinner wrote:
> On Thu, Mar 20, 2025 at 11:13:11AM +1300, Barry Song wrote:
>> On Thu, Mar 20, 2025 at 9:38 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> On Wed, Mar 19, 2025 at 11:16:16AM -0700, Yang Shi wrote:
>>>> On Wed, Mar 19, 2025 at 8:39 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>> However, I agree with the general principle that the fs should be
>>> directing the inode mapping tree folio order behaviour.  i.e. the
>>> filesystem already sets both the floor and the desired behaviour for
>>> folio instantiation for any given inode mapping tree.
>>>
>>> It also needs to be able to instantiate large folios -before- the
>>> executable is mapped into VMAs via mmap() because files can be read
>>> into cache before they are run (e.g. boot time readahead hacks).
>>> i.e. a mmap() time directive is too late to apply to the inode
>>> mapping tree to guarantee optimal layout for PTE optimisation. It
>>> also may not be possible to apply mmap() time directives due to
>>> other filesystem constraints, so mmap() time directives may well end
>>> up being unpredictable and unreliable....
>>>
>>
>> ELF loading and the linker may cause a small portion of the text
>> to be read ahead before mmap(). However, when the executable files
>> are large, the minor loss of large folios due to this limited
>> read-ahead of the text is probably too small to be worth worrying about.
>>
>> But "boot time readahead hacks" seem like something that can read
>> ahead significantly. Unless we can modify these "boot time readahead
>> hacks" to use mmap() with EXEC mapping, it seems we would need
>> something at the sys_read() to apply the preferred size.
> 
> Yes, that's exactly what I said. :)
> 
> But you haven't understood the example I gave (i.e. boot time
> readahead). There are many ways for executables to end up cached
> without being mapped executable. They get accessed by a linker during
> compilation of code. They get updated by the OS package manager.
> A backup or deduplication program accesses them. A virus scanner
> reads them looking for trojans, etc.

But most of these other ways are sequentially reading or writing the file, so
readahead will work more or less as expected in these cases, quickly ramping up
to bigger and bigger folios, I think? So most of the file will end up in folios
at least as large as 64K. When mapped, arm64 will be able to set the contpte bit.

In my experience, it's only when we are faulting in memory due to execution that
the pattern becomes random access and readahead never reads ahead far enough to
use larger folios - that's the case that needs help.
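
To make that ramp-up concrete, here's a simplified model of how each
sequential readahead round can grow the folio order (a sketch only, not
the actual mm/readahead.c logic, which has extra handling for min_order,
EOF and index alignment):

	/* Hypothetical sketch, not actual kernel code. */
	static unsigned int next_ra_order(unsigned int order,
					  unsigned long ra_size_pages,
					  unsigned int max_order)
	{
		/* Each sequential round bumps the target order... */
		order += 2;

		/* ...capped by the largest order the mapping supports... */
		order = min(order, max_order);

		/* ...and by the number of pages the window actually
		 * covers, so a short window forces a smaller order. */
		return min_t(unsigned int, order, ilog2(ra_size_pages));
	}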

> 
> i.e. there are lots of ways of getting executables cached that
> prevent optimal large folio formation if the filesystem doesn't
> directly control formation of said large folios.
> 
> Hence if we don't apply large folio selection criteria to -all-
> buffered IO (read, write and mmap), the result when mmap(EXEC)
> occurs is going to be .... unpredictable and not always optimal.
> 
> So assuming that the cache is cold, we want filemap_fault() to
> allocate large folios from cache misses on read faults, yes?

Large folios of a preferred size, yes.

> 
> That lands us in do_sync_mmap_readahead(), and that has a bit of a
> problem w.r.t. large folios. It ends up calling:
> 
> 	page_cache_ra_order(.... new_order = 0)
> 
> This limits folios allocated by readahead to order-2 in size, unless
> the mapping was instantiated by the filesystem with a larger
> min_order, in which case it will use the larger min_order value.
> 
> Either way, we don't get the desired large folio size the arch wants
> to optimise the page table mappings.
> 
> I'd suggest this would be fixed by something like this in
> do_sync_mmap_readahead():
> 
> -	page_cache_ra_order(..., 0);
> +	new_order = 0;
> +	if (is_exec_mapping(vmf->vma->vm_flags))
> +		new_order = <arch specific optimal pte mapping order>
> +	page_cache_ra_order(..., new_order);

That's pretty much what my first attempt at upstreaming does. It's not quite
that straightforward though, because we also have to modify the readahead sync
and async sizes to read an exact multiple of 64K. Otherwise
page_cache_ra_order() will reduce the order of the folio(s) to fit the requested
data size. The "new_order" is only a target starting point.
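
i.e. the target order gets clamped to what the window actually covers; a
sketch of that behaviour (my paraphrase, not a verbatim quote of
page_cache_ra_order()):

	/* With 4K pages, a 64K target is order-4, so ra->size must be a
	 * multiple of 16 pages or the order gets knocked back down. */
	new_order = min_t(unsigned int, new_order, ilog2(ra->size));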

My code follows the same pattern already used for MADV_HUGEPAGE mappings in
do_sync_mmap_readahead():

	/*
	 * Allow arch to request a preferred minimum folio order for executable
	 * memory. This can often be beneficial to performance if (e.g.) arm64
	 * can contpte-map the folio. Executable memory rarely benefits from
	 * read-ahead anyway, due to its random access nature.
	 */
	if (vm_flags & VM_EXEC) {
		int order = arch_wants_exec_folio_order();

		if (order >= 0) {
			fpin = maybe_unlock_mmap_for_io(vmf, fpin);
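			/* Read exactly 1 << order pages in one synchronous
			 * batch; don't grow an async window beyond it. */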
			ra->size = 1UL << order;
			ra->async_size = 0;
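			/* Align the start index to the folio size so
			 * page_cache_ra_order() doesn't reduce the order. */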
			ractl._index &= ~((unsigned long)ra->size - 1);
			page_cache_ra_order(&ractl, ra, order);
			return fpin;
		}
	}

On arm64, this would do a sync 64K read into a 64K folio most of the time.
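
For completeness, the arch hook could look something like this (my sketch
of the shape, using the names from the snippet above; -1 from the generic
fallback means "no preference"):

	/* Generic fallback, e.g. in include/linux/pgtable.h. */
	#ifndef arch_wants_exec_folio_order
	static inline int arch_wants_exec_folio_order(void)
	{
		return -1;
	}
	#endif

	/* arm64 override, e.g. in arch/arm64/include/asm/pgtable.h:
	 * prefer 64K so a naturally aligned folio can be contpte-mapped. */
	#define arch_wants_exec_folio_order arch_wants_exec_folio_order
	static inline int arch_wants_exec_folio_order(void)
	{
		return ilog2(SZ_64K >> PAGE_SHIFT);
	}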

> 
> And now the page cache will be populated with large folios of at
> least the order requested if the filesystem can support folios of that
> size.
> 
> Unless I've misunderstood something (cold cache instantiation of
> 64kB folios is what you desired, isn't it?), that small change
> should largely make exec mappings behave the way you want...

So it sounds like you support this proposed approach?

> 
>>> There's also an obvious filesystem level trigger for enabling this
>>> behaviour in a generic manner.  e.g. The filesystem can look at the
>>> X perm bits on the inode at instantiation time and if they are set,
>>> set a "desired order" value+flag on the mapping at inode cache
>>> instantiation in addition to "min order".
>>>
>>
>> Not sure what proportion of an executable file is the text section. If it's
>> less than 30% or 50%, it seems we might be allocating "preferred size"
>> large folios to many other sections that may not benefit from them?
>>
>> Also, a Bash shell script with executable permissions might get a
>> preferred large folio size. This seems weird?
> 
> But none of this is actually a problem at all.  Fewer, larger folios
> still mean less page cache and memory reclaim management overhead
> even if there is no direct benefit from optimised page table
> mapping.
> 
> Also, we typically know the file size at mapping tree instantiation
> time and hence we could make a sane decision as to whether large
> folios should be used for any specific executable file.
> 
>> By the way, are .so files executable files, even though they may contain
>> a lot of code? Checking my filesystem, it seems not:
>>
>> /usr/lib/aarch64-linux-gnu # ls -l libz.so.1.2.13
>> -rw-r--r-- 1 root root 133280 Jan 11  2023 libz.so.1.2.13
> 
> True, I hadn't considered that.
> 
> Seems like fixing do_sync_mmap_readahead() might be the best way to
> go then....

OK sounds like we might be converging :)

Thanks,
Ryan

> 
>>> If a desired order is configured, the page cache read code can then
>>> pass a FGP_TRY_ORDER flag with the fgp_order set to the desired
>>> value to folio allocation. If that can't be allocated then it can
>>> fall back to single page folios instead of failing.
>>>
>>> At this point, we will always optimistically try to allocate larger
>>> folios for executables on all architectures. Architectures that
>>> can optimise sequential PTE mappings can then simply add generic
>>> support for large folio optimisation, and more efficient executable
>>> mappings simply fall out of the generic support for efficient
>>> mapping of large folios and filesystems preferring large folios for
>>> executable inode mappings....
>>
>> I feel this falls more within the scope of architecture and memory
>> management rather than the filesystem. If possible, we should try
>> to avoid modifying the filesystem code?
> 
> Large folios may be a MM construct, but you can't use them
> in the page cache without the backing filesystem being fully aware
> of them, and the mm subsystem has to work within the constraints the
> filesystem places on large folios in the page cache.
> 
> If we need to change constraints or enact new policies around
> file IO specific large folio optimisations, then we definitely are
> going to need to modify both mm and filesystem code to implement
> them....
> 
> -Dave.
