Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior

Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> · Mon, 24 Feb 2025 16:52:24 +0000

On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote:
> On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > > Hello!
> > >
> > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > > Problem Statement
> > > > ===============
> > > >
> > > > Readahead can result in unnecessary page cache pollution for mapped
> > > > regions that are never accessed. Current mechanisms to disable
> > > > readahead lack granularity and rather operate at the file or VMA
> > > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > > potential solutions for optimizing page cache/readahead behavior.
> > > >
> > > >
> > > > Background
> > > > =========
> > > >
> > > > The read-ahead heuristics on file-backed memory mappings can
> > > > inadvertently populate the page cache with pages corresponding to
> > > > regions that user-space processes are known never to access e.g ELF
> > > > LOAD segment padding regions. While these pages are ultimately
> > > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > > particularly when a substantial quantity of such regions exists.
> > > >
> > > > Although the underlying file can be made sparse in these regions to
> > > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > > populating the page cache within these ranges. These pages, while
> > > > subject to reclaim, introduce additional churn to the LRU. This
> > > > reclaim overhead is further exacerbated in filesystems that support
> > > > "fault-around" semantics, that can populate the surrounding pages’
> > > > PTEs if found present in the page cache.
> > > >
> > > > While the memory impact may be negligible for large files containing a
> > > > limited number of sparse regions, it becomes appreciable for many
> > > > small mappings characterized by numerous holes. This scenario can
> > > > arise from efforts to minimize vm_area_struct slab memory footprint.
> > >
> > > OK, I agree the behavior you describe exists. But do you have some
> > > real-world numbers showing its extent? I'm not looking for some artificial
> > > numbers - sure bad cases can be constructed - but how big practical problem
> > > is this? If you can show that average Android phone has 10% of these
> > > useless pages in memory than that's one thing and we should be looking for
> > > some general solution. If it is more like 0.1%, then why bother?
> > >
> > > > Limitations of Existing Mechanisms
> > > > ===========================
> > > >
> > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > entire file, rather than specific sub-regions. The offset and length
> > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > POSIX_FADV_DONTNEED [2] cases.
> > > >
> > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > > VMA, rather than specific sub-regions. [3]
> > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > population persists. [4]
> > >
> > > Somewhere else in the thread you complain about readahead extending past
> > > the VMA. That's relatively easy to avoid at least for readahead triggered
> > > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > > do_sync_mmap_readahead()). I agree we could do that and that seems as a
> > > relatively uncontroversial change. Note that if someone accesses the file
> > > through standard read(2) or write(2) syscall or through different memory
> > > mapping, the limits won't apply but such combinations of access are not
> > > that common anyway.
> >
> > Hm I'm not sure sure, map elf files with different mprotect(), or mprotect()
> > different portions of a file and suddenly you lose all the readahead for the
> > rest even though you're reading sequentially?
>
> Well, you wouldn't loose all readahead for the rest. Just readahead won't
> preread data underlying the next VMA so yes, you get a cache miss and have
> to wait for a page to get loaded into cache when transitioning to the next
> VMA but once you get there, you'll have readahead running at full speed
> again.

I'm aware of how readahead works (I _believe_ there's currently a
pre-release of a book with a very extensive section on readahead written by
somebody :P).

Also been looking at it for file-backed guard regions recently, which is
why I've been commenting here specifically as it's been on my mind lately,
and also Kalesh's interest in this stems from a guard region 'scenario'
(hence my cc).

Anyway perhaps I didn't phrase this well - my concern is whether this might
impact performance in real world scenarios, such as one where a VMA is
mapped then mprotect()'d or mmap()'d in parts causing _separate VMAs_ of
the same file, in sequential order.