On Sat, Feb 22, 2025 at 10:03 AM Kent Overstreet
<kent.overstreet@xxxxxxxxx> wrote:
>
> On Fri, Feb 21, 2025 at 01:13:15PM -0800, Kalesh Singh wrote:
> > Hi organizers of LSF/MM,
> >
> > I realize this is a late submission, but I was hoping there might
> > still be a chance to have this topic considered for discussion.
> >
> > Problem Statement
> > =================
> >
> > Readahead can result in unnecessary page cache pollution for mapped
> > regions that are never accessed. Current mechanisms to disable
> > readahead lack granularity and instead operate at the file or VMA
> > level. This proposal seeks to initiate discussion at LSF/MM to
> > explore potential solutions for optimizing page cache/readahead
> > behavior.
> >
> > Background
> > ==========
> >
> > The readahead heuristics on file-backed memory mappings can
> > inadvertently populate the page cache with pages corresponding to
> > regions that user-space processes are known never to access, e.g.
> > ELF LOAD segment padding regions. While these pages are ultimately
> > reclaimable, their presence precipitates unnecessary I/O,
> > particularly when a substantial number of such regions exists.
> >
> > Although the underlying file can be made sparse in these regions to
> > mitigate I/O, readahead will still allocate discrete zero pages when
> > populating the page cache within these ranges. These pages, while
> > subject to reclaim, introduce additional churn to the LRU. This
> > reclaim overhead is further exacerbated in filesystems that support
> > "fault-around" semantics, which can populate the surrounding pages'
> > PTEs if they are found present in the page cache.
> >
> > While the memory impact may be negligible for large files containing
> > a limited number of sparse regions, it becomes appreciable for many
> > small mappings characterized by numerous holes. This scenario can
> > arise from efforts to minimize vm_area_struct slab memory footprint.
> >
> > Limitations of Existing Mechanisms
> > ==================================
> >
> > fadvise(..., POSIX_FADV_RANDOM, ...): disables readahead for the
> > entire file, rather than specific sub-regions. The offset and length
> > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > POSIX_FADV_DONTNEED [2] cases.
> >
> > madvise(..., MADV_RANDOM, ...): Similarly, this applies to the
> > entire VMA, rather than specific sub-regions. [3]
> >
> > Guard Regions: While guard regions for file-backed VMAs circumvent
> > fault-around concerns, the fundamental issue of unnecessary page
> > cache population persists. [4]

Hi Kent. Thanks for taking a look at this.

> What if we introduced something like
>
> madvise(..., MADV_READAHEAD_BOUNDARY, offset)
>
> Would that be sufficient? And would a single readahead boundary offset
> suffice?

I like the idea of having boundaries. In this particular example the
single boundary suffices, though I think we'll need to support multiple
(see below).

One requirement that we'd like to meet is that the solution doesn't
cause VMA splits, to avoid additional slab usage, so perhaps fadvise()
is better suited to this?

Another behavior of "mmap readahead" is that it doesn't really respect
the VMA (start, end) boundaries. The below demonstrates readahead past
the end of the mapped region of the file:

sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && ./pollute_page_cache.sh

Creating sparse file of size 25 pages
Apparent Size: 100K
Real Size: 0
Number cached pages: 0
Reading first 5 pages via mmap...
Mapping and reading pages: [0, 6) of file 'myfile.txt'
Number cached pages: 25

Similarly, readahead can bring in pages before the start of the mapped
region. I believe this is due to mmap "read-around" [6]:

sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && ./pollute_page_cache.sh

Creating sparse file of size 25 pages
Apparent Size: 100K
Real Size: 0
Number cached pages: 0
Reading last 5 pages via mmap...
Mapping and reading pages: [20, 25) of file 'myfile.txt'
Number cached pages: 25

I'm not sure what the historical use cases for readahead past the VMA
boundaries are, but at least in some scenarios this behavior is not
desirable. For instance, many apps mmap uncompressed ELF files directly
from a page-aligned offset within a zipped APK as a space-saving and
security feature. The readahead and read-around behaviors lead to
unrelated resources from the zipped APK being populated in the page
cache. I think in this case we'll need to have more than a single
boundary per file.

A somewhat related but separate issue is that currently distinct pages
are allocated in the page cache when reading sparse file holes. I think,
at least in the read case, this should be avoidable:

sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && ./pollute_page_cache.sh

Creating sparse file of size 1GB
Apparent Size: 977M
Real Size: 0
Number cached pages: 0
Meminfo Cached: 9078768 kB
Reading 1GB of holes...
Number cached pages: 250000
Meminfo Cached: 10117324 kB

(10117324 - 9078768) / 4 = 259639 = ~250000 pages  (global counter => some noise)

--Kalesh