[LSF/MM/BPF TOPIC] Optimizing Page Cache Readahead Behavior

Hi organizers of LSF/MM,

I realize this is a late submission, but I was hoping there might
still be a chance to have this topic considered for discussion.

Problem Statement
=================

Readahead can result in unnecessary page cache pollution for mapped
regions that are never accessed. Current mechanisms to disable
readahead lack granularity: they operate at the file or VMA level
rather than on sub-regions. This proposal seeks to initiate discussion
at LSFMM to explore potential solutions for optimizing page
cache/readahead behavior.


Background
==========

The read-ahead heuristics on file-backed memory mappings can
inadvertently populate the page cache with pages corresponding to
regions that user-space processes are known never to access, e.g. ELF
LOAD segment padding regions. While these pages are ultimately
reclaimable, their presence precipitates unnecessary I/O operations,
particularly when a substantial quantity of such regions exists.

Although the underlying file can be made sparse in these regions to
mitigate I/O, readahead will still allocate discrete zero pages when
populating the page cache within these ranges. These pages, while
subject to reclaim, introduce additional churn to the LRU. This
reclaim overhead is further exacerbated in filesystems that support
"fault-around" semantics, which can populate the surrounding pages’
PTEs if those pages are found in the page cache.

While the memory impact may be negligible for large files containing a
limited number of sparse regions, it becomes appreciable for many
small mappings characterized by numerous holes. This scenario can
arise from efforts to minimize vm_area_struct slab memory footprint.

Limitations of Existing Mechanisms
==================================

fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
entire file, rather than specific sub-regions. The offset and length
parameters primarily serve the POSIX_FADV_WILLNEED [1] and
POSIX_FADV_DONTNEED [2] cases.

madvise(..., MADV_RANDOM, ...): Similarly, this applies to the entire
VMA, rather than specific sub-regions. [3]

Guard Regions: While guard regions for file-backed VMAs circumvent
fault-around concerns, the fundamental issue of unnecessary page cache
population persists. [4]
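
As a rough illustration of the first two mechanisms, here is a minimal
sketch in C. The helper names (advise_file_random, map_random,
apply_random_hints) are hypothetical and not taken from the post; only
posix_fadvise(), mmap() and madvise() are real interfaces. Neither call
gives a way to exempt just a sub-range, which is the limitation
described above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: hint random access on the open file.
 * offset = 0 and len = 0 nominally cover the whole file; per the text
 * above, the hint affects the entire file whatever range is passed.
 * posix_fadvise() returns 0 or an error number; it does not set errno.
 */
int advise_file_random(int fd)
{
    return posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}

/*
 * Hypothetical helper: map the first len bytes of fd read-only and
 * apply MADV_RANDOM to the mapping. Returns MAP_FAILED on error.
 */
void *map_random(int fd, size_t len)
{
    void *addr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

    if (addr == MAP_FAILED)
        return MAP_FAILED;
    if (madvise(addr, len, MADV_RANDOM) != 0) {
        munmap(addr, len);
        return MAP_FAILED;
    }
    return addr;
}

/*
 * Apply both hints to the file at path.
 * Returns 0 on success, -1 on failure.
 */
int apply_random_hints(const char *path)
{
    struct stat st;
    void *addr;
    int ret = -1;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) == 0 && st.st_size > 0 &&
        advise_file_random(fd) == 0) {
        addr = map_random(fd, (size_t)st.st_size);
        if (addr != MAP_FAILED) {
            munmap(addr, (size_t)st.st_size);
            ret = 0;
        }
    }
    close(fd);
    return ret;
}
```

A caller would typically use only one of the two hints, depending on
whether the file is accessed through read() or through a mapping.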

Empirical Demonstration
=======================

Below is a simple program to demonstrate the issue. Assume that the
last 20 pages of the mapping are a region known to never be accessed
(perhaps a guard region).

cachestat is a simple C program I wrote that reports the number of
cached pages (nr_cache) for the entire file using the new cachestat()
syscall [5].
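
The cachestat program itself is not included in this post. The
following is only a sketch of what the core of such a helper might look
like, not the author's code. The function name cached_pages and the
struct names cs_range and cs_result are hypothetical local mirrors of
the kernel's cachestat_range and cachestat structures, and the fallback
syscall number 451 is an assumption that holds for x86-64 and arm64.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cachestat
#define __NR_cachestat 451 /* assumption: x86-64 and arm64 number */
#endif

/* Local mirror of the kernel's struct cachestat_range. */
struct cs_range {
    uint64_t off;
    uint64_t len; /* 0 means "from off to the end of the file" */
};

/* Local mirror of the kernel's struct cachestat. */
struct cs_result {
    uint64_t nr_cache;
    uint64_t nr_dirty;
    uint64_t nr_writeback;
    uint64_t nr_evicted;
    uint64_t nr_recently_evicted;
};

/*
 * Return the number of page cache pages currently held for the whole
 * file at path, or -1 with errno set (ENOSYS on kernels that lack the
 * cachestat() syscall).
 */
long long cached_pages(const char *path)
{
    struct cs_range range = { .off = 0, .len = 0 };
    struct cs_result res = { 0 };
    long rc;
    int saved_errno;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    /* No glibc wrapper is assumed here, so use the raw syscall. */
    rc = syscall(__NR_cachestat, fd, &range, &res, 0U);
    saved_errno = errno;
    close(fd);
    if (rc != 0) {
        errno = saved_errno;
        return -1;
    }
    return (long long)res.nr_cache;
}
```

A small main() that prints the result as "nr_cache: <n>" would produce
output in the form the script below parses with grep and awk.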

cat pollute_page_cache.sh

#!/bin/bash

FILE="myfile.txt"

# 100 KiB is 25 pages, assuming a 4 KiB page size.
echo "Creating sparse file of size 25 pages"
truncate -s 100k "$FILE"

# With ls -s, column 6 is the apparent size and column 1 the allocated size.
apparent_size=$(ls -lahs "$FILE" | awk '{ print $6 }')
echo "Apparent Size: $apparent_size"

real_size=$(ls -lahs "$FILE" | awk '{ print $1 }')
echo "Real Size: $real_size"

nr_cached=$(./cachestat "$FILE" | grep nr_cache: | awk '{ print $2 }')
echo "Number cached pages: $nr_cached"

# Read only the first 20 KiB (5 pages) and discard the NUL bytes.
echo "Reading first 5 pages..."
head -c 20k "$FILE" > /dev/null

nr_cached=$(./cachestat "$FILE" | grep nr_cache: | awk '{ print $2 }')
echo "Number cached pages: $nr_cached"

rm "$FILE"

-------

./pollute_page_cache.sh
Creating sparse file of size 25 pages
Apparent Size: 100K
Real Size: 0
Number cached pages: 0
Reading first 5 pages...
Number cached pages: 25

Only the first 5 pages were read, yet all 25 pages of the file ended up
in the page cache, including the 20 that are never accessed.


Thanks,
Kalesh

[1] https://github.com/torvalds/linux/blob/v6.14-rc3/mm/fadvise.c#L96
[2] https://github.com/torvalds/linux/blob/v6.14-rc3/mm/fadvise.c#L113
[3] https://github.com/torvalds/linux/blob/v6.14-rc3/mm/madvise.c#L1277
[4] https://lore.kernel.org/r/cover.1739469950.git.lorenzo.stoakes@xxxxxxxxxx/
[5] https://lore.kernel.org/r/20230503013608.2431726-3-nphamcs@xxxxxxxxx/




