On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote:
> On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > > Hello!
> > >
> > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > > Problem Statement
> > > > ===============
> > > >
> > > > Readahead can result in unnecessary page cache pollution for mapped
> > > > regions that are never accessed. Current mechanisms to disable
> > > > readahead lack granularity and rather operate at the file or VMA
> > > > level. This proposal seeks to initiate discussion at LSFMM to explore
> > > > potential solutions for optimizing page cache/readahead behavior.
> > > >
> > > >
> > > > Background
> > > > =========
> > > >
> > > > The read-ahead heuristics on file-backed memory mappings can
> > > > inadvertently populate the page cache with pages corresponding to
> > > > regions that user-space processes are known never to access e.g ELF
> > > > LOAD segment padding regions. While these pages are ultimately
> > > > reclaimable, their presence precipitates unnecessary I/O operations,
> > > > particularly when a substantial quantity of such regions exists.
> > > >
> > > > Although the underlying file can be made sparse in these regions to
> > > > mitigate I/O, readahead will still allocate discrete zero pages when
> > > > populating the page cache within these ranges. These pages, while
> > > > subject to reclaim, introduce additional churn to the LRU. This
> > > > reclaim overhead is further exacerbated in filesystems that support
> > > > "fault-around" semantics, that can populate the surrounding pages’
> > > > PTEs if found present in the page cache.
> > > >
> > > > While the memory impact may be negligible for large files containing a
> > > > limited number of sparse regions, it becomes appreciable for many
> > > > small mappings characterized by numerous holes.
> > > > This scenario can arise from efforts to minimize vm_area_struct slab
> > > > memory footprint.
> > >
> > > OK, I agree the behavior you describe exists. But do you have some
> > > real-world numbers showing its extent? I'm not looking for some artificial
> > > numbers - sure bad cases can be constructed - but how big a practical
> > > problem is this? If you can show that an average Android phone has 10% of
> > > these useless pages in memory then that's one thing and we should be
> > > looking for some general solution. If it is more like 0.1%, then why
> > > bother?
> > >
> > > > Limitations of Existing Mechanisms
> > > > ===========================
> > > >
> > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables read-ahead for the
> > > > entire file, rather than specific sub-regions. The offset and length
> > > > parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > POSIX_FADV_DONTNEED [2] cases.
> > > >
> > > > madvise(..., MADV_RANDOM, ...): Similarly, this applies on the entire
> > > > VMA, rather than specific sub-regions. [3]
> > > > Guard Regions: While guard regions for file-backed VMAs circumvent
> > > > fault-around concerns, the fundamental issue of unnecessary page cache
> > > > population persists. [4]
> > >
> > > Somewhere else in the thread you complain about readahead extending past
> > > the VMA. That's relatively easy to avoid at least for readahead triggered
> > > from filemap_fault() (i.e., do_async_mmap_readahead() and
> > > do_sync_mmap_readahead()). I agree we could do that and that seems like a
> > > relatively uncontroversial change. Note that if someone accesses the file
> > > through standard read(2) or write(2) syscall or through different memory
> > > mapping, the limits won't apply but such combinations of access are not
> > > that common anyway.
> >
> > Hm I'm not so sure, map ELF files with different mprotect(), or mprotect()
> > different portions of a file and suddenly you lose all the readahead for the
> > rest even though you're reading sequentially?
>
> Well, you wouldn't lose all readahead for the rest. Just readahead won't
> preread data underlying the next VMA so yes, you get a cache miss and have
> to wait for a page to get loaded into cache when transitioning to the next
> VMA but once you get there, you'll have readahead running at full speed
> again.

I'm aware of how readahead works (I _believe_ there's currently a pre-release of
a book with a very extensive section on readahead written by somebody :P).

Also been looking at it for file-backed guard regions recently, which is why
I've been commenting here specifically as it's been on my mind lately, and also
Kalesh's interest in this stems from a guard region 'scenario' (hence my cc).

Anyway perhaps I didn't phrase this well - my concern is whether this might
impact performance in real world scenarios, such as one where a VMA is mapped
then mprotect()'d or mmap()'d in parts causing _separate VMAs_ of the same
file, in sequential order.
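
To make the 'separate VMAs of the same file' case concrete, here is a minimal
userspace sketch. It is only an illustration under stated assumptions: the
helper names count_vmas() and vma_split_demo() and the /tmp scratch path are
made up for this example, and it assumes a Linux system with /proc mounted. It
maps three pages of one file, mprotect()s the middle page, and counts the
entries in /proc/self/maps before and after. It does not measure readahead at
all; it only shows how easily one mapping turns into several adjacent VMAs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the /proc/self/maps entries lying entirely inside [start, end). */
int count_vmas(uintptr_t start, uintptr_t end)
{
    char line[4096];
    int count = 0;
    FILE *maps = fopen("/proc/self/maps", "r");

    assert(start < end);
    if (!maps)
        return -1;
    while (fgets(line, sizeof(line), maps)) {
        uintptr_t lo, hi;

        /* Each entry starts with "<start>-<end>" in hex. */
        if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &lo, &hi) == 2 &&
            lo >= start && hi <= end)
            count++;
    }
    fclose(maps);
    return count;
}

/*
 * Hypothetical demo helper: map three pages of a scratch file read-only,
 * then mprotect() the middle page to PROT_NONE. Stores the number of VMAs
 * covering the range before and after the mprotect() call.
 * Returns 0 on success, -1 on any failure.
 */
int vma_split_demo(int *before, int *after)
{
    char path[] = "/tmp/vma-split-XXXXXX"; /* assumed to be writable */
    long page = sysconf(_SC_PAGESIZE);
    size_t len;
    char *map;
    int fd, ret = -1;

    if (page <= 0)
        return -1;
    len = 3 * (size_t)page;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file lives on until the fd and mapping go away */

    if (ftruncate(fd, (off_t)len) != 0)
        goto out_close;

    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;

    *before = count_vmas((uintptr_t)map, (uintptr_t)map + len);

    /* Changing protection on the middle page splits the mapping. */
    if (mprotect(map + page, (size_t)page, PROT_NONE) == 0) {
        *after = count_vmas((uintptr_t)map, (uintptr_t)map + len);
        ret = (*before < 0 || *after < 0) ? -1 : 0;
    }

    munmap(map, len);
out_close:
    close(fd);
    return ret;
}
```

Calling vma_split_demo(&before, &after) from a small main() should report one
VMA before the mprotect() and three after it, all backed by the same file at
consecutive offsets. That is the layout I have in mind above: a sequential
reader walking across it crosses two VMA boundaries within three pages.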