On Mon, Feb 24, 2025 at 1:36 PM Kalesh Singh <kaleshsingh@xxxxxxxxxx> wrote:
>
> On Mon, Feb 24, 2025 at 8:52 AM Lorenzo Stoakes
> <lorenzo.stoakes@xxxxxxxxxx> wrote:
> >
> > On Mon, Feb 24, 2025 at 05:31:16PM +0100, Jan Kara wrote:
> > > On Mon 24-02-25 14:21:37, Lorenzo Stoakes wrote:
> > > > On Mon, Feb 24, 2025 at 03:14:04PM +0100, Jan Kara wrote:
> > > > > Hello!
> > > > >
> > > > > On Fri 21-02-25 13:13:15, Kalesh Singh via Lsf-pc wrote:
> > > > > > Problem Statement
> > > > > > ===============
> > > > > >
> > > > > > Readahead can result in unnecessary page cache pollution for mapped
> > > > > > regions that are never accessed. Current mechanisms to disable
> > > > > > readahead lack granularity and operate at the file or VMA level
> > > > > > rather than on specific sub-regions. This proposal seeks to
> > > > > > initiate discussion at LSFMM to explore potential solutions for
> > > > > > optimizing page cache/readahead behavior.
> > > > > >
> > > > > > Background
> > > > > > =========
> > > > > >
> > > > > > The readahead heuristics on file-backed memory mappings can
> > > > > > inadvertently populate the page cache with pages corresponding to
> > > > > > regions that user-space processes are known never to access, e.g.
> > > > > > ELF LOAD segment padding regions. While these pages are ultimately
> > > > > > reclaimable, their presence precipitates unnecessary I/O
> > > > > > operations, particularly when a substantial quantity of such
> > > > > > regions exists.
> > > > > >
> > > > > > Although the underlying file can be made sparse in these regions
> > > > > > to mitigate I/O, readahead will still allocate discrete zero pages
> > > > > > when populating the page cache within these ranges. These pages,
> > > > > > while subject to reclaim, introduce additional churn to the LRU.
> > > > > > This reclaim overhead is further exacerbated in filesystems that
> > > > > > support "fault-around" semantics, which can populate the
> > > > > > surrounding pages' PTEs if they are found present in the page
> > > > > > cache.
> > > > > >
> > > > > > While the memory impact may be negligible for large files
> > > > > > containing a limited number of sparse regions, it becomes
> > > > > > appreciable for many small mappings characterized by numerous
> > > > > > holes. This scenario can arise from efforts to minimize
> > > > > > vm_area_struct slab memory footprint.
>
> Hi Jan, Lorenzo, thanks for the comments.
>
> > > > > OK, I agree the behavior you describe exists. But do you have some
> > > > > real-world numbers showing its extent? I'm not looking for
> > > > > artificial numbers - sure, bad cases can be constructed - but how
> > > > > big a practical problem is this? If you can show that the average
> > > > > Android phone has 10% of these useless pages in memory, then that's
> > > > > one thing and we should be looking for some general solution. If it
> > > > > is more like 0.1%, then why bother?
>
> Once I revert a workaround that we currently have to avoid fault-around
> for these regions (we don't have an out-of-tree solution to prevent the
> page cache population), our CI, which checks memory usage after
> performing some common app user journeys, reports regressions as shown
> in the snippet below. Note that the increases here are only for the
> populated PTEs (bounded by the VMA), so the actual pollution is
> theoretically larger.
>
> Metric: perfetto_media.extractor#file-rss-avg
> Increased by 7.495 MB (32.7%)
>
> Metric: perfetto_/system/bin/audioserver#file-rss-avg
> Increased by 6.262 MB (29.8%)
>
> Metric: perfetto_/system/bin/mediaserver#file-rss-max
> Increased by 8.325 MB (28.0%)
>
> Metric: perfetto_/system/bin/mediaserver#file-rss-avg
> Increased by 8.198 MB (28.4%)
>
> Metric: perfetto_media.extractor#file-rss-max
> Increased by 7.95 MB (33.6%)
>
> Metric: perfetto_/system/bin/incidentd#file-rss-avg
> Increased by 0.896 MB (20.4%)
>
> Metric: perfetto_/system/bin/audioserver#file-rss-max
> Increased by 6.883 MB (31.9%)
>
> Metric: perfetto_media.swcodec#file-rss-max
> Increased by 7.236 MB (34.9%)
>
> Metric: perfetto_/system/bin/incidentd#file-rss-max
> Increased by 1.003 MB (22.7%)
>
> Metric: perfetto_/system/bin/cameraserver#file-rss-avg
> Increased by 6.946 MB (34.2%)
>
> Metric: perfetto_/system/bin/cameraserver#file-rss-max
> Increased by 7.205 MB (33.8%)
>
> Metric: perfetto_com.android.nfc#file-rss-max
> Increased by 8.525 MB (9.8%)
>
> Metric: perfetto_/system/bin/surfaceflinger#file-rss-avg
> Increased by 3.715 MB (3.6%)
>
> Metric: perfetto_media.swcodec#file-rss-avg
> Increased by 5.096 MB (27.1%)
>
> [...]
>
> The issue is widespread across processes because, in order to support
> larger page sizes, Android requires that ELF segments be at least 16KB
> aligned, which leads to the padding regions (never accessed).
>
> > > > > > Limitations of Existing Mechanisms
> > > > > > ===========================
> > > > > >
> > > > > > fadvise(..., POSIX_FADV_RANDOM, ...): disables readahead for the
> > > > > > entire file, rather than specific sub-regions. The offset and
> > > > > > length parameters primarily serve the POSIX_FADV_WILLNEED [1] and
> > > > > > POSIX_FADV_DONTNEED [2] cases.
> > > > > >
> > > > > > madvise(..., MADV_RANDOM, ...): similarly, this applies to the
> > > > > > entire VMA, rather than specific sub-regions. [3]
> > > > > >
> > > > > > Guard Regions: while guard regions for file-backed VMAs circumvent
> > > > > > fault-around concerns, the fundamental issue of unnecessary page
> > > > > > cache population persists. [4]
> > > > >
> > > > > Somewhere else in the thread you complain about readahead extending
> > > > > past the VMA. That's relatively easy to avoid, at least for
> > > > > readahead triggered from filemap_fault() (i.e.,
> > > > > do_async_mmap_readahead() and do_sync_mmap_readahead()). I agree we
> > > > > could do that, and it seems like a relatively uncontroversial
> > > > > change. Note that if someone accesses the file through the standard
> > > > > read(2) or write(2) syscalls or through a different memory mapping,
> > > > > the limits won't apply, but such combinations of access are not
> > > > > that common anyway.
> > > >
> > > > Hm, I'm not so sure - map ELF files with different mprotect() calls,
> > > > or mprotect() different portions of a file, and suddenly you lose
> > > > all the readahead for the rest even though you're reading
> > > > sequentially?
> > >
> > > Well, you wouldn't lose all readahead for the rest. Readahead just
> > > won't preread data underlying the next VMA, so yes, you get a cache
> > > miss and have to wait for a page to get loaded into the cache when
> > > transitioning to the next VMA, but once you get there, you'll have
> > > readahead running at full speed again.
> >
> > I'm aware of how readahead works (I _believe_ there's currently a
> > pre-release of a book with a very extensive section on readahead
> > written by somebody :P).
> >
> > Also been looking at it for file-backed guard regions recently, which
> > is why I've been commenting here specifically as it's been on my mind
> > lately, and also Kalesh's interest in this stems from a guard region
> > 'scenario' (hence my cc).
> >
> > Anyway, perhaps I didn't phrase this well - my concern is whether this
> > might impact performance in real-world scenarios, such as one where a
> > file is mapped then mprotect()'d or mmap()'d in parts, causing
> > _separate VMAs_ of the same file, read in sequential order.
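To make that scenario concrete, here is a minimal sketch (untested; the
file path and sizes are invented) of how a single mapping of a file
becomes several VMAs of the same file, i.e. the case where readahead
clamped to the faulting VMA would reset at each boundary:

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical file; assume it is at least 16 pages long. */
    int fd = open("/tmp/libfoo.so", O_RDONLY);
    if (fd < 0)
        exit(1);

    size_t pg = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 16 * pg, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        exit(1);

    /*
     * One VMA so far. Changing protections on a middle chunk splits
     * it into three VMAs backed by the same file:
     *   [0, 4pg) r--p, [4pg, 8pg) r-xp, [8pg, 16pg) r--p
     * as /proc/self/maps will show.
     */
    mprotect(p + 4 * pg, 4 * pg, PROT_READ | PROT_EXEC);

    /*
     * A sequential read of p[0 .. 16pg) now crosses two VMA
     * boundaries; readahead limited to the faulting VMA would take a
     * cache miss at each of them.
     */
    return 0;
}

Whether those extra misses matter in practice is exactly the "test this
with representative workloads" question raised below.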
> > From Kalesh's LPC talk, unless I misinterpreted what he said, this is
> > precisely what he's doing? I mean, we'd not be talking here about
> > mmap() behaviour with readahead otherwise.
> >
> > Granted, perhaps you'd only _ever_ be reading sequentially within a
> > specific VMA's boundaries, rather than going from one to another
> > (excluding PROT_NONE guards obviously) and that's very possible, if
> > that's what you mean.
> >
> > But otherwise, surely this is a thing? And might we therefore be
> > imposing unnecessary cache misses?
> >
> > Which is why I suggest...
> >
> > > So yes, a sequential read of a memory mapping of a file fragmented
> > > into many VMAs will be somewhat slower. My impression is that such
> > > use is rare (sequential readers tend to use read(2) rather than
> > > mmap) but I could be wrong.
> > >
> > > > What about shared libraries with r/o parts and exec parts?
> > > >
> > > > I think we'd really need to do some pretty careful checking to
> > > > ensure this wouldn't break some real-world use cases, esp. if we
> > > > really do mostly readahead data from the page cache.
> > >
> > > So I'm not sure you aren't conflating two things here, because the
> > > above sentence doesn't make sense to me :). Readahead is the
> > > mechanism that brings data from the underlying filesystem into the
> > > page cache. Fault-around is the mechanism that maps into the page
> > > tables pages already present in the page cache, even though they
> > > were not necessarily requested by the page fault. By "do mostly
> > > readahead data from page cache" are you speaking about fault-around?
> > > That currently does not cross VMA boundaries anyway, as far as I'm
> > > reading do_fault_around()...
> >
> > ...that we test this and see how it behaves :) Which is literally all
> > I am saying in the above. Ideally with representative workloads.
> >
> > I mean, I think this shouldn't be a controversial point, right?
> > Perhaps again I didn't communicate this well. But this is all I mean
> > here.
> >
> > BTW, I understand the difference between readahead and fault-around;
> > you can run git blame on do_fault_around() if you have doubts about
> > that ;)
> >
> > And yes, fault-around is constrained to the VMA (and actually avoids
> > crossing PTE boundaries).
> >
> > > > > Regarding controlling readahead for various portions of the file
> > > > > - I'm skeptical. In my opinion it would require too much
> > > > > bookkeeping on the kernel side for such a niche use case (but
> > > > > maybe your numbers will show it isn't as niche as I think :)). I
> > > > > can imagine you could just completely turn off kernel readahead
> > > > > for the file and do your special readahead from userspace - I
> > > > > think you could use either userfaultfd for triggering it or the
> > > > > new fanotify FAN_PREACCESS events.
>
> Something like this would be ideal for the use case where uncompressed
> ELF files are mapped directly from zipped APKs without extracting them.
> (I don't have any real-world numbers for this case atm.) I also don't
> know whether the cache miss on the subsequent VMAs has significant
> overhead in practice... I'll try to collect some data for this.
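FWIW, the simplest form of the "special readahead from userspace" Jan
describes might look like the sketch below (untested; the segment
offsets are placeholders): kernel heuristics are switched off for the
whole file, and only the ranges known to be useful are prefetched, so
the padding holes never enter the page cache via readahead.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Sketch: suppress heuristic readahead for the file, then explicitly
 * prefetch only the ranges that will actually be touched (e.g. the
 * real ELF segment payloads), skipping the alignment padding. The
 * offsets below are invented for illustration.
 */
static void prefetch_wanted(int fd)
{
    /* Disable the kernel's readahead heuristic for this fd. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    readahead(fd, 0x0, 0x4000);     /* first segment payload */
    /* 0x4000..0x8000 is alignment padding: never prefetched */
    readahead(fd, 0x8000, 0x4000);  /* next segment payload */
}

The missing piece, as discussed above, is the trigger that tells such a
userspace component which (file, offset) to prefetch next - that would
be the FAN_PREACCESS or uffd side of the scheme.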
> > > > I'm opposed to anything that'll proliferate VMAs (and from what
> > > > Kalesh says, he is too!) I don't really see how we could avoid
> > > > having to do that for this kind of case, but I may be missing
> > > > something...
> > >
> > > I don't see why we would need to be increasing the number of VMAs
> > > here at all. With FAN_PREACCESS you get a notification with file &
> > > offset when it's accessed; you can issue readahead(2) calls based on
> > > that however you like. Similarly, you can ask for userfaults for the
> > > whole mapped range and handle those. Now, thinking more about this,
> > > this approach has the downside that you cannot implement async
> > > readahead with it (once a PTE is mapped to some page, it won't
> > > trigger notifications with either FAN_PREACCESS or UFFD). But with
> > > UFFD you could at least trigger readahead on minor faults.
> >
> > Yeah, we're talking past each other on this, sorry I missed your point
> > about fanotify there!
> >
> > uffd is probably not reasonably workable given the overhead, I would
> > have thought.
> >
> > I am really unaware of how fanotify works, so I mean cool if you can
> > find a solution this way, awesome :)
> >
> > I'm just saying, if we need to somehow retain state about regions
> > which should have adjusted readahead behaviour at a VMA level, I can't
> > see how this could be done without VMA fragmentation, and I'd rather
> > we didn't.
> >
> > If we can avoid that, great!
>
> Another possible way we can look at this: in the regressions shared
> above from the ELF padding regions, we are able to make these regions
> sparse (for *almost* all cases), so solving the shared zero page
> problem for file mappings would also eliminate much of this overhead.
> Perhaps we should tackle that angle instead, if it's a more tangible
> solution?
>
> From the previous discussions that Matthew shared [7], it seems Dave
> proposed, as an alternative to moving the extents to the VFS layer,
> inverting the IO read path operations [8]. Maybe this is a more
> approachable solution, since there is precedent for the same in the
> write path?
>
> [7] https://lore.kernel.org/linux-fsdevel/Zs97qHI-wA1a53Mm@xxxxxxxxxxxxxxxxxxxx/
> [8] https://lore.kernel.org/linux-fsdevel/ZtAPsMcc3IC1VaAF@xxxxxxxxxxxxxxxxxxx/

+ cc: Dave Chinner

> Thanks,
> Kalesh
>
> > > Honza
> > > --
> > > Jan Kara <jack@xxxxxxxx>
> > > SUSE Labs, CR
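One note on the sparse-file angle above: for completeness, punching out
the padding at build/install time would look something like this sketch
(untested; fd, offset and length are placeholders). It removes the block
I/O, but, as the background section notes, readahead still instantiates
distinct zeroed page cache pages over the hole, so on its own it only
addresses part of the problem.

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Sketch: make an ELF padding region sparse. FALLOC_FL_PUNCH_HOLE must
 * be combined with FALLOC_FL_KEEP_SIZE, so the file size (and hence the
 * mapping layout) is unchanged; reads of the hole return zeroes without
 * any disk I/O.
 */
static int punch_padding(int fd, off_t pad_off, off_t pad_len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     pad_off, pad_len);
}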