Granted, perhaps you'd only _ever_ be reading sequentially within a
specific VMA's boundaries, rather than going from one to another
(excluding PROT_NONE guards obviously) and that's very possible, if
that's what you mean. But otherwise, surely this is a thing? And might we
therefore be imposing unnecessary cache misses?

Which is why I suggest...

> So yes, sequential read of a memory mapping of a file fragmented into many
> VMAs will be somewhat slower. My impression is such use is rare (sequential
> readers tend to use read(2) rather than mmap) but I could be wrong.
>
> > What about shared libraries with r/o parts and exec parts?
> >
> > I think we'd really need to do some pretty careful checking to ensure this
> > wouldn't break some real world use cases esp. if we really do mostly
> > readahead data from page cache.
>
> So I'm not sure if you are not conflating two things here because the above
> sentence doesn't make sense to me :). Readahead is the mechanism that
> brings data from underlying filesystem into the page cache. Fault-around is
> the mechanism that maps into page tables pages present in the page cache
> although they were not possibly requested by the page fault. By "do mostly
> readahead data from page cache" are you speaking about fault-around? That
> currently does not cross VMA boundaries anyway as far as I'm reading
> do_fault_around()...

...that we test this and see how it behaves :) Which is literally all I
am saying in the above. Ideally with representative workloads.

I mean, I think this shouldn't be a controversial point right? Perhaps
again I didn't communicate this well. But this is all I mean here.

BTW, I understand the difference between readahead and fault-around, you
can run git blame on do_fault_around() if you have doubts about that ;)
And yes fault-around is constrained to the VMA (and actually avoids
crossing PTE boundaries).

> > > Regarding controlling readahead for various portions of the file - I'm
> > > skeptical. In my opinion it would require too much bookeeping on the kernel
> > > side for such a niche usecache (but maybe your numbers will show it isn't
> > > such a niche as I think :)). I can imagine you could just completely
> > > turn off kernel readahead for the file and do your special readahead from
> > > userspace - I think you could use either userfaultfd for triggering it or
> > > new fanotify FAN_PREACCESS events.
> >
> > I'm opposed to anything that'll proliferate VMAs (and from what Kalesh
> > says, he is too!) I don't really see how we could avoid having to do that
> > for this kind of case, but I may be missing something...
>
> I don't see why we would need to be increasing number of VMAs here at all.
> With FAN_PREACCESS you get notification with file & offset when it's
> accessed, you can issue readahead(2) calls based on that however you like.
> Similarly you can ask for userfaults for the whole mapped range and handle
> those. Now thinking more about this, this approach has the downside that
> you cannot implement async readahead with it (once PTE is mapped to some
> page it won't trigger notifications either with FAN_PREACCESS or with
> UFFD). But with UFFD you could at least trigger readahead on minor faults.

Yeah we're talking past each other on this, sorry I missed your point
about fanotify there! uffd is probably not reasonably workable given
overhead I would have thought.
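(For what it's worth, the readahead(2) half of what you describe is simple
enough from userspace - the interesting part is the notification plumbing.
A minimal sketch, with that plumbing waved away entirely: on_access_hint()
is a made-up name and the 2 MiB window is arbitrary, so treat this as
illustrative only:

/*
 * Sketch only: kernel readahead is switched off for the file with
 * POSIX_FADV_RANDOM and we drive our own window with readahead(2)
 * whenever we learn an offset is about to be touched.  How we learn
 * that (FAN_PREACCESS, uffd minor faults, application hints) is
 * deliberately left out.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define RA_WINDOW	(2UL << 20)	/* arbitrary 2 MiB prefetch window */

/* Hypothetical hook: called with an offset we expect to be read soon. */
static void on_access_hint(int fd, off_t offset)
{
	/* Align down to the window boundary and prefetch one window. */
	off_t start = offset & ~((off_t)RA_WINDOW - 1);

	if (readahead(fd, start, RA_WINDOW) < 0)
		perror("readahead");
}

int main(int argc, char **argv)
{
	int fd, err;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Ask the kernel not to do its own readahead for this file. */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
	if (err)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

	on_access_hint(fd, 0);	/* e.g. prefetch the first window */

	close(fd);
	return 0;
}

Whether driving it like that actually beats the kernel's own readahead is
exactly the kind of thing the representative-workload testing I mention
above ought to tell us.)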
I am really unaware of how fanotify works so I mean cool if you can find
a solution this way, awesome :)

I'm just saying, if we need to somehow retain state about regions which
should have adjusted readahead behaviour at a VMA level, I can't see how
this could be done without VMA fragmentation and I'd rather we didn't. If
we can avoid that great!

>                                                                 Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
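P.S. to make the fragmentation point concrete for anyone skimming: we
already pay for per-region readahead hints with VMAs today, because
madvise() keeps MADV_RANDOM/MADV_SEQUENTIAL in vm_flags, so hinting part
of a mapping has to split it. A quick illustrative demo (nothing to do
with any actual patch), just watch the /proc/self/maps entry count grow:

/*
 * Demo of the fragmentation concern: per-region readahead hints set with
 * madvise() are stored in the VMA itself (VM_RAND_READ/VM_SEQ_READ), so
 * hinting only part of a mapping forces a VMA split.
 */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* Count lines in /proc/self/maps, i.e. the number of VMAs we have. */
static int count_vmas(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	int c, lines = 0;

	if (!f)
		return -1;
	while ((c = fgetc(f)) != EOF)
		if (c == '\n')
			lines++;
	fclose(f);
	return lines;
}

int main(void)
{
	size_t len = 16UL << 20;	/* one 16 MiB anonymous mapping */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	printf("VMAs before hint: %d\n", count_vmas());

	/* Hint only the middle 4 MiB: the kernel must split the VMA in three. */
	madvise(p + (4UL << 20), 4UL << 20, MADV_RANDOM);

	printf("VMAs after hint:  %d\n", count_vmas());
	return 0;
}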