On Wed, Feb 20, 2019 at 04:17:13AM -0700, William Kucharski wrote:
> At present, the conventional methodology of reading a single base PAGE and
> using readahead to fill in additional pages isn't useful as the entire (in my
> prototype) PMD page needs to be read in before the page can be mapped (and at
> that point it is unclear whether readahead of additional PMD sized pages would
> be of benefit or too costly).

I remember discussing readahead with Kirill in the past.  Here's my
understanding of how it works today and why it probably doesn't work any
more once we have THPs.

We mark some page partway to the current end of the readahead window
with the ReadAhead page flag.  Once we get to it, we trigger more
readahead and move the ReadAhead flag to a page further along.

Our current RA window is on the order of 256kB.  With THPs, we're
mapping 2MB at a time.  We don't get a warning every 256kB that we're
getting close to the end of our RA window.  We only get to know every
2MB.  So either we can increase the RA window from 256kB to 2MB, or we
have to manage without RA at all.

Most systems these days have SSDs, so the whole concept of RA probably
needs to be rethought.  We should try to figure out if we care about
the performance of rotating rust, and other high-latency systems like
long-distance networking and USB sticks.  (I was going to say network
filesystems in general, but then I remembered clameter's example of
400Gbps networks being faster than DRAM, so local networks clearly
aren't a problem any more.)

Maybe there's scope for a session on readahead in general, but I don't
know who'd bring data and argue for what actions based on it.

> Additionally, there are no good interfaces at present to tell filesystem layers
> that content is desired in chunks larger than a hardcoded limit of 64K, or
> to read disk blocks in chunks appropriate for PMD sized pages.

Right!  It's actually slightly worse than that.
The VFS allocates pages on behalf of the filesystem and tells the
filesystem to read them.  So there's no way to allow the filesystem to
say "Actually, I'd rather read in 32kB chunks because that's how big my
compression blocks are".  See page_cache_read() and
__do_page_cache_readahead().

I've mentioned in the past my preferred interface for solving this is
to have a new address space operation called ->populate().  The VFS
would call this from both of the above functions, allowing a filesystem
to allocate, say, an order-3 page and place it in i_pages before
starting IO on it.  I haven't done any work towards this, though.  And
my opinion on it might change after having written some code.

That interface would need to have some hint from the VFS as to what
range of file offsets it's looking for, and which page is the critical
one.  Maybe that's as simple as passing in pgoff and order, where pgoff
is not necessarily aligned to 1<<order.  Or maybe we want to explicitly
pass in start, end, critical.

I'm in favour of William attending LSFMM, for whatever my opinion is
worth.  Also Kirill, of course.
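
The pgoff-and-order form of the hint could be sketched in userspace C
roughly as below.  Everything in it is an assumption made for
illustration: struct populate_hint, populate_hint_from() and the field
names are hypothetical, not existing kernel interfaces, and a real
->populate() might well take different arguments.  It only shows how an
unaligned pgoff plus an order is enough to recover the explicit
start, end, critical triple.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical hint; not an existing kernel structure. */
struct populate_hint {
	unsigned long start;	/* first page offset of the aligned range */
	unsigned long end;	/* one past the last page offset in the range */
	unsigned long critical;	/* page offset the caller actually needs */
};

/*
 * Derive the explicit range from an unaligned pgoff and an order.
 * pgoff need not be aligned to 1 << order; the range is the naturally
 * aligned block of 1 << order pages that contains it.
 */
static struct populate_hint populate_hint_from(unsigned long pgoff,
					       unsigned int order)
{
	struct populate_hint hint;
	unsigned long nr_pages;

	/* Shifting by the full width of the type would be undefined. */
	assert(order < sizeof(unsigned long) * CHAR_BIT);

	nr_pages = 1UL << order;
	hint.start = pgoff & ~(nr_pages - 1);	/* round down to the block */
	hint.end = hint.start + nr_pages;
	hint.critical = pgoff;
	return hint;
}
```

With 4kB base pages, order 9 is a PMD-sized (2MB) range, so
populate_hint_from(1234, 9) gives start 1024, end 1536 and critical
1234.  Passing start, end, critical explicitly instead would also allow
ranges that are not a naturally aligned power of two, which this form
cannot express.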