Re: [LSF/MM TOPIC ][LSF/MM ATTEND] Read-only Mapping of Program Text using Large THP Pages

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Wed, 20 Feb 2019 05:44:54 -0800

On Wed, Feb 20, 2019 at 04:17:13AM -0700, William Kucharski wrote:
> At present, the conventional methodology of reading a single base PAGE and
> using readahead to fill in additional pages isn't useful as the entire (in my
> prototype) PMD page needs to be read in before the page can be mapped (and at
> that point it is unclear whether readahead of additional PMD sized pages would
> be of benefit or too costly).

I remember discussing readahead with Kirill in the past.  Here's my
understanding of how it works today and why it probably doesn't work any
more once we have THPs.  We mark some page partway to the current end of
the readahead window with the ReadAhead page flag.  Once we get to it,
we trigger more readahead and change the location of the page with the
ReadAhead flag.  Our current RA window is on the order of 256kB.

With THPs, we're mapping 2MB at a time.  We don't get a warning every
256kB that we're getting close to the end of our RA window.  We only get
to know every 2MB.  So either we can increase the RA window from 256kB
to 2MB, or we have to manage without RA at all.
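
To put rough numbers on that, here is a minimal userspace sketch (not
kernel code, and not from this thread).  The 256kB window and 2MB mapping
size come from the text above; the 4kB base page size and both function
names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define KB(x) ((size_t)(x) << 10)
#define MB(x) ((size_t)(x) << 20)

/*
 * Hypothetical helper: how many readahead windows are consumed by one
 * mapping step of map_size bytes (rounded up).
 */
size_t ra_windows_per_step(size_t map_size, size_t ra_window)
{
	assert(ra_window > 0);
	return (map_size + ra_window - 1) / ra_window;
}

/*
 * Hypothetical helper: how many mapping steps fall inside one readahead
 * window, i.e. how many chances there are to land on the page carrying
 * the ReadAhead flag before the window runs out.
 */
size_t steps_per_ra_window(size_t map_size, size_t ra_window)
{
	assert(map_size > 0);
	return ra_window / map_size;
}
```

With 4kB steps, steps_per_ra_window(KB(4), KB(256)) gives 64 chances per
window to hit the marked page.  With 2MB steps the same call gives 0, and
ra_windows_per_step(MB(2), KB(256)) shows each step swallowing 8 whole
windows at once.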

Most systems these days have SSDs, so the whole concept of RA probably
needs to be rethought.  We should try to figure out if we care about
the performance of rotating rust, and other high-latency systems like
long-distance networking and USB sticks.  (I was going to say network
filesystems in general, but then I remembered clameter's example of
400Gbps networks being faster than DRAM, so local networks clearly aren't
a problem any more.)

Maybe there's scope for a session on readahead in general, but I don't
know who'd bring data and argue for what actions based on it.

> Additionally, there are no good interfaces at present to tell filesystem layers
> that content is desired in chunks larger than a hardcoded limit of 64K, or
> to read disk blocks in chunks appropriate for PMD sized pages.

Right!  It's actually slightly worse than that.  The VFS allocates
pages on behalf of the filesystem and tells the filesystem to read them.
So there's no way to allow the filesystem to say "Actually, I'd rather
read in 32kB chunks because that's how big my compression blocks are".
See page_cache_read() and __do_page_cache_readahead().

I've mentioned in the past my preferred interface for solving this is to
have a new address space operation called ->populate().  The VFS would
call this from both of the above functions, allowing a filesystem to
allocate, say, an order-3 page and place it in i_pages before starting
IO on it.  I haven't done any work towards this, though.  And my opinion
on it might change after having written some code.
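
For reference, the relation between a chunk size and a page order can be
sketched as below.  This is a userspace illustration only; order_for_size()
is a hypothetical name (the kernel has its own get_order()), and the 4kB
base page size in the note that follows is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: smallest order such that (page_size << order)
 * covers at least `bytes` bytes.
 */
unsigned int order_for_size(size_t bytes, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	/* Keep the doubling below from overflowing size_t. */
	assert(bytes > 0 && page_size > 0 && bytes <= SIZE_MAX / 2);

	while (span < bytes) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Assuming 4kB base pages, order_for_size(32 * 1024, 4096) returns 3, which
is why a filesystem with 32kB compression blocks would want an order-3
page, and order_for_size(2 * 1024 * 1024, 4096) returns 9 for a PMD-sized
page.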

That interface would need to have some hint from the VFS as to what
range of file offsets it's looking for, and which page is the critical
one.  Maybe that's as simple as passing in pgoff and order, where pgoff is
not necessarily aligned to 1<<order.  Or maybe we want to explicitly
pass in start, end, critical.
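
The two shapes of hint carry the same information if the range is taken
to be the naturally aligned 1<<order block containing pgoff.  A minimal
userspace sketch of that conversion follows; struct populate_range and
populate_range_from_order() are hypothetical names invented here, not an
existing or proposed kernel API, and the natural-alignment rule is an
assumption.

```c
#include <assert.h>

typedef unsigned long pgoff_t;	/* same underlying type as the kernel's pgoff_t */

/* Hypothetical explicit form of the hint: [start, end) plus the critical page. */
struct populate_range {
	pgoff_t start;		/* first page index wanted */
	pgoff_t end;		/* one past the last page index wanted */
	pgoff_t critical;	/* page the caller is actually waiting for */
};

/*
 * Expand a (pgoff, order) hint into the explicit form.  pgoff need not be
 * aligned to 1 << order; it is rounded down to find the start of the
 * range and kept unchanged as the critical page.
 */
struct populate_range populate_range_from_order(pgoff_t pgoff, unsigned int order)
{
	struct populate_range r;
	pgoff_t nr_pages;

	assert(order < sizeof(pgoff_t) * 8);
	nr_pages = (pgoff_t)1 << order;

	r.start = pgoff & ~(nr_pages - 1);
	r.end = r.start + nr_pages;
	r.critical = pgoff;
	return r;
}
```

For example, pgoff 1029 with order 9 expands to start 1024, end 1536,
critical 1029.  The explicit form is more general, since it can also
describe ranges that are not a power of two in size or not naturally
aligned.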

I'm in favour of William attending LSFMM, for whatever my opinion
is worth.  Also Kirill, of course.