On Wed, Jul 21, 2021 at 10:38:59PM +0530, Shyam Prasad N wrote:
> Consider a scenario where a user/application issues readahead/fadvise
> for large data ranges in advance (informing the kernel that it intends
> to read those ranges soon). Depending on how much data these calls
> cover, they could keep the network quite busy for a network filesystem
> (or the disk for a block filesystem).
>
> I see some value in filesystems having the ability to differentiate
> these reads from regular buffered reads by users. A filesystem could
> then choose to throttle the readahead reads so that a specified amount
> of bandwidth remains available for regular reads.
>
> I wanted to get your opinions on this, and on whether it can already
> be done in the VFS ->readahead and ->readpage calls in the
> filesystems.

This is something I have an interest in, but haven't had time to
pursue. The readahead code gets this information because the page
cache calls page_cache_sync_ra() if it needs the page right now, and
page_cache_async_ra() if it thinks it will need the page in the
future. ondemand_readahead() currently gets a true/false parameter
(hit_readahead_marker), although my folio patches change that to pass
in a folio or NULL. That information is then *not* passed to the
filesystem, but it could be carried in the ractl (there's a sketch of
one way to do that at the end of this mail).

There's also some tidying-up to be done around faulting. Currently
fault-around has no way to express "read me all the pages around page
N". Instead it just assumes that pages N-R/2 to N+R/2 are the right
ones to fetch, when it should be left up to the filesystem or the
readahead code to determine which window of pages to fetch.

Another thing I have an interest in doing, but haven't had the
opportunity to pursue, is making ->readpage synchronous. The current
MM code always calls ->readahead first and only calls ->readpage if
->readahead fails. That means all the async ->readpage work is
actually wrong: we want to return the best error possible from
->readpage, even if that means sleeping.

Oh ... except for swap. For NFS only, swap calls ->readpage, so NFS
really wants ->readpage to be async so it can kick off reads for
multiple pages and then wait for the one it actually needs. That gets
into a conversation about how much we really care about swap-over-NFS,
whether swap should be using ->readpage or ->direct_IO, and whether
swap should use the file readahead code or its own virtual-address
based readahead code. Most of those discussions are outside my area
of expertise.
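
To make the ractl idea concrete, here is a rough sketch of carrying
the sync/async distinction in the ractl. To be clear, the "async"
field and netfs_example_readahead() are made up for illustration;
today's struct readahead_control in include/linux/pagemap.h has no
such field, and page_cache_sync_ra()/page_cache_async_ra() would need
to set it.

/*
 * Fields as in include/linux/pagemap.h, plus one hypothetical field
 * that would be set by page_cache_async_ra() and cleared by
 * page_cache_sync_ra(), telling the filesystem whether anybody is
 * waiting on this read right now.
 */
struct readahead_control {
	struct file *file;
	struct address_space *mapping;
	struct file_ra_state *ra;
	bool async;	/* hypothetical: true for speculative readahead */
/* private: internal use only */
	pgoff_t _index;
	unsigned int _nr_pages;
	unsigned int _batch_count;
};

/* A network filesystem could then throttle only the speculative I/O: */
static void netfs_example_readahead(struct readahead_control *ractl)
{
	if (ractl->async)
		netfs_example_throttle(ractl->file);	/* made-up throttle */

	/* ... issue reads for readahead_count(ractl) pages ... */
}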
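
On the fault-around point, the current behaviour is essentially the
fixed window described above; nothing in it consults the filesystem.
The sketch below is simplified from do_fault_around() in mm/memory.c
(the real code also aligns the window and stays within one PMD):

/*
 * Simplified sketch of today's fault-around window: a fixed span of
 * fault_around_bytes around the faulting page N, clamped to the VMA.
 */
static void fault_around_window(pgoff_t N, pgoff_t vma_start,
				pgoff_t vma_end, pgoff_t *start,
				pgoff_t *end)
{
	pgoff_t R = fault_around_bytes >> PAGE_SHIFT;

	*start = max(N > R / 2 ? N - R / 2 : 0, vma_start);
	*end = min(*start + R, vma_end);
}

An interface that instead asked the filesystem or the readahead code
"which window around page N do you want mapped?" would replace this
fixed computation; no such hook exists today.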
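
The synchronous ->readpage idea, as a sketch. read_page_sync() is a
made-up name, but the locking protocol is the real one (page locked on
entry, unlocked on read completion). It also shows why an async
->readpage loses information: a wrapper like this can only turn a
failed read into -EIO, whereas a synchronous ->readpage could return
the filesystem's real error.

/*
 * Sketch: issue the read and sleep until it completes.  The page
 * must be locked on entry; it is unlocked when the read finishes.
 */
static int read_page_sync(struct file *file, struct page *page)
{
	int err = page->mapping->a_ops->readpage(file, page);

	if (err)
		return err;

	wait_on_page_locked(page);	/* sleep until I/O completes */

	/* The completion path can't tell us *why* the read failed. */
	if (!PageUptodate(page))
		return -EIO;
	return 0;
}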
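
And for contrast, the pattern swap-over-NFS wants from an async
->readpage: start reads for a whole batch, then block only on the page
the fault actually needs. Everything here is illustrative;
swap_batch_read() and its calling convention are invented.

/*
 * Sketch: kick off nr reads (each page locked by the caller), then
 * wait only for the page this fault needs.  With a synchronous
 * ->readpage, this loop would serialise all nr reads instead.
 */
static int swap_batch_read(struct file *swap_file, struct page **pages,
			   int nr, int wanted)
{
	const struct address_space_operations *a_ops =
		swap_file->f_mapping->a_ops;
	int i;

	for (i = 0; i < nr; i++)
		a_ops->readpage(swap_file, pages[i]);	/* async: returns early */

	wait_on_page_locked(pages[wanted]);
	return PageUptodate(pages[wanted]) ? 0 : -EIO;
}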