On Sun, Jan 28, 2024 at 8:48 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > >
> > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > On Sun, Jan 28 2024 at 5:02P -0500,
> > > > Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > >
> > > Understood. But ... the application is asking for as much readahead as
> > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > a time". So why will we not get a bug report in 1-15 years time saying
> > > "I put a limit on readahead and the kernel is ignoring it"? I think
> > > typically we allow the sysadmin to override application requests,
> > > don't we?
> >
> > The application isn't knowingly asking for readahead. It is asking to
> > mmap the file (and reporter wants it done as quickly as possible..
> > like occurred before).
>
> .. which we do within the constraints of the given configuration.
>
> > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > request size based on read-ahead setting") -- same logic, just applied
> > to callchain that ends up using madvise(MADV_WILLNEED).
>
> Not really. There is a difference between performing a synchronous
> read IO here that we must complete, compared to optimistic
> asynchronous read-ahead which we can fail or toss away without the
> user ever seeing the data the IO returned.
>
> We want required IO to be done in as few, larger IOs as possible,
> and not be limited by constraints placed on background optimistic
> IOs.
>
> madvise(WILLNEED) is optimistic IO - there is no requirement that it
> complete the data reads successfully. If the data is actually
> required, we'll guarantee completion when the user accesses it, not
> when madvise() is called. IOWs, madvise is async readahead, and so
> really should be constrained by readahead bounds and not user IO
> bounds.
>
> We could change this behaviour for madvise of large ranges that we
> force into the page cache by ignoring device readahead bounds, but
> I'm not sure we want to do this in general.
>
> Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> value in this situation to override the device limit for large
> ranges (for some definition of large - say 10x bdi->ra_pages) and
> restore it once the readahead operation is done. This would make it
> behave less like readahead and more like a user read from an IO
> perspective...

I'm not going to pretend like I'm an expert in this code or all the
distinctions that are in play. BUT, if you look at the high-level java
reproducer: it is requesting mmap of a finite size, starting from the
beginning of the file:

FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

Yet you're talking about the application like it is stabbingly
triggering unbounded async reads that can get dropped, etc, etc. I
just want to make sure the subtlety of (ab)using madvise(WILLNEED)
like this app does isn't incorrectly attributed to something it isn't.

The app really is effectively requesting a user read of a particular
extent in terms of mmap, right?

BTW, your suggestion to have this app fiddle with ra_pages and then
reset it is pretty awful (that is a global setting, being tweaked for
a single use, and exposing random IO to excessive readahead should
there be a heavy mix of IO to the backing block device).
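For reference, the global knob in question is the per-device
read_ahead_kb attribute that bdi->ra_pages is derived from. Just as an
illustration (not something from Ming's patch), a minimal userspace
sketch of peeking at it, with "sda" standing in for whatever the backing
device really is:

#include <stdio.h>

int main(void)
{
    /* Placeholder device name; substitute the real backing device. */
    const char *knob = "/sys/block/sda/queue/read_ahead_kb";
    unsigned long ra_kb;
    FILE *f = fopen(knob, "r");

    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fscanf(f, "%lu", &ra_kb) == 1)
        printf("%s = %lu kB\n", knob, ra_kb);
    fclose(f);

    /*
     * Bumping this value to make one madvise(WILLNEED) caller go faster
     * would also widen readahead for every other reader of the device,
     * which is exactly the objection above.
     */
    return 0;
}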
Seems the app is providing plenty of context that it shouldn't be bounded
in terms of readahead limits, so much so that Ming's patch conveys the
range the app provided to madvise(WILLNEED) so as to _know_ whether the
requested page(s) fall within the requested size; Linux just happens to
be fulfilling the syscall in terms of readahead.

FYI, here is the evolution of this use-case, going back to when it issued
larger IO in 3.14. Ming documented it; I'm just sharing in case it helps
(a rough C sketch of the same request pattern follows the history):

3.14:
  madvise_willneed()
    -> force_page_cache_readahead()

  force_page_cache_readahead() will read all pages in the specified range

3.15:
  madvise_willneed()
    -> vfs_fadvise()
      -> generic_fadvise()
        -> force_page_cache_readahead()

  force_page_cache_readahead() only reads at most device max_sectors
  bytes, and the remainder is read in filemap_fault()

Things start to change since:

1) 6d2be915e589 ("mm/readahead.c: fix readahead failure for memoryless
   NUMA nodes and limit readahead max_pages"), which limits reads in
   madvise_willneed() to at most 2Mbytes, so the remaining bytes are
   read in filemap_fault() via normal readahead

2) 600e19afc5f8 ("mm: use only per-device readahead limit"), which limits
   reads in madvise_willneed() to at most .ra_pages bytes

3) 9491ae4aade6 ("mm: don't cap request size based on read-ahead
   setting"), which relaxes the limit by reading at most max_sectors
   bytes in madvise_willneed()
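To make the above concrete, here is a rough, illustrative C equivalent of
the request pattern the reproducer ends up making -- assuming, as
discussed above, that the app really does issue madvise(MADV_WILLNEED)
across the whole mapping (presumably via MappedByteBuffer.load() or
similar). This is only a userspace approximation, not anything from
Ming's patch:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;
    void *addr;
    int fd;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(argv[1]);
        return 1;
    }

    /* Map a finite extent: the whole file, starting at offset 0. */
    addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /*
     * Ask for the full range up front. Whether this gets done as a few
     * large reads or is clamped to the device readahead limit is what
     * this thread is debating.
     */
    if (madvise(addr, st.st_size, MADV_WILLNEED) < 0)
        perror("madvise");

    return 0;
}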