Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range

Ming Lei <ming.lei@xxxxxxxxxx> · Tue, 30 Jan 2024 11:13:50 +0800

On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote:
> On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > The current report is as follows:
> > > > 
> > > > 1) userspace calls madvise(willneed, 1G)
> > > > 
> > > > 2) only the 1st part (its size comes from bdi->io_pages, suppose it is
> > > > 2MB) is read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > > 
> > > > 3) the other parts (2M ~ 1G) are read ahead in units of bdi->ra_pages,
> > > > which userspace has set to 64KB, when userspace reads the mmaped buffer,
> > > > so the whole application becomes slower.
> > > 
> > > It gets limited by file->f_ra.ra_pages being initialised to
> > > bdi->ra_pages and then never changed as the advice for access
> > > methods to the file are changed.
> > > 
> > > But the problem here is *not the readahead code*. The problem is
> > > that the user has configured the device readahead window to be far
> > > smaller than is optimal for the storage. Hence readahead is slow.
> > > The fix for that is to either increase the device readahead windows,
> > > or to change the specific readahead window for the file that has
> > > sequential access patterns.
> > > 
> > > Indeed, we already have that - FADV_SEQUENTIAL will set
> > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > > larger IOs for that access.
> > > 
> > > That's what should happen here - MADV_WILLNEED does not imply a
> > > specific access pattern so the application should be running
> > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > > to start the readahead, and then the rest of the on-demand readahead
> > > will get the higher readahead limits.
> > > 
> > > > This patch changes 3) to use bdi->io_pages as readahead unit.
> > > 
> > > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > > mem.load() implementation in the application converted to use
> > > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > > readahead algorithm.
> > 
> > Here a single .ra_pages may not work, which is why this patch stores
> > the willneed range in a maple tree. Please see the following words from
> > the original RH report:
> 
> > "
> > Increasing read ahead is not an option as it has a mixed I/O workload of
> > random I/O and sequential I/O, so that a large read ahead is very counterproductive
> > to the random I/O and is unacceptable.
> > "
> 
> Yes, I've read the bug. There's no triage that tells us what the
> root cause of the application performance issue might be. Just an
> assertion that "this is how we did it 10 years ago, it's been
> unchanged for all this time, the new kernel we are upgrading
> to needs to behave exactly like pre-3.10 era kernels did."
> 
> And to be totally honest, my instincts tell me this is more likely a
> problem with a root cause in poor IO scheduling decisions than be a
> problem with the page cache readahead implementation. Readahead has
> been turned down to stop the bandwidth it uses via background async
> read IO from starving latency dependent foreground random IO
> operation, and then we're being asked to turn readahead back up
> in specific situations because it's actually needed for performance
> in certain access patterns. This is the sort of thing bfq is
> intended to solve.

Reading an mmaped buffer in userspace is sync IO, and each page fault only
reads ahead 64KB. I don't understand how the block IO scheduler makes a
difference to this single 64KB readahead in case of a cache miss.
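
For a rough sense of scale, the figures from the report can be plugged into a
back-of-envelope count of readahead windows. This is only a sketch:
readahead_rounds() is a hypothetical helper, not kernel code, the 1G/2MB/64KB
values are the ones assumed above, and it ignores async readahead pipelining
and pages that are already in the cache.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: number of readahead windows needed
 * to cover the part of 'range' that the initial madvise() did not read,
 * assuming every window is exactly 'unit' bytes (last one rounded up).
 */
static uint64_t readahead_rounds(uint64_t range, uint64_t first, uint64_t unit)
{
	assert(unit > 0);

	if (first >= range)
		return 0;
	return (range - first + unit - 1) / unit;
}
```

With range = 1G and first = 2MB, a 64KB unit gives 16352 windows, while a 2MB
unit gives 511.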

> 
> 
> > Also almost all of these advice values (SEQUENTIAL, WILLNEED, NORMAL,
> > RANDOM) ignore the passed range, so the behavior becomes all or nothing
> > instead of applying only to the specified range, which may not
> > match the man page, please see 'man posix_fadvise':
> 
> The man page says:
> 
> 	The advice is not binding; it merely constitutes an
> 	expectation on behalf of the application.
> 
> > It is even worse for readahead() syscall:
> > 
> > 	``` DESCRIPTION readahead()  initiates readahead on a file
> > 	so that subsequent reads from that file will be satisfied
> > 	from the cache, and not block on disk I/O (assuming the
> > 	readahead was initiated early enough and that other activity
> > 	on the system did not in the meantime flush pages from the
> > 	cache).  ```
> 
> Yes, that's been "broken" for a long time (since the changes to cap
> force_page_cache_readahead() to ra_pages way back when), but the
> assumption documented about when readahead(2) will work goes to the
> heart of why we don't let user controlled readahead actually do much
> in the way of direct readahead. i.e. too much readahead is
> typically harmful to IO and system performance and very, very few
> applications actually need files preloaded entirely into memory.

That is true for normal readahead, but I am not sure it holds for
advise(willneed) or readahead().

> 
> ----
> 
> All said, I'm starting to think that there isn't an immediate
> upstream kernel change needed right now.  I just did a quick check
> through the madvise() man page to see if I'd missed anything, and I
> most definitely did miss what is a relatively new addition to it:
> 
> MADV_POPULATE_READ (since Linux 5.14)
>      "Populate (prefault) page tables readable, faulting in all
>      pages in the range just as if manually reading from each page;
>      however, avoid the actual memory access that would have been
>      performed after handling the fault.
> 
>      In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide
>      errors, can be applied to (parts of) existing mappings and will
>      always populate (prefault) page tables readable.  One
>      example use case is prefaulting a file mapping, reading all
>      file content from disk; however, pages won't be dirtied and
>      consequently won't have to be written back to disk when
>      evicting the pages from memory."
> 
> That's exactly what the application is apparently wanting
> MADV_WILLNEED to do.

Indeed, it works as expected in my test code (all mmapped pages are loaded
in madvise(MADV_POPULATE_READ)), except for 16 ra_pages, but that is less
important now.

Thanks for this idea!

> 
> Please read the commit message for commit 4ca9b3859dac ("mm/madvise:
> introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It
> has some relevant commentary on why MADV_WILLNEED could not be
> modified to meet the pre-population requirements of the applications
> that required this pre-population behaviour from the kernel.
> 
> With this, I suspect that the application needs to be updated to
> use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go
> back and do some analysis of the readahead behaviour of the
> application and the MADV_POPULATE_READ operation. We may need to
> tweak MADV_POPULATE_READ for large readahead IO, but that's OK
> because it's no longer "optimistic speculation" about whether the
> data is needed in cache - the operation being performed guarantees
> that or it fails with an error. IOWs, MADV_POPULATE_READ is
> effectively user data IO at this point, not advice about future
> access patterns...

BTW, in this report, MADV_WILLNEED is used by a Java library[1], and I
guess it could be difficult to update it to MADV_POPULATE_READ.

[1] https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html

Thanks,
Ming