Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range

Ming Lei <ming.lei@xxxxxxxxxx> · Tue, 30 Jan 2024 19:34:59 +0800

On Tue, Jan 30, 2024 at 04:29:35PM +1100, Dave Chinner wrote:
> On Tue, Jan 30, 2024 at 11:13:50AM +0800, Ming Lei wrote:
> > On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > > > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > > > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > > > Follows the current report:
> > > > > > 
> > > > > > 1) userspace calls madvise(willneed, 1G)
> > > > > > 
> > > > > > 2) only the 1st part (its size comes from bdi->io_pages, suppose it is
> > > > > > 2MB) is read ahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > > > > 
> > > > > > 3) the other parts (2M ~ 1G) are read ahead in units of bdi->ra_pages,
> > > > > > which userspace has set to 64KB, when userspace reads the mmaped
> > > > > > buffer, so the whole application becomes slower.
> > > > > 
> > > > > It gets limited by file->f_ra->ra_pages being initialised to
> > > > > bdi->ra_pages and then never changed as the advice for access
> > > > > methods to the file are changed.
> > > > > 
> > > > > But the problem here is *not the readahead code*. The problem is
> > > > > that the user has configured the device readahead window to be far
> > > > > smaller than is optimal for the storage. Hence readahead is slow.
> > > > > The fix for that is to either increase the device readahead windows,
> > > > > or to change the specific readahead window for the file that has
> > > > > sequential access patterns.
> > > > > 
> > > > > Indeed, we already have that - FADV_SEQUENTIAL will set
> > > > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > > > > larger IOs for that access.
> > > > > 
> > > > > That's what should happen here - MADV_WILLNEED does not imply a
> > > > > specific access pattern so the application should be running
> > > > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > > > > to start the readahead, and then the rest of the on-demand readahead
> > > > > will get the higher readahead limits.
> > > > > 
> > > > > > This patch changes 3) to use bdi->io_pages as readahead unit.
> > > > > 
> > > > > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > > > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > > > > mem.load() implementation in the application converted to use
> > > > > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > > > > readahead algorithm.
> > > > 
> > > > Here the single .ra_pages may not work, that is why this patch stores
> > > > the willneed range in maple tree, please see the following words from
> > > > the original RH report:
> > > 
> > > > "
> > > > Increasing read ahead is not an option as it has a mixed I/O workload of
> > > > random I/O and sequential I/O, so that a large read ahead is very counterproductive
> > > > to the random I/O and is unacceptable.
> > > > "
> > > 
> > > Yes, I've read the bug. There's no triage that tells us what the
> > > root cause of the application performance issue might be. Just an
> > > assertion that "this is how we did it 10 years ago, it's been
> > > unchanged for all this time, the new kernel we are upgrading
> > > to needs to behave exactly like pre-3.10 era kernels did".
> > > 
> > > And to be totally honest, my instincts tell me this is more likely a
> > > problem with a root cause in poor IO scheduling decisions than a
> > > problem with the page cache readahead implementation. Readahead has
> > > been turned down to stop the bandwidth it uses via background async
> > > read IO from starving latency dependent foreground random IO
> > > operation, and then we're being asked to turn readahead back up
> > > in specific situations because it's actually needed for performance
> > > in certain access patterns. This is the sort of thing bfq is
> > > intended to solve.
> > 
> > Reading mmaped buffer in userspace is sync IO, and page fault just
> > readahead 64KB. I don't understand how block IO scheduler makes a
> > difference in this single 64KB readahead in case of cache miss.
> 
> I think you've misunderstood what I said. I was referring to the
> original customer problem, the "too much readahead IO causes problems
> for latency sensitive IO" issue that led to the customer setting
> 64kB readahead device limits in the first place.

It looks like we are not on the same page: I don't see the words "latency
sensitive IO" anywhere in this report (RHEL-22476).

> 
> That is, if reducing readahead for sequential IO suddenly makes
> synchronous random IO perform a whole lot better and the application
> goes faster, then it indicates the problem is IO dispatch
> prioritisation, not that there is too much readahead. Deprioritising
> readahead will reduce its impact on other IO, without having to
> reduce the readahead windows that provide decent sequential IO
> performance...
> 
> I really think the customer needs to retune their application from
> first principles. Start with the defaults, measure where things are
> slow, address the worst issue by twiddling knobs. Repeat until
> performance is either good enough or they hit on actual problems
> that need code changes.

IO priority is set at the blkcg/process level, and we don't even know whether
the random IO and the sequential IO are submitted from different processes.

Also, IO priority only takes effect when IOs with different priorities are
submitted concurrently.

The main input from the report is that iostat shows the read IO request size
dropping from 1MB to 64K, which isn't something IO priority can deal with.
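
To put rough numbers on that, here is a minimal sketch of the arithmetic. It
assumes the 1G range from the report means 1 GiB and that every read request
is issued at the full size shown by iostat; requests_needed() is a
hypothetical helper, not something from the kernel or the report.

```c
#include <stdint.h>

/* Number of read requests needed to cover range_bytes, rounding up. */
uint64_t requests_needed(uint64_t range_bytes, uint64_t request_bytes)
{
    return (range_bytes + request_bytes - 1) / request_bytes;
}
```

Under those assumptions, covering 1 GiB takes 16384 requests at 64K versus
1024 requests at 1MB, i.e. 16 times as many IOs for the same data.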

From my understanding, the problem here is that the advice (ADV_RANDOM,
ADV_SEQUENTIAL, ADV_WILLNEED) is basically applied at the file level instead of
the range level, even though a range is passed in to these syscalls.

So sequential and random advice are effectively mutually exclusive across the whole file.

That is why the customer doesn't want to set a bigger ra_pages: they are
afraid (or have tested) that a bigger ra_pages hurts the performance of the
random IO workload, because unnecessary data may be read ahead. The readahead
algorithm is supposed to be capable of dealing with that, but maybe it still
doesn't do so well enough.
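
For reference, the userspace sequence suggested earlier in the thread
(MADV_SEQUENTIAL first, then MADV_WILLNEED) could look roughly like the sketch
below. map_sequential_willneed() is a hypothetical helper name, and the code
assumes the whole file is mapped read-only from offset 0; it only illustrates
the call order and says nothing about what the application in the report
actually does.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose madvise() and MADV_* in strict -std= modes */
#endif
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the report): map the first `len` bytes
 * of `fd` read-only, declare a sequential access pattern, then ask the
 * kernel to start reading the range ahead.
 * Returns the mapping, or NULL on failure.
 */
void *map_sequential_willneed(int fd, size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    /* Give the access-pattern hint first, then request the readahead. */
    if (madvise(buf, len, MADV_SEQUENTIAL) != 0 ||
        madvise(buf, len, MADV_WILLNEED) != 0) {
        perror("madvise");
        munmap(buf, len);
        return NULL;
    }
    return buf;
}
```

The caller is expected to munmap() the returned buffer when done. Whether this
ordering is enough to avoid the 64K requests in the reported workload is
exactly the open question here.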

But yes, more details are needed to understand the issue further.

thanks, 
Ming