On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote: > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote: > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > > > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > > > > On Sun, Jan 28 2024 at 5:02P -0500, > > > > Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > > Understood. But ... the application is asking for as much readahead as > > > possible, and the sysadmin has said "Don't readahead more than 64kB at > > > a time". So why will we not get a bug report in 1-15 years time saying > > > "I put a limit on readahead and the kernel is ignoring it"? I think > > > typically we allow the sysadmin to override application requests, > > > don't we? > > > > The application isn't knowingly asking for readahead. It is asking to > > mmap the file (and reporter wants it done as quickly as possible.. > > like occurred before). > > ... which we do within the constraints of the given configuration. > > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap > > request size based on read-ahead setting") -- same logic, just applied > > to callchain that ends up using madvise(MADV_WILLNEED). > > Not really. There is a difference between performing a synchronous > read IO here that we must complete, compared to optimistic > asynchronous read-ahead which we can fail or toss away without the > user ever seeing the data the IO returned. Yeah, the big readahead in this patch happens when user starts to read over mmaped buffer instead of madvise(). > > We want required IO to be done in as few, larger IOs as possible, > and not be limited by constraints placed on background optimistic > IOs. > > madvise(WILLNEED) is optimistic IO - there is no requirement that it > complete the data reads successfully. If the data is actually > required, we'll guarantee completion when the user accesses it, not > when madvise() is called. IOWs, madvise is async readahead, and so > really should be constrained by readahead bounds and not user IO > bounds. > > We could change this behaviour for madvise of large ranges that we > force into the page cache by ignoring device readahead bounds, but > I'm not sure we want to do this in general. > > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages > value in this situation to override the device limit for large > ranges (for some definition of large - say 10x bdi->ra_pages) and > restore it once the readahead operation is done. This would make it > behave less like readahead and more like a user read from an IO > perspective... ->ra_pages is just one hint, which is 128KB at default, and either device or userspace can override it. fadvise/madvise(willneed) already readahead bytes from bdi->io_pages which is the max device sector size(often 10X of ->ra_pages), please see force_page_cache_ra(). Follows the current report: 1) usersapce call madvise(willneed, 1G) 2) only the 1st part(size is from bdi->io_pages, suppose it is 2MB) is readahead in madvise(willneed, 1G) since commit 6d2be915e589 3) the other parts(2M ~ 1G) is readahead by unit of bdi->ra_pages which is set as 64KB by userspace when userspace reads the mmaped buffer, then the whole application becomes slower. This patch changes 3) to use bdi->io_pages as readahead unit. Thanks, Ming