On Sun, Jan 28, 2024 at 09:12:12PM -0500, Mike Snitzer wrote:
> On Sun, Jan 28, 2024 at 8:48 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > > On Sun, Jan 28 2024 at 5:02P -0500,
> > > > > Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > >
> > > > Understood.  But ... the application is asking for as much
> > > > readahead as possible, and the sysadmin has said "Don't readahead
> > > > more than 64kB at a time".  So why will we not get a bug report
> > > > in 1-15 years time saying "I put a limit on readahead and the
> > > > kernel is ignoring it"?  I think typically we allow the sysadmin
> > > > to override application requests, don't we?
> > >
> > > The application isn't knowingly asking for readahead.  It is
> > > asking to mmap the file (and reporter wants it done as quickly as
> > > possible.. like occurred before).
> >
> > .. which we do within the constraints of the given configuration.
> >
> > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't
> > > cap request size based on read-ahead setting") -- same logic, just
> > > applied to callchain that ends up using madvise(MADV_WILLNEED).
> >
> > Not really.  There is a difference between performing a synchronous
> > read IO here that we must complete, compared to optimistic
> > asynchronous read-ahead which we can fail or toss away without the
> > user ever seeing the data the IO returned.
> >
> > We want required IO to be done in as few, larger IOs as possible,
> > and not be limited by constraints placed on background optimistic
> > IOs.
> >
> > madvise(WILLNEED) is optimistic IO - there is no requirement that
> > it complete the data reads successfully.
> > If the data is actually required, we'll guarantee completion when
> > the user accesses it, not when madvise() is called.  IOWs, madvise
> > is async readahead, and so really should be constrained by
> > readahead bounds and not user IO bounds.
> >
> > We could change this behaviour for madvise of large ranges that we
> > force into the page cache by ignoring device readahead bounds, but
> > I'm not sure we want to do this in general.
> >
> > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> > value in this situation to override the device limit for large
> > ranges (for some definition of large - say 10x bdi->ra_pages) and
> > restore it once the readahead operation is done.  This would make
> > it behave less like readahead and more like a user read from an IO
> > perspective...
>
> I'm not going to pretend like I'm an expert in this code or all the
> distinctions that are in play.  BUT, if you look at the high-level
> java reproducer: it is requesting mmap of a finite size, starting
> from the beginning of the file:
>
> FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
> MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

Mapping an entire file does not mean "we are going to access the
entire file".  Lots of code will do this, especially code that does
random accesses within the file.

> Yet you're talking about the application like it is stabbingly
> triggering unbounded async reads that can get dropped, etc, etc.  I

I don't know what the application actually does.  All I see is a
microbenchmark that mmaps() a file and walks it sequentially - on a
system where readahead has been tuned to de-prioritise sequential IO
performance.

> just want to make sure the subtlety of (ab)using madvise(WILLNEED)
> like this app does isn't incorrectly attributed to something it
> isn't.  The app really is effectively requesting a user read of a
> particular extent in terms of mmap, right?
madvise() is an -advisory- interface that does not guarantee any
specific behaviour.  The man page says:

   MADV_WILLNEED
          Expect access in the near future.  (Hence, it might be a
          good idea to read some pages ahead.)

It says nothing about guaranteeing that all the data is brought into
memory, or that if it does get brought into memory it will remain
there until the application accesses it.  It doesn't even imply that
IO will be done immediately.  Any application relying on madvise() to
fully populate the page cache range before returning is expecting
behaviour that is neither documented nor guaranteed.

Similarly, the fadvise64() man page does not say that WILLNEED will
bring the entire file into cache:

   POSIX_FADV_WILLNEED
          The specified data will be accessed in the near future.

          POSIX_FADV_WILLNEED initiates a nonblocking read of the
          specified region into the page cache.  The amount of data
          read may be decreased by the kernel depending on virtual
          memory load.  (A few megabytes will usually be fully
          satisfied, and more is rarely useful.)

> BTW, your suggestion to have this app fiddle with ra_pages and then

No, I did not suggest that the app fiddle with anything.  I was
talking about the in-kernel FADV_WILLNEED implementation changing
file->f_ra.ra_pages - similar to how FADV_RANDOM and FADV_SEQUENTIAL
do - to change readahead IO behaviour.  That then allows subsequent
readahead on that vma->file to use a larger value than the default
pulled in off the device.

Largely, I think the problem is that the application is being run
with a readahead limit set too low for optimal sequential
performance.  Complaining that readahead is slow when it has been
explicitly tuned to be slow doesn't really seem like a problem we can
fix with code.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx