On Tue, Jan 30, 2024 at 09:07:24AM +1100, Dave Chinner wrote:
> On Mon, Jan 29, 2024 at 04:25:41PM +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2024 at 04:15:16PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 29, 2024 at 11:57:45AM +0800, Ming Lei wrote:
> > > > On Mon, Jan 29, 2024 at 12:47:41PM +1100, Dave Chinner wrote:
> > > > > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > > Follows the current report:
> > > >
> > > > 1) userspace calls madvise(willneed, 1G)
> > > >
> > > > 2) only the 1st part (size is from bdi->io_pages, suppose it is 2MB) is
> > > > readahead in madvise(willneed, 1G) since commit 6d2be915e589
> > > >
> > > > 3) the other parts (2M ~ 1G) are readahead by unit of bdi->ra_pages which is
> > > > set as 64KB by userspace when userspace reads the mmaped buffer, then
> > > > the whole application becomes slower.
> > >
> > > It gets limited by file->f_ra.ra_pages being initialised to
> > > bdi->ra_pages and then never changed as the advice for access
> > > methods to the file are changed.
> > >
> > > But the problem here is *not the readahead code*. The problem is
> > > that the user has configured the device readahead window to be far
> > > smaller than is optimal for the storage. Hence readahead is slow.
> > > The fix for that is to either increase the device readahead windows,
> > > or to change the specific readahead window for the file that has
> > > sequential access patterns.
> > >
> > > Indeed, we already have that - FADV_SEQUENTIAL will set
> > > file->f_ra.ra_pages to 2 * bdi->ra_pages so that readahead uses
> > > larger IOs for that access.
> > >
> > > That's what should happen here - MADV_WILLNEED does not imply a
> > > specific access pattern so the application should be running
> > > MADV_SEQUENTIAL (triggers aggressive readahead) then MADV_WILLNEED
> > > to start the readahead, and then the rest of the on-demand readahead
> > > will get the higher readahead limits.
> > >
> > > > This patch changes 3) to use bdi->io_pages as readahead unit.
> > >
> > > I think it really should be changing MADV/FADV_SEQUENTIAL to set
> > > file->f_ra.ra_pages to bdi->io_pages, not bdi->ra_pages * 2, and the
> > > mem.load() implementation in the application converted to use
> > > MADV_SEQUENTIAL to properly indicate its access pattern to the
> > > readahead algorithm.
> >
> > Here the single .ra_pages may not work, that is why this patch stores
> > the willneed range in maple tree, please see the following words from
> > the original RH report:
> >
> > "
> > Increasing read ahead is not an option as it has a mixed I/O workload of
> > random I/O and sequential I/O, so that a large read ahead is very counterproductive
> > to the random I/O and is unacceptable.
> > "
>
> Yes, I've read the bug. There's no triage that tells us what the
> root cause of the application performance issue might be. Just an
> assertion that "this is how we did it 10 years ago, it's been
> unchanged for all this time, the new kernel we are upgrading
> to needs to behave exactly like pre-3.10 era kernels did."
>
> And to be totally honest, my instincts tell me this is more likely a
> problem with a root cause in poor IO scheduling decisions than be a
> problem with the page cache readahead implementation. Readahead has
> been turned down to stop the bandwidth it uses via background async
> read IO from starving latency dependent foreground random IO
> operation, and then we're being asked to turn readahead back up
> in specific situations because it's actually needed for performance
> in certain access patterns. This is the sort of thing bfq is
> intended to solve.

Reading the mmaped buffer in userspace is sync IO, and each page fault
reads ahead just 64KB. I don't understand how the block IO scheduler
makes a difference to this single 64KB readahead in case of a cache
miss.

> >
> > Also almost all these advice values (SEQUENTIAL, WILLNEED, NORMAL, RANDOM)
> > ignore the passed range, and the behavior becomes all or nothing,
> > instead of something only for the specified range, which may not
> > match the man page, please see 'man posix_fadvise':
>
> The man page says:
>
>	The advice is not binding; it merely constitutes an
>	expectation on behalf of the application.
>
> > It is even worse for readahead() syscall:
> >
> > ```
> > DESCRIPTION
> >        readahead() initiates readahead on a file
> >        so that subsequent reads from that file will be satisfied
> >        from the cache, and not block on disk I/O (assuming the
> >        readahead was initiated early enough and that other activity
> >        on the system did not in the meantime flush pages from the
> >        cache).
> > ```
>
> Yes, that's been "broken" for a long time (since the changes to cap
> force_page_cache_readahead() to ra_pages way back when), but the
> assumption documented about when readahead(2) will work goes to the
> heart of why we don't let user controlled readahead actually do much
> in the way of direct readahead. i.e. too much readahead is
> typically harmful to IO and system performance and very, very few
> applications actually need files preloaded entirely into memory.

That is true for normal readahead, but I am not sure it is for
advise(willneed) or readahead().

>
> ----
>
> All said, I'm starting to think that there isn't an immediate
> upstream kernel change needed right now. I just did a quick check
> through the madvise() man page to see if I'd missed anything, and I
> most definitely did miss what is a relatively new addition to it:
>
>	MADV_POPULATE_READ (since Linux 5.14)
>	    "Populate (prefault) page tables readable, faulting in all
>	    pages in the range just as if manually reading from each page;
>	    however, avoid the actual memory access that would have been
>	    performed after handling the fault.
>
>	    In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide
>	    errors, can be applied to (parts of) existing mappings and will
>	    always populate (prefault) page tables readable. One
>	    example use case is prefaulting a file mapping, reading all
>	    file content from disk; however, pages won't be dirtied and
>	    consequently won't have to be written back to disk when
>	    evicting the pages from memory."
>
> That's exactly what the application is apparently wanting
> MADV_WILLNEED to do.

Indeed, it works as expected in my test code (all mmapped pages are
loaded by madvise(MADV_POPULATE_READ)), except for 16 ra_pages, but
that is less important now.

Thanks for this idea!

>
> Please read the commit message for commit 4ca9b3859dac ("mm/madvise:
> introduce MADV_POPULATE_(READ|WRITE) to prefault page tables"). It
> has some relevant commentary on why MADV_WILLNEED could not be
> modified to meet the pre-population requirements of the applications
> that required this pre-population behaviour from the kernel.
>
> With this, I suspect that the application needs to be updated to
> use MADV_POPULATE_READ rather than MADV_WILLNEED, and then we can go
> back and do some analysis of the readahead behaviour of the
> application and the MADV_POPULATE_READ operation. We may need to
> tweak MADV_POPULATE_READ for large readahead IO, but that's OK
> because it's no longer "optimistic speculation" about whether the
> data is needed in cache - the operation being performed guarantees
> that or it fails with an error. IOWs, MADV_POPULATE_READ is
> effectively user data IO at this point, not advice about future
> access patterns...

BTW, in this report, MADV_WILLNEED is used by a Java library[1], and I
guess it could be difficult to update it to use MADV_POPULATE_READ.

[1] https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html

Thanks,
Ming
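
For illustration only, the userspace side of the two suggestions in
this thread could be sketched roughly as below. This is a minimal
sketch under stated assumptions, not the application's code and not
the test code mentioned above: it is written in Python for brevity,
relies on mmap.madvise() (Python 3.8+ on Linux), and assumes
MADV_POPULATE_READ has the value 22 from the generic Linux uapi headers
when the mmap module does not export it. The helper names prefault()
and demo() are hypothetical. It tries MADV_POPULATE_READ first and, if
the kernel rejects it with EINVAL (pre-5.14), falls back to
MADV_SEQUENTIAL followed by MADV_WILLNEED.

```python
import errno
import mmap
import tempfile

# Assumption: 22 is MADV_POPULATE_READ in asm-generic/mman-common.h;
# only used if this Python build does not export the constant.
MADV_POPULATE_READ = getattr(mmap, "MADV_POPULATE_READ", 22)


def prefault(mm):
    """Hypothetical helper: try to get the whole mapping into memory."""
    try:
        # Linux >= 5.14: faults in every page of the range, or fails
        # with an error instead of silently doing less.
        mm.madvise(MADV_POPULATE_READ)
        return "populate_read"
    except OSError as err:
        if err.errno != errno.EINVAL:
            raise
    # Older kernels: state the access pattern first, then request
    # readahead. Both calls are advice only and may do less than asked.
    mm.madvise(mmap.MADV_SEQUENTIAL)
    mm.madvise(mmap.MADV_WILLNEED)
    return "sequential+willneed"


def demo():
    """Map a 1 MiB scratch file read-only and prefault it."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"x" * (1 << 20))
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return prefault(mm), len(mm)
```

Calling demo() returns which path was taken together with the mapping
length. Only the first path gives the "all pages loaded or an error"
guarantee discussed above; the fallback remains subject to the
readahead limits this thread is about.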