On Mon, Jan 29, 2024 at 12:21:16AM +0000, Matthew Wilcox wrote: > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote: > > On Sun, Jan 28 2024 at 5:02P -0500, > > Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > > > > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote: > > > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for > > > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED > > > > only tries to readahead 512 pages, and the remained part in the advised > > > > range fallback on normal readahead. > > > > > > Does the MAINTAINERS file mean nothing any more? > > > > "Ming, please use scripts/get_maintainer.pl when submitting patches." > > That's an appropriate response to a new contributor, sure. Ming has > been submitting patches since, what, 2008? Surely they know how to > submit patches by now. > > > I agree this patch's header could've worked harder to establish the > > problem that it fixes. But I'll now take a crack at backfilling the > > regression report that motivated this patch be developed: > > Thank you. > > > Linux 3.14 was the last kernel to allow madvise (MADV_WILLNEED) > > allowed mmap'ing a file more optimally if read_ahead_kb < max_sectors_kb. > > > > Ths regressed with commit 6d2be915e589 (so Linux 3.15) such that > > mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024 > > and running this reproducer against a 2G file will take ~5x longer > > (depending on the system's capabilities), mmap_load_test.java follows: > > > > import java.nio.ByteBuffer; > > import java.nio.ByteOrder; > > import java.io.RandomAccessFile; > > import java.nio.MappedByteBuffer; > > import java.nio.channels.FileChannel; > > import java.io.File; > > import java.io.FileNotFoundException; > > import java.io.IOException; > > > > public class mmap_load_test { > > > > public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException { > > if (args.length == 0) { > > System.out.println("Please provide a file"); > > System.exit(0); > > } > > FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel(); > > MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size()); > > > > System.out.println("Loading the file"); > > > > long startTime = System.currentTimeMillis(); > > mem.load(); > > long endTime = System.currentTimeMillis(); > > System.out.println("Done! Loading took " + (endTime-startTime) + " ms"); > > > > } > > } > > It's good to have the original reproducer. The unfortunate part is > that being at such a high level, it doesn't really show what syscalls > the library makes on behalf of the application. I'll take your word > for it that it calls madvise(MADV_WILLNEED). An strace might not go > amiss. Yeah, it can be fadvise(WILLNEED)/readahead syscall too. > > > reproduce with: > > > > javac mmap_load_test.java > > echo 64 > /sys/block/sda/queue/read_ahead_kb > > echo 1024 > /sys/block/sda/queue/max_sectors_kb > > mkfs.xfs /dev/sda > > mount /dev/sda /mnt/test > > dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000 > > > > echo 3 > /proc/sys/vm/drop_caches > > (I prefer to unmount/mount /mnt/test; it drops the cache for > /mnt/test/2G_file without affecting the rest of the system) > > > java mmap_load_test /mnt/test/2G_file > > > > Without a fix, like the patch Ming provided, iostat will show rareq-sz > > is 64 rather than ~1024. > > Understood. But ... the application is asking for as much readahead as > possible, and the sysadmin has said "Don't readahead more than 64kB at > a time". So why will we not get a bug report in 1-15 years time saying > "I put a limit on readahead and the kernel is ignoring it"? I think > typically we allow the sysadmin to override application requests, > don't we? ra_pages is just one hint for readahead, the reality is that sysadmin can't understand how much bytes is perfect for readahead. But application often knows how much bytes it will need, so here I think application requirement should have higher priority, especially when application doesn't want kernel to readahead blindly. And madvise/fadvise(WILLNEED) syscall already reads bdi->io_pages first, and which is bigger than ra_pages. Thanks, Ming