On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> On Sun, Jan 28 2024 at 5:02P -0500,
> Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote:
> > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
> > > only tries to readahead 512 pages, and the remaining part of the advised
> > > range falls back to normal readahead.
> >
> > Does the MAINTAINERS file mean nothing any more?
>
> "Ming, please use scripts/get_maintainer.pl when submitting patches."

That's an appropriate response to a new contributor, sure.  Ming has
been submitting patches since, what, 2008?  Surely they know how to
submit patches by now.

> I agree this patch's header could've worked harder to establish the
> problem that it fixes.  But I'll now take a crack at backfilling the
> regression report that motivated this patch to be developed:

Thank you.

> Linux 3.14 was the last kernel in which madvise(MADV_WILLNEED)
> allowed mmap'ing a file more optimally if read_ahead_kb < max_sectors_kb.
>
> This regressed with commit 6d2be915e589 (so Linux 3.15) such that
> mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024
> and running this reproducer against a 2G file will take ~5x longer
> (depending on the system's capabilities).  mmap_load_test.java follows:
>
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
> import java.io.RandomAccessFile;
> import java.nio.MappedByteBuffer;
> import java.nio.channels.FileChannel;
> import java.io.File;
> import java.io.FileNotFoundException;
> import java.io.IOException;
>
> public class mmap_load_test {
>
>     public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
>         if (args.length == 0) {
>             System.out.println("Please provide a file");
>             System.exit(0);
>         }
>         FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
>         MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
>
>         System.out.println("Loading the file");
>
>         long startTime = System.currentTimeMillis();
>         mem.load();
>         long endTime = System.currentTimeMillis();
>         System.out.println("Done! Loading took " + (endTime-startTime) + " ms");
>
>     }
> }

It's good to have the original reproducer.  The unfortunate part is
that, being at such a high level, it doesn't really show what syscalls
the library makes on behalf of the application.  I'll take your word
for it that it calls madvise(MADV_WILLNEED).  An strace might not go
amiss.

> reproduce with:
>
> javac mmap_load_test.java
> echo 64 > /sys/block/sda/queue/read_ahead_kb
> echo 1024 > /sys/block/sda/queue/max_sectors_kb
> mkfs.xfs /dev/sda
> mount /dev/sda /mnt/test
> dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000
>
> echo 3 > /proc/sys/vm/drop_caches

(I prefer to unmount/mount /mnt/test; it drops the cache for
/mnt/test/2G_file without affecting the rest of the system)

> java mmap_load_test /mnt/test/2G_file
>
> Without a fix, like the patch Ming provided, iostat will show rareq-sz
> is 64 rather than ~1024.

Understood.  But ... the application is asking for as much readahead as
possible, and the sysadmin has said "Don't readahead more than 64kB at
a time".  So why will we not get a bug report in 1-15 years' time saying
"I put a limit on readahead and the kernel is ignoring it"?  I think we
typically allow the sysadmin to override application requests, don't we?
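For reference, here is a minimal C sketch of what the application side is
presumably asking for -- map the file and MADV_WILLNEED the whole range,
which is what I'm assuming mem.load() boils down to.  Untested, and
willneed.c is just my name for it; it's only meant to make the
syscall-level picture concrete:

/* willneed.c -- hypothetical standalone reproducer of the application-side
 * request: map a file and ask the kernel to read the whole thing ahead.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct stat st;
	void *addr;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(argv[1]);
		return 1;
	}

	addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* The application-side request: read ahead the entire file. */
	if (madvise(addr, st.st_size, MADV_WILLNEED) < 0)
		perror("madvise");

	/* Fault in every page, analogous to what mem.load() does. */
	for (off_t off = 0; off < st.st_size; off += 4096)
		(void)*((volatile char *)((char *)addr + off));

	munmap(addr, st.st_size);
	close(fd);
	return 0;
}

The open question above is whether that single whole-file request should
be honoured in 64kB chunks (the sysadmin's read_ahead_kb limit) or in
larger ones.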
> > > @@ -972,6 +974,7 @@ struct file_ra_state {
> > >  	unsigned int ra_pages;
> > >  	unsigned int mmap_miss;
> > >  	loff_t prev_pos;
> > > +	struct maple_tree *need_mt;
> >
> > No.  Embed the struct maple tree.  Don't allocate it.
>
> Constructive feedback, thanks.
>
> > What made you think this was the right approach?
>
> But then you closed with an attack, rather than inform Ming and/or
> others why you feel so strongly, e.g.: Best to keep memory used for
> file_ra_state contiguous.

That's not an attack, it's a genuine question.  Is there somewhere else
doing it wrong that Ming copied from?  Does the documentation need to
be clearer?  I can't fix what I don't know.
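To spell out what I mean by "embed", something along these lines (an
untested sketch, not a real patch; the surrounding fields are today's
struct file_ra_state and need_mt is the name from Ming's patch):

struct file_ra_state {
	pgoff_t start;
	unsigned int size;
	unsigned int async_size;
	unsigned int ra_pages;
	unsigned int mmap_miss;
	loff_t prev_pos;
	struct maple_tree need_mt;	/* embedded, not a pointer */
};

void file_ra_state_init(struct file_ra_state *ra,
			struct address_space *mapping)
{
	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
	ra->prev_pos = -1;
	mt_init(&ra->need_mt);	/* initialised in place, nothing to kmalloc */
}

The tree's nodes still need mtree_destroy() when the ra state goes away,
but the struct itself never needs a separate allocation, a NULL check, or
a pointer chase on every readahead decision, which is why embedding is
the usual pattern.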