On Sun, Jan 28 2024 at 5:02P -0500, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:

> On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote:
> > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> > memoryless NUMA nodes and limit readahead max_pages"), MADV_WILLNEED
> > only tries to readahead 512 pages, and the remaining part of the
> > advised range falls back on normal readahead.
>
> Does the MAINTAINERS file mean nothing any more?

"Ming, please use scripts/get_maintainer.pl when submitting patches."
(I've cc'd accordingly with this email.)

> > If bdi->ra_pages is set small, readahead will not perform efficiently
> > enough.  Increasing readahead may not be an option since the workload
> > may have mixed random and sequential I/O.
>
> I think there needs to be a lot more explanation than this about what's
> going on before we jump to "And therefore this patch is the right
> answer".

The patch is "RFC".  Ming didn't declare his RFC is "the right answer".
All ideas for how best to fix this issue are welcome.

I agree this patch's header could've worked harder to establish the
problem that it fixes.  But I'll now take a crack at backfilling the
regression report that motivated this patch's development:

Linux 3.14 was the last kernel where madvise(MADV_WILLNEED) allowed
mmap'ing a file to be read more optimally when
read_ahead_kb < max_sectors_kb.  This regressed with commit 6d2be915e589
(so Linux 3.15) such that mounting XFS on a device with read_ahead_kb=64
and max_sectors_kb=1024 and running the reproducer below against a 2G
file takes ~5x longer (depending on the system's capabilities).
mmap_load_test.java follows:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

public class mmap_load_test {

    public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {

        if (args.length == 0) {
            System.out.println("Please provide a file");
            System.exit(0);
        }

        FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
        MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

        System.out.println("Loading the file");

        long startTime = System.currentTimeMillis();
        mem.load();
        long endTime = System.currentTimeMillis();
        System.out.println("Done! Loading took " + (endTime - startTime) + " ms");
    }
}

reproduce with:

javac mmap_load_test.java
echo 64 > /sys/block/sda/queue/read_ahead_kb
echo 1024 > /sys/block/sda/queue/max_sectors_kb
mkfs.xfs /dev/sda
mount /dev/sda /mnt/test
dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000
echo 3 > /proc/sys/vm/drop_caches
java mmap_load_test /mnt/test/2G_file

Without a fix, like the patch Ming provided, iostat will show rareq-sz
is 64 rather than ~1024.  (A rough C analogue of the Java reproducer,
driving madvise(MADV_WILLNEED) directly, is sketched at the end of this
mail.)

> > @@ -972,6 +974,7 @@ struct file_ra_state {
> >  	unsigned int ra_pages;
> >  	unsigned int mmap_miss;
> >  	loff_t prev_pos;
> > +	struct maple_tree *need_mt;
>
> No.  Embed the struct maple tree.  Don't allocate it.

Constructive feedback, thanks.

> What made you think this was the right approach?

But then you closed with an attack, rather than informing Ming and/or
others why you feel so strongly, e.g.: "Best to keep memory used for
file_ra_state contiguous."
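To make that concrete for anyone following along: below is a purely
illustrative sketch (mine, not code from Ming's patch) of the difference
between the pointer in the RFC and embedding the maple tree as suggested.
The "_example" struct names are hypothetical; only the existing
file_ra_state members shown in the hunk above are real.

/*
 * Illustration only: pointer vs. embedded maple tree in file_ra_state.
 */
#include <linux/maple_tree.h>

/* Shape of the RFC: a pointer, so a separate allocation to manage. */
struct file_ra_state_alloc_example {
	unsigned int ra_pages;
	unsigned int mmap_miss;
	loff_t prev_pos;
	struct maple_tree *need_mt;	/* kmalloc() + mt_init() + kfree() */
};

/* Suggested shape: embed it, keeping file_ra_state's memory contiguous. */
struct file_ra_state_embed_example {
	unsigned int ra_pages;
	unsigned int mmap_miss;
	loff_t prev_pos;
	struct maple_tree need_mt;	/* just mt_init(&ra->need_mt) at init */
};

The embedded form avoids a separate allocation, its error handling, and a
pointer chase on every lookup.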
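Lastly, the promised C analogue of mmap_load_test.java, for anyone who'd
rather not install a JDK to reproduce.  It's an untested sketch (filename
and details are mine); IIRC the JVM's MappedByteBuffer.load() amounts to
madvise(MADV_WILLNEED) on the mapping followed by touching each page, so
this just issues the same advice directly:

/* willneed_load.c: mmap a file, madvise(MADV_WILLNEED), touch every page. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	struct stat st;
	if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);

	/* The advice that commit 6d2be915e589 capped at 512 pages. */
	if (madvise(p, st.st_size, MADV_WILLNEED) < 0)
		perror("madvise");

	/* Touch each page; the remainder is read via normal readahead. */
	long pagesize = sysconf(_SC_PAGESIZE);
	volatile char sum = 0;
	for (off_t off = 0; off < st.st_size; off += pagesize)
		sum += p[off];

	clock_gettime(CLOCK_MONOTONIC, &t1);
	long ms = (t1.tv_sec - t0.tv_sec) * 1000 +
		  (t1.tv_nsec - t0.tv_nsec) / 1000000;
	printf("Done! Loading took %ld ms\n", ms);

	munmap(p, st.st_size);
	close(fd);
	return 0;
}

Run it against /mnt/test/2G_file after the same drop_caches step and
watch rareq-sz in iostat, same as with the Java reproducer.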