On Mon, Nov 11, 2024 at 6:33 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 08.11.24 15:17, Yafang Shao wrote:
> > When testing large folio support with XFS on our servers, we observed that
> > only a few large folios are mapped when reading large files via mmap.
> > After a thorough analysis, I identified it was caused by the
> > `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
> > parameter is set to 128KB. After I tune it to 2MB, the large folio can
> > work as expected. However, I believe the large folio behavior should not be
> > dependent on the value of read_ahead_kb. It would be more robust if the
> > kernel can automatically adapt to it.
>
> Now I am extremely confused.
>
> Documentation/ABI/stable/sysfs-block:
>
> "[RW] Maximum number of kilobytes to read-ahead for filesystems on this
> block device."
>
> So, with your patch, will we also be changing the readahead size to
> exceed that, or simply allocate larger folios and not exceeding the
> readahead size (e.g., leaving them partially non-filled)?

Exceeding the readahead size for the MADV_HUGEPAGE case is
straightforward; this is what the current patch accomplishes.

> If you're also changing the readahead behavior to exceed the
> configuration parameter it would sound to me like "I am pushing the
> brake pedal and my car brakes; fix the brakes to adopt whether to brake
> automatically" :)
>
> Likely I am missing something here, and how the read_ahead_kb parameter
> is used after your patch.

The read_ahead_kb parameter continues to function for non-MADV_HUGEPAGE
scenarios, whereas special handling is required for the MADV_HUGEPAGE
case.

It appears that we ought to update Documentation/ABI/stable/sysfs-block
to reflect the changes related to large folios, correct?
> > With /sys/block/*/queue/read_ahead_kb set to 128KB and performing a
> > sequential read on a 1GB file using MADV_HUGEPAGE, the differences in
> > /proc/meminfo are as follows:
> >
> > - before this patch
> > FileHugePages:     18432 kB
> > FilePmdMapped:      4096 kB
> >
> > - after this patch
> > FileHugePages:   1067008 kB
> > FilePmdMapped:   1048576 kB
> >
> > This shows that after applying the patch, the entire 1GB file is mapped to
> > huge pages. The stable list is CCed, as without this patch, large folios
> > don't function optimally in the readahead path.
> >
> > It's worth noting that if read_ahead_kb is set to a larger value that isn't
> > aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail to map
> > to hugepages.
> >
> > Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> > Suggested-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > Cc: stable@xxxxxxxxxxxxxxx
> >
> > ---
> >  mm/readahead.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > Changes:
> > v1->v2:
> > - Drop the align (Matthew)
> > - Improve commit log (Andrew)
> >
> > RFC->v1: https://lore.kernel.org/linux-mm/20241106092114.8408-1-laoar.shao@xxxxxxxxx/
> > - Simplify the code as suggested by Matthew
> >
> > RFC: https://lore.kernel.org/linux-mm/20241104143015.34684-1-laoar.shao@xxxxxxxxx/
> >
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 3dc6c7a128dd..9b8a48e736c6 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -385,6 +385,8 @@ static unsigned long get_next_ra_size(struct file_ra_state *ra,
> >  		return 4 * cur;
> >  	if (cur <= max / 2)
> >  		return 2 * cur;
> > +	if (cur > max)
> > +		return cur;
> >  	return max;
>
> Maybe something like
>
> 	return max_t(unsigned long, cur, max);
>
> might be more readable (likely "max()" cannot be used because of the
> local variable name "max" ...).
>
> ...
> but it's rather weird having a "max" and then returning something
> larger than the "max" ... especially with code like

Indeed, that could lead to confusion ;)

> "ra->size = get_next_ra_size(ra, max_pages);"
>
> Maybe we can improve that by renaming "max_pages" / "max" to what it
> actually is supposed to be (which I haven't quite understood yet).

Perhaps a more straightforward solution would be to implement it
directly at the call site, as demonstrated below?

diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..187efae95b02 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -642,7 +642,11 @@ void page_cache_async_ra(struct readahead_control *ractl,
 					  1UL << order);
 	if (index == expected) {
 		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max_pages);
+		/*
+		 * Allow the actual size to exceed the readahead window for a
+		 * large folio.
+		 */
+		ra->size = get_next_ra_size(ra, max(max_pages, ra->size));
 		ra->async_size = ra->size;
 		goto readit;
 	}

--
Regards
Yafang