On Tue, Nov 5, 2024 at 1:00 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Mon, Nov 04, 2024 at 10:30:15PM +0800, Yafang Shao wrote:
> > When testing large folio support with XFS on our servers, we observed that
> > only a few large folios are mapped when reading large files via mmap. This
> > behavior occurs because large folio support is currently implemented only
> > for sync readahead, not for async readahead. Consequently, while the
> > first filemap fault may map to a large folio, subsequent filemap faults are
> > mapped to regular folios. This can be verified with a simple test case, as
> > shown below:
>
> I swear I tested this and it worked at the time. Here's what's supposed
> to happen:
>
> You touch the first byte of the mapping, and there's no page in the page
> cache. So we go into the sync readahead path, and force it to read two
> PMDs. From your email, I assume this is succeeding. The readahead
> flag gets set on the second PMD so that when we get to 2MB, we go into
> the 'time to do more readahead' path, ie the 'do_async_mmap_readahead'
> function you patch below.
>
> Now, page_cache_async_ra() does this:
>
>         unsigned int order = folio_order(folio);
> ...
>         if (index == expected) {
>                 ra->start += ra->size;
>                 ra->size = get_next_ra_size(ra, max_pages);
>                 ra->async_size = ra->size;
>                 goto readit;
>         }
> readit:
>         ractl->_index = ra->start;
>         page_cache_ra_order(ractl, ra, order);
>
> So it should already be doing readahead of PMD size. Can you dig into
> what's going wrong, now that you understand that this is supposed to
> work already?

Upon further analysis, it appears that the behavior depends on the
/sys/block/*/queue/read_ahead_kb setting. On my test server, this
parameter is set to 128KB. For the current approach to work without
modification, it would need to be 2MB-aligned. However, I believe
MADV_HUGEPAGE behavior should not depend on the value of
read_ahead_kb.
It would be more robust if the kernel automatically aligned it to 2MB
when MADV_HUGEPAGE is enabled. The following change ensures
compatibility with non-2MB-aligned read_ahead_kb values:

diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..c024245497cc 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -641,9 +641,9 @@ void page_cache_async_ra(struct readahead_control *ractl,
 	expected = round_down(ra->start + ra->size - ra->async_size,
 			1UL << order);
 	if (index == expected) {
-		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max_pages);
-		ra->async_size = ra->size;
+		ra->start += ALIGN(ra->size, HPAGE_PMD_NR);
+		ra->size = ALIGN(get_next_ra_size(ra, max_pages), HPAGE_PMD_NR);
+		ra->async_size = ALIGN(ra->size, HPAGE_PMD_NR);
 		goto readit;
 	}

In addition, adjustments may be needed in the sync readahead path to
allow MADV_HUGEPAGE readahead sizes to be tuned based on
read_ahead_kb. I will continue to analyze this and submit a revised
version accordingly.

--
Regards
Yafang