On Thu, Mar 07, 2024 at 10:23:08AM +0100, Jan Kara wrote:
> Thanks for testing! This is an interesting result and certainly unexpected
> for me. The readahead code allocates naturally aligned pages so based on
> the distribution of allocations it seems that before commit ab4443fe3ca6
> readahead window was at least 32 pages (128KB) aligned and so we allocated
> order 5 pages. After the commit, the readahead window somehow ended up only
> aligned to 20 modulo 32. To follow natural alignment and fill 128KB
> readahead window we allocated order 2 page (got us to offset 24 modulo 32),
> then order 3 page (got us to offset 0 modulo 32), order 4 page (larger
> would not fit in 128KB readahead window now), and order 2 page to finish
> filling the readahead window.
>
> Now I'm not 100% sure why the readahead window alignment changed with
> different rounding when placing readahead mark - probably that's some
> artifact when readahead window is tiny in the beginning before we scale it
> up (I'll verify by tracing whether everything ends up looking correctly
> with the current code). So I don't expect this is a problem in ab4443fe3ca6
> as such but it exposes the issue that readahead page insertion code should
> perhaps strive to achieve better readahead window alignment with logical
> file offset even at the cost of occasionally performing somewhat shorter
> readahead. I'll look into this once I dig out of the huge heap of email
> after vacation...

I was surprised by what you said here, so I went and re-read the code
and it doesn't work the way I thought it did. So I had a good long
think about how it _should_ work, and I looked for some more corner
conditions, and this is what I came up with.

The first thing I've done is separate out the two limits. The EOF is a
hard limit; we will not allocate pages beyond EOF. The ra->size is a
soft limit; we will allocate pages beyond ra->size, but not too far.

The second thing I noticed is that index + ra_size could wrap. So add
a check for that, and set it to ULONG_MAX. index + ra_size - async_size
could also wrap, but this is harmless. We certainly don't want to kick
off any more readahead in this circumstance, so leaving 'mark' outside
the range [index..ULONG_MAX] is just fine.

The third thing is that we could allocate a folio which contains a page
at ULONG_MAX. We don't really want that in the page cache; it makes
filesystems more complicated if they have to check for that, and we
don't allow an order-0 folio at ULONG_MAX, so there's no need for it.
This _should_ already be prohibited by the "Don't allocate pages past
EOF" check, but let's explicitly prohibit it.

Compile tested only.
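For reference, the allocation pattern you describe falls straight out
of the natural-alignment rule. Here is a tiny userspace sketch of that
arithmetic (my own toy, not kernel code; pick_order() and the window
numbers are made up for illustration). Starting a 32-page window at
offset 20 modulo 32, it picks orders 2, 3, 4, 2, matching the pattern
you saw:

#include <stdio.h>

/*
 * Toy model of the order selection: folios are naturally aligned, so
 * the order is capped by the low bits of the index and by the space
 * left before the end of the window.
 */
static unsigned int pick_order(unsigned long index, unsigned long end,
                               unsigned int max_order)
{
        unsigned int order = max_order;

        if (index & ((1UL << order) - 1))
                order = __builtin_ctzl(index);  /* like __ffs() */
        while (index + (1UL << order) > end)
                order--;
        return order;
}

int main(void)
{
        unsigned long index = 20, end = 20 + 32;  /* window at 20 mod 32 */

        while (index < end) {
                unsigned int order = pick_order(index, end, 5);

                printf("index %lu: order %u\n", index, order);
                index += 1UL << order;
        }
        return 0;
}

With the same window aligned to 0 modulo 32 the loop picks a single
order-5 folio, which matches the order-5 allocations you describe from
before ab4443fe3ca6.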
diff --git a/mm/readahead.c b/mm/readahead.c
index 130c0e7df99f..742e1f39035b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -488,7 +488,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
 {
         struct address_space *mapping = ractl->mapping;
         pgoff_t index = readahead_index(ractl);
-        pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+        pgoff_t last = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+        pgoff_t limit = index + ra->size;
         pgoff_t mark = index + ra->size - ra->async_size;
         int err = 0;
         gfp_t gfp = readahead_gfp_mask(mapping);
@@ -496,23 +497,26 @@ void page_cache_ra_order(struct readahead_control *ractl,
         if (!mapping_large_folio_support(mapping) || ra->size < 4)
                 goto fallback;
 
-        limit = min(limit, index + ra->size - 1);
-
         if (new_order < MAX_PAGECACHE_ORDER) {
                 new_order += 2;
                 new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
                 new_order = min_t(unsigned int, new_order, ilog2(ra->size));
         }
 
+        if (limit < index)
+                limit = ULONG_MAX;
         filemap_invalidate_lock_shared(mapping);
-        while (index <= limit) {
+        while (index < limit) {
                 unsigned int order = new_order;
 
                 /* Align with smaller pages if needed */
                 if (index & ((1UL << order) - 1))
                         order = __ffs(index);
 
+                /* Avoid wrap */
+                if (index + (1UL << order) == 0)
+                        order--;
                 /* Don't allocate pages past EOF */
-                while (index + (1UL << order) - 1 > limit)
+                while (index + (1UL << order) - 1 > last)
                         order--;
                 err = ra_alloc_folio(ractl, index, mark, order, gfp);
                 if (err)
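For illustration only, here is a userspace sketch of just the patched
loop's index arithmetic (no page cache, 'last' pinned to ULONG_MAX,
start index picked arbitrarily). With a window that runs off the top of
the offset space, the clamp to ULONG_MAX keeps the loop bounded and the
avoid-wrap check means no folio ever covers the page at ULONG_MAX:

#include <stdio.h>
#include <limits.h>

/*
 * Model of the patched loop's index arithmetic only: no page cache,
 * no real EOF (last == ULONG_MAX), start index picked arbitrarily.
 */
int main(void)
{
        unsigned long index = ULONG_MAX - 40;   /* window runs off the top */
        unsigned long limit = index + 64;       /* index + ra->size, wraps */
        unsigned long last = ULONG_MAX;
        unsigned int new_order = 5;

        if (limit < index)
                limit = ULONG_MAX;

        while (index < limit) {
                unsigned int order = new_order;

                if (index & ((1UL << order) - 1))
                        order = __builtin_ctzl(index);  /* like __ffs() */
                if (index + (1UL << order) == 0)        /* avoid wrap */
                        order--;
                while (index + (1UL << order) - 1 > last)
                        order--;

                printf("folio at %#lx, order %u\n", index, order);
                index += 1UL << order;
        }
        return 0;
}

The last folio it picks ends at ULONG_MAX - 1 and the loop stops once
index reaches ULONG_MAX, so the page at ULONG_MAX is never touched.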