Re: Issue with 8K folio size in __filemap_get_folio()

Hugh Dickins <hughd@xxxxxxxxxx> · Sun, 3 Dec 2023 21:57:38 -0800 (PST)

On Sun, 3 Dec 2023, Matthew Wilcox wrote:
> On Sun, Dec 03, 2023 at 09:27:57PM +0000, Matthew Wilcox wrote:
> > I was talking with Darrick on Friday and he convinced me that this is
> > something we're going to need to fix sooner rather than later for the
> > benefit of devices with block size 8kB.  So it's definitely on my todo
> > list, but I haven't investigated in any detail yet.
> 
> OK, here's my initial analysis of just not putting order-1 folios
> on the deferred split list.  folio->_deferred_list is only used in
> mm/huge_memory.c, which makes this a nice simple analysis.
> 
>  - folio_prep_large_rmappable() initialises the list_head.  No problem,
>    just don't do that for order-1 folios.
>  - split_huge_page_to_list() will remove the folio from the split queue.
>    No problem, just don't do that.
>  - folio_undo_large_rmappable() removes it from the list if it's
>    on the list.  Again, no problem, don't do that for order-1 folios.
>  - deferred_split_scan() walks the list, it won't find any order-1
>    folios.
> 
>  - deferred_split_folio() will add the folio to the list.  Returning
>    here will avoid adding the folio to the list.  But what consequences
>    will that have?  Ah.  There's only one caller of
>    deferred_split_folio() and it's in page_remove_rmap() ... and it's
>    only called for anon folios anyway.

Yes, the deferred split business is quite nicely localized,
which makes it easy to avoid.

> 
> So it looks like we can support order-1 folios in the page cache without
> any change in behaviour since file-backed folios were never added to
> the deferred split list.

Yes, you just have to code in a few "don't do that"s as above.

And I think it would be reasonable to allow even anon order-1 folios,
but they'd just be prevented from participating in the deferred split
business.

Allowing a PMD to be split at any time is necessary for correctness;
then allowing the underlying folio to be split is a nice-to-have when
trying to meet competition for memory, but it's not a correctness issue,
and the smaller the folio to be split, the less the saving - a lot of
unneeded order-1s could still waste a lot of memory, but it's not so
serious as with the order-9s.

> 
> Now, is this the right direction?  Is it a bug that we never called
> deferred_split_folio() for pagecache folios?  I would defer to Hugh
> or Kirill on this.  Ccs added.

No, not a bug.

The thing with anon folios is that they only have value while mapped
(let's ignore swap and GUP for simplicity), so when page_remove_rmap()
is called in the unmapping, that's a good hint that the hugeness of
the folio is no longer worthwhile; but it's a bad moment for splitting
because of locks held, and quite possibly a stupid time for splitting
too (because just a part of unmapping a large range, when splitting
would merely slow it all down).  Hence anon's deferred split queue.

But for file pages, the file contents must retain their value whether
mapped or not.  The moment to consider splitting the folio itself is
when truncating or punching a hole; and it happens that there is not
a locking problem then, nor any overhead added to eviction.  Hence no
deferred split queue for file.

Shmem does have its per-superblock shmem_unused_huge_shrink(), for
splitting huge folios at EOF when under memory pressure (and it would
be better if it were not restricted to EOF, but could retry folios
which had been EBUSY or EAGAIN at the time of hole-punch).  Maybe
other large folio filesystems should also have such a shrinker.

Hugh