Re: Forcing vmscan to drop more (related) pages?

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Wed, 31 Jul 2024 06:48:57 +0930

On 2024/7/31 01:24, Matthew Wilcox wrote:
On Tue, Jul 30, 2024 at 03:35:31PM +0930, Qu Wenruo wrote:
Hi,

With the recent btrfs attempt to use larger folios (for its metadata), I
am hitting a case like this:

- Btrfs allocated an order 2 folio for metadata X

- Btrfs tries to add the order 2 folio at filepos X
   Then filemap_add_folio() returns -EEXIST for filepos X.

- Btrfs tries to grab the existing metadata
   Then filemap_lock_folio() returns -ENOENT for filepos X.

The above case can have two causes:

a) The folio at filepos X is released between add and lock
    This is pretty rare, but still possible

b) Some folios exist at range [X+4K, X+16K)
    In my observation, this is way more common than case a).

Case b) can be caused by the following situation:

- There is an extent buffer at filepos X
   And it consists of 4 order-0 folios.

- vmscan wants to free folio at filepos X
   It calls into the btrfs callback, btree_release_folio().
   And btrfs did all the checks and released the metadata.

   Now all the 4 folios at file pos [X, X+16K) have their private
   flags cleared.

- vmscan freed folio at filepos X
   However the remaining 3 folios X+4K, X+8K, X+12K are still attached
   to the filemap, and in theory we should free all 4 folios in one go.

   Those leftover folios later conflict with the larger folio we want
   to insert.
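
To make the sequence above concrete, here is a minimal userspace sketch.
It is a toy model, not kernel code: struct toy_cache and the toy_*()
helpers are hypothetical names that only imitate the return values
described above, assuming one slot per 4K index.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 16	/* the toy cache covers 16 page-sized indexes */

/* Toy stand-in for a filemap: slot[i] is true when index i is occupied. */
struct toy_cache {
	bool slot[TOY_SLOTS];
};

/* Insert a folio of the given order; fails if any covered slot is in use. */
static int toy_add_folio(struct toy_cache *c, size_t index, unsigned int order)
{
	size_t nr = (size_t)1 << order;

	/* The index must be aligned to the folio size and fit in the cache. */
	if (index % nr || index + nr > TOY_SLOTS)
		return -EINVAL;
	for (size_t i = index; i < index + nr; i++)
		if (c->slot[i])
			return -EEXIST;
	for (size_t i = index; i < index + nr; i++)
		c->slot[i] = true;
	return 0;
}

/* Look up exactly one index, as when grabbing the existing folio. */
static int toy_lock_folio(const struct toy_cache *c, size_t index)
{
	if (index >= TOY_SLOTS)
		return -EINVAL;
	return c->slot[index] ? 0 : -ENOENT;
}

/* Drop a single order-0 entry, as when only one folio gets reclaimed. */
static void toy_remove_folio(struct toy_cache *c, size_t index)
{
	if (index < TOY_SLOTS)
		c->slot[index] = false;
}
```

With four order-0 entries at indexes 4..7, dropping only index 4 leaves
the toy cache in the state of case b): toy_add_folio(&c, 4, 2) returns
-EEXIST because 5..7 are still present, while toy_lock_folio(&c, 4)
returns -ENOENT.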

I'm wondering if there is any way to make sure we can release all
involved folios in one go?
I guess it will need a new callback, and return a list of folios to be
released?

I feel like we're missing a few pieces of this puzzle:

  - Why did btrfs decide to create four order-0 folios in the first
    place?

Maybe the larger folio allocation failed (we go with __GFP_NORETRY |
__GFP_NOWARN for larger folio allocation), so it fell back to order 0
directly.

  - Why isn't there an EEXIST fallback from order-2 to order-1 to order-0
    folios?

Mostly related to the cross folio handling.

We have existing code to handle multiple order 0 folios, but that's all.
For one single order 2 folio, it's also pretty easy to handle as it
covers the full metadata range.

If we were to support other orders, we would need to handle mixed orders
as well, which doesn't bring much benefit.

So here we only support order 0, or order 2 (for 16K nodesize).
And that's why we're not using __filemap_get_folio() with FGP_CREATE to
allocate the filemap folios.

Maybe it's better to use a bitmap for allowed orders for FGP_CREATE instead?
As for certain future use cases (e.g. fs supporting blocksize larger
than page size), we will require a minimal folio size anyway and falling
below that is not acceptable.
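
A rough sketch of what I mean by an allowed-orders bitmap, in plain
userspace C. Nothing like this exists for FGP_CREATE today as far as I
know; toy_next_allowed_order() and the bit layout (bit N set means
order N is acceptable) are assumptions made up for illustration.

```c
#include <limits.h>

/*
 * Hypothetical helper: return the highest order strictly below 'below'
 * whose bit is set in 'allowed', or -1 if no such order is allowed.
 */
static int toy_next_allowed_order(unsigned int allowed, int below)
{
	const int nbits = (int)(sizeof(allowed) * CHAR_BIT);

	for (int order = below - 1; order >= 0; order--) {
		/* Skip orders that do not fit in the bitmap at all. */
		if (order >= nbits)
			continue;
		if (allowed & (1u << order))
			return order;
	}
	return -1;
}
```

For the 16K nodesize case the mask would be (1u << 2) | (1u << 0), so a
failed order-2 attempt falls straight to order 0 and order 1 is never
tried. A filesystem with a minimal folio size would leave the low bits
clear, and the helper returns -1 instead of going below that minimum.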

But there's no need for a new API.  You can remove folios from the page
cache whenever you like.  See delete_from_page_cache_batch() as an
example.

So you mean to manually truncate the other pages, inside the
release_folio() callback?

That sounds feasible; let me experiment with that solution.
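
Roughly the idea, as a userspace toy only: drop every entry covering the
extent buffer range at once instead of just the one folio vmscan asked
for. struct toy_cache and toy_remove_range() are hypothetical names; this
does not model delete_from_page_cache_batch() or any locking.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 16	/* the toy cache covers 16 page-sized indexes */

/* Toy stand-in for a filemap: slot[i] is true when index i is occupied. */
struct toy_cache {
	bool slot[TOY_SLOTS];
};

/*
 * Drop every entry in [start, start + nr) and return how many entries
 * were actually present. The range is clamped to the cache size.
 */
static size_t toy_remove_range(struct toy_cache *c, size_t start, size_t nr)
{
	size_t removed = 0;

	if (start >= TOY_SLOTS)
		return 0;
	if (nr > TOY_SLOTS - start)
		nr = TOY_SLOTS - start;
	for (size_t i = start; i < start + nr; i++) {
		if (c->slot[i]) {
			c->slot[i] = false;
			removed++;
		}
	}
	return removed;
}
```

After toy_remove_range(&c, 4, 4) nothing is left in [4, 8), so a later
insert of a single order-2 entry at index 4 would no longer find stale
order-0 entries in its range.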

Thanks,
Qu