As I mentioned here [1], I have Thoughts on how the PCP allocator works in a memdesc world. Unlike my earlier Thoughts on the buddy allocator [2], we can actually make progress towards this one (and see substantial performance improvement, I believe). So it's ripe for someone to pick up.

== With memdescs ==

When we have memdescs, allocating a folio from the buddy is a two-step process. First we allocate the struct folio from slab, then we ask the buddy allocator for 2^n pages, each of which gets its memdesc set to point to this folio. It'll be similar for other memory descriptors, but let's keep it simple and just talk about folios for now.

Usually when we free folios, it's due to memory pressure (yes, we'll free memory due to truncating a file or processes exiting and freeing their anonymous memory, but that's secondary). That means we're likely to want to allocate a folio again soon. Given that, returning the struct folio to the slab allocator seems like a waste of time. The PCP allocator can hold onto the struct folio as well as the underlying memory and just hand both back to the next caller of folio_alloc(). This also saves us from having to invent a 'struct pcpdesc' and swap the memdesc pointer from the folio to the pcpdesc.

This implies that we no longer have a single PCP allocator for all types of memory; rather we have one for each memdesc type. I think that's going to be OK, but it might introduce some problems.

== Before memdescs ==

Today we take all comers on the PCP list. __free_pages() calls free_the_page() calls free_unref_page() calls free_unref_page_prepare() calls free_pages_prepare(), which undoes all the PageCompound work. Most multi-page allocations are compound: slab, file, anon; it's all compound. I propose that we _only_ keep compound memory on the PCP list. Freeing non-compound multi-page memory can either convert it into compound pages before it is placed on the PCP list, or just hand the memory back to the buddy allocator. Non-compound multi-page allocations can either go straight to buddy, or grab from the PCP list and undo the compound nature of the pages.

I think this could be a huge saving. Consider allocating an order-9 PMD-sized THP. Today we initialise compound_head in each of the 511 tail pages. Since struct page is 64 bytes, we touch 32kB of memory! That's 2/3 of my CPU's L1 D$, so it's just pushed out a good chunk of my working set. And it's all dirty, so it has to get written back.

We still need to distinguish folios specifically (which need the folio_prep_large_rmappable() call on allocation and folio_undo_large_rmappable() on free) from other compound allocations which do not need or want this, but that's touching one or two extra cachelines, not 511.

Do we have a volunteer?

[1] https://lore.kernel.org/linux-mm/Za2lS-jG1s-HCqbx@xxxxxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/ZamnIGxD8_dOJVi6@xxxxxxxxxxxxxxxxxxxx/
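
To make the two-step allocation described under "With memdescs" a bit more concrete, here's a very rough sketch. None of this exists today: folio_cache, page_set_memdesc() and folio_attach_memory() are names I've invented purely for illustration.

struct folio *folio_alloc_memdesc(gfp_t gfp, unsigned int order)
{
	struct folio *folio;
	struct page *page;
	unsigned long i;

	/* Step 1: the descriptor comes from slab */
	folio = kmem_cache_alloc(folio_cache, gfp);
	if (!folio)
		return NULL;

	/* Step 2: the memory comes from the buddy allocator */
	page = alloc_pages(gfp, order);
	if (!page) {
		kmem_cache_free(folio_cache, folio);
		return NULL;
	}

	/* Each page's memdesc points back at the folio */
	for (i = 0; i < (1UL << order); i++)
		page_set_memdesc(page + i, folio);	/* invented helper */

	folio_attach_memory(folio, page, order);	/* invented helper */
	return folio;
}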
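
And a sketch of the PCP side: a per-CPU, per-memdesc-type cache that keeps the struct folio and its 2^n pages together, so a free followed by an allocation touches neither slab nor buddy nor the tail pages. Again, folio_pcp, folio->pcp_list, __folio_free_slow() and __folio_alloc_slow() are invented names; locking (today's PCP uses a spinlock / local lock) and per-order lists are elided to keep it short.

struct folio_pcp {
	struct list_head folios;	/* freed folios, memory still attached */
	int count;
	int high;			/* when to start giving memory back */
};
static DEFINE_PER_CPU(struct folio_pcp, folio_pcp);

static void pcp_folio_free(struct folio *folio)
{
	struct folio_pcp *pcp = this_cpu_ptr(&folio_pcp);

	if (pcp->count >= pcp->high) {
		/* Cache full: pages back to buddy, folio back to slab */
		__folio_free_slow(folio);
		return;
	}
	folio_undo_large_rmappable(folio);	/* folio-specific teardown */
	list_add(&folio->pcp_list, &pcp->folios);
	pcp->count++;
}

static struct folio *pcp_folio_alloc(gfp_t gfp, unsigned int order)
{
	struct folio_pcp *pcp = this_cpu_ptr(&folio_pcp);
	struct folio *folio;

	/* Fast path: reuse a cached folio; no re-initialisation of tails */
	list_for_each_entry(folio, &pcp->folios, pcp_list) {
		if (folio_order(folio) != order)
			continue;
		list_del(&folio->pcp_list);
		pcp->count--;
		folio_prep_large_rmappable(folio);	/* folio-specific prep */
		return folio;
	}
	return __folio_alloc_slow(gfp, order);	/* slab + buddy, as above */
}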
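
Finally, the "Before memdescs" proposal as a sketch of the free path. free_unref_page(), prep_compound_page() and PageHead() are real; the structure is simplified and reuses the free_the_page() name from the call chain above, but it is obviously not that function's real body and not meant as the actual patch. The allocation side would mirror it: non-compound multi-page callers either go straight to buddy or take a compound page off the PCP list and undo the compound state there.

static void free_the_page(struct page *page, unsigned int order)
{
	if (order == 0 || PageHead(page)) {
		/*
		 * Compound (or single) pages keep compound_head intact in
		 * the tails; in this proposal free_pages_prepare() would no
		 * longer undo the PageCompound work on the way to the PCP
		 * list.
		 */
		free_unref_page(page, order);
		return;
	}

	/*
	 * Non-compound multi-page free is the rare case.  Option 1:
	 * convert it to compound so it can sit on the PCP list like
	 * everything else.
	 */
	prep_compound_page(page, order);
	free_unref_page(page, order);

	/*
	 * Option 2 would be to skip the PCP list entirely and hand the
	 * memory straight back to the buddy allocator.
	 */
}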