was: (Project: Improving the PCP allocator)

On Mon, Jan 22, 2024 at 04:01:34PM +0000, Matthew Wilcox wrote:
> As I mentioned here [1], I have Thoughts on how the PCP allocator
> works in a memdesc world. Unlike my earlier Thoughts on the buddy
> allocator [2], we can actually make progress towards this one (and
> see substantial performance improvement, I believe). So it's ripe for
> someone to pick up.
>
> == With memdescs ==
>
> When we have memdescs, allocating a folio from the buddy is a two step
> process. First we allocate the struct folio from slab, then we ask the
> buddy allocator for 2^n pages, each of which gets its memdesc set to
> point to this folio. It'll be similar for other memory descriptors,
> but let's keep it simple and just talk about folios for now.
>
> Usually when we free folios, it's due to memory pressure (yes, we'll free
> memory due to truncating a file or processes exiting and freeing their
> anonymous memory, but that's secondary). That means we're likely to want
> to allocate a folio again soon. Given that, returning the struct folio
> to the slab allocator seems like a waste of time. The PCP allocator
> can hold onto the struct folio as well as the underlying memory and
> then just hand it back to the next caller of folio_alloc. This also
> saves us from having to invent a 'struct pcpdesc' and swap the memdesc
> pointer from the folio to the pcpdesc.
>
> This implies that we no longer have a single pcp allocator for all types
> of memory; rather we have one for each memdesc type. I think that's
> going to be OK, but it might introduce some problems.
>
> == Before memdescs ==
>
> Today we take all comers on the PCP list. __free_pages() calls
> free_the_page() calls free_unref_page() calls free_unref_page_prepare()
> calls free_pages_prepare() which undoes all the PageCompound work.
>
> Most multi-page allocations are compound. Slab, file, anon; it's all
> compound. I propose that we _only_ keep compound memory on the PCP list.
> Freeing non-compound multi-page memory can either convert it into compound
> pages before being placed on the PCP list or just hand the memory back
> to the buddy allocator. Non-compound multi-page allocations can either
> go straight to buddy or grab from the PCP list and undo the compound
> nature of the pages.
>
> I think this could be a huge saving. Consider allocating an order-9 PMD
> sized THP. Today we initialise compound_head in each of the 511 tail
> pages. Since a page is 64 bytes, we touch 32kB of memory! That's 2/3 of
> my CPU's L1 D$, so it's just pushed out a good chunk of my working set.
> And it's all dirty, so it has to get written back.

I've been exploring the idea of keeping compound pages as compound pages
while they reside in PCP lists. My experimental results so far suggest
that reducing the overhead of destroying and preparing compound pages
does not have a substantial impact on pft and kernbench. For the record,
let me share what I did and the results I obtained.

# Implementation

My toy implementation is here:
https://gitlab.com/hyeyoo/linux/-/tree/pcp-keep-compound

In this implementation, I modified free_pages_prepare() to skip
destroying compound pages when they are being freed to the PCP. If the
kernel fails to acquire the PCP lock, it destroys the compound page and
frees it to the buddy instead. Because order > 0 pages in the PCP are
compound pages, rmqueue() does not prepare compound pages that are
already compound. rmqueue_bulk() prepares compound pages when grabbing
pages from the buddy, and free_pcppages_bulk() destroys compound pages
before handing them back to the buddy. Non-compound multi-page
allocations simply bypass the PCP.
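To make the intended flow concrete, here is a minimal userspace sketch
of the idea (not the actual patch; names like toy_alloc(), toy_free()
and toy_flush() are made up). A block only pays the compound
prep/teardown cost when it crosses the PCP <-> buddy boundary, not on
every free/alloc cycle that stays within the PCP:

/*
 * Toy model of "keep pages compound while they sit on the PCP list".
 * This is NOT kernel code; it only illustrates where the prep/teardown
 * work happens in the patched scheme.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_block {
	bool compound;			/* "compound metadata" already prepared? */
	struct toy_block *next;
};

/* Stand-in for one CPU's PCP free list (single-threaded toy). */
static struct toy_block *pcp_list;

static void toy_prep_compound(struct toy_block *b)
{
	/* Kernel analogue: prep_compound_page() initialising tail pages. */
	b->compound = true;
}

static void toy_destroy_compound(struct toy_block *b)
{
	/* Kernel analogue: undoing the compound state before buddy freeing. */
	b->compound = false;
}

/* Allocate: prefer the PCP list, whose blocks are already compound. */
static struct toy_block *toy_alloc(void)
{
	struct toy_block *b = pcp_list;

	if (b) {
		pcp_list = b->next;		/* PCP hit: no re-preparation */
		return b;
	}
	b = malloc(sizeof(*b));			/* "buddy" path */
	if (b)
		toy_prep_compound(b);		/* patched rmqueue_bulk() preps here */
	return b;
}

/* Free: keep the block compound and park it on the PCP list. */
static void toy_free(struct toy_block *b)
{
	b->next = pcp_list;
	pcp_list = b;
}

/* Flush the PCP back to the "buddy": only now is the compound state undone. */
static void toy_flush(void)
{
	while (pcp_list) {
		struct toy_block *b = pcp_list;

		pcp_list = b->next;
		toy_destroy_compound(b);	/* patched free_pcppages_bulk() */
		free(b);
	}
}

int main(void)
{
	struct toy_block *b = toy_alloc();	/* prepped on the buddy path */

	toy_free(b);				/* stays compound on the PCP */
	b = toy_alloc();			/* reused without re-preparation */
	printf("still compound: %d\n", b->compound);
	toy_free(b);
	toy_flush();
	return 0;
}

In the kernel case that prep/teardown is exactly the work that touches
all the tail pages (the ~32kB of struct pages for an order-9 THP
mentioned above).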
# Evaluation

My experiments were done on a Ryzen 5900X (12 cores, 24 threads) box
with 128GiB of DDR4-3200 memory. In the summaries below, '2MB-32kB-16kB'
means the kernel first tries to allocate 2MB and then falls back to
32kB/16kB. In each table the first column is the baseline kernel and the
second column is the patched kernel, with the relative change in
parentheses.

[PFT, 2MB-32kB-16kB]

# timings
Amean     system-1          1.51 (   0.00%)       1.50 (   0.20%)
Amean     system-3          4.20 (   0.00%)       4.19 (   0.39%)
Amean     system-5          7.23 (   0.00%)       7.20 (   0.48%)
Amean     system-7          8.91 (   0.00%)       8.97 (  -0.67%)
Amean     system-12        17.80 (   0.00%)      17.75 (   0.25%)
Amean     system-18        26.26 (   0.00%)      26.14 (   0.47%)
Amean     system-24        34.88 (   0.00%)      34.97 (  -0.28%)
Amean     elapsed-1         1.55 (   0.00%)       1.55 (  -0.00%)
Amean     elapsed-3         1.42 (   0.00%)       1.42 (   0.23%)
Amean     elapsed-5         1.48 (   0.00%)       1.48 (   0.11%)
Amean     elapsed-7         1.47 (   0.00%)       1.47 (  -0.11%)
Amean     elapsed-12        1.50 (   0.00%)       1.49 (   0.18%)
Amean     elapsed-18        1.53 (   0.00%)       1.53 (   0.15%)
Amean     elapsed-24        1.56 (   0.00%)       1.56 (   0.04%)

# faults
Hmean     faults/cpu-1    4268815.8976 (   0.00%)  4270575.7783 (   0.04%)
Hmean     faults/cpu-3    1547723.5433 (   0.00%)  1551782.9058 (   0.26%)
Hmean     faults/cpu-5     895050.1683 (   0.00%)   894949.2879 (  -0.01%)
Hmean     faults/cpu-7     726379.5638 (   0.00%)   719439.1475 (  -0.96%)
Hmean     faults/cpu-12    368548.1793 (   0.00%)   369313.2284 (   0.21%)
Hmean     faults/cpu-18    246883.6901 (   0.00%)   248085.6269 (   0.49%)
Hmean     faults/cpu-24    182289.8542 (   0.00%)   182349.3383 (   0.03%)
Hmean     faults/sec-1    4263408.8848 (   0.00%)  4266020.8111 (   0.06%)
Hmean     faults/sec-3    4631463.5864 (   0.00%)  4642967.3237 (   0.25%)
Hmean     faults/sec-5    4440309.3622 (   0.00%)  4445610.5062 (   0.12%)
Hmean     faults/sec-7    4482600.6360 (   0.00%)  4477462.5227 (  -0.11%)
Hmean     faults/sec-12   4402147.6818 (   0.00%)  4410717.5042 (   0.19%)
Hmean     faults/sec-18   4301172.0933 (   0.00%)  4308668.3804 (   0.17%)
Hmean     faults/sec-24   4234656.1409 (   0.00%)  4237997.7612 (   0.08%)

[PFT, 32kB-16kB]

# timings
Amean     system-1          2.34 (   0.00%)       2.38 (  -1.93%)
Amean     system-3          4.14 (   0.00%)       4.15 (  -0.10%)
Amean     system-5          7.11 (   0.00%)       7.10 (   0.16%)
Amean     system-7          9.32 (   0.00%)       9.28 (   0.42%)
Amean     system-12        17.92 (   0.00%)      17.96 (  -0.22%)
Amean     system-18        25.86 (   0.00%)      25.62 (   0.93%)
Amean     system-24        35.67 (   0.00%)      35.66 (   0.03%)
Amean     elapsed-1         2.47 (   0.00%)       2.51 *  -1.95%*
Amean     elapsed-3         1.42 (   0.00%)       1.42 (  -0.07%)
Amean     elapsed-5         1.46 (   0.00%)       1.45 (   0.32%)
Amean     elapsed-7         1.47 (   0.00%)       1.46 (   0.30%)
Amean     elapsed-12        1.51 (   0.00%)       1.51 (  -0.31%)
Amean     elapsed-18        1.53 (   0.00%)       1.52 (   0.22%)
Amean     elapsed-24        1.52 (   0.00%)       1.52 (   0.09%)

# faults
Hmean     faults/cpu-1    2674791.9210 (   0.00%)  2622289.9694 *  -1.96%*
Hmean     faults/cpu-3    1548744.0561 (   0.00%)  1547049.2525 (  -0.11%)
Hmean     faults/cpu-5     909932.0813 (   0.00%)   911274.4409 (   0.15%)
Hmean     faults/cpu-7     697042.1240 (   0.00%)   700115.5680 (   0.44%)
Hmean     faults/cpu-12    365314.7268 (   0.00%)   364452.9596 (  -0.24%)
Hmean     faults/cpu-18    251711.1949 (   0.00%)   253133.5633 (   0.57%)
Hmean     faults/cpu-24    183212.0204 (   0.00%)   183350.4502 (   0.08%)
Hmean     faults/sec-1    2673435.8060 (   0.00%)  2621259.7412 *  -1.95%*
Hmean     faults/sec-3    4637478.9272 (   0.00%)  4634486.4260 (  -0.06%)
Hmean     faults/sec-5    4531908.1729 (   0.00%)  4541788.9414 (   0.22%)
Hmean     faults/sec-7    4482779.5699 (   0.00%)  4499528.9948 (   0.37%)
Hmean     faults/sec-12   4365201.3350 (   0.00%)  4353874.4404 (  -0.26%)
Hmean     faults/sec-18   4314991.0014 (   0.00%)  4321977.0563 (   0.16%)
Hmean     faults/sec-24   4344249.9657 (   0.00%)  4345263.1455 (   0.02%)
[PFT, 16kB]

# timings
Amean     system-1          3.13 (   0.00%)       3.22 *  -2.84%*
Amean     system-3          4.31 (   0.00%)       4.38 (  -1.70%)
Amean     system-5          6.93 (   0.00%)       6.95 (  -0.21%)
Amean     system-7          9.40 (   0.00%)       9.47 (  -0.66%)
Amean     system-12        17.72 (   0.00%)      17.75 (  -0.15%)
Amean     system-18        26.19 (   0.00%)      26.12 (   0.24%)
Amean     system-24        35.41 (   0.00%)      35.31 (   0.29%)
Amean     elapsed-1         3.34 (   0.00%)       3.43 *  -2.79%*
Amean     elapsed-3         1.50 (   0.00%)       1.52 *  -1.67%*
Amean     elapsed-5         1.44 (   0.00%)       1.44 (  -0.09%)
Amean     elapsed-7         1.46 (   0.00%)       1.46 (  -0.43%)
Amean     elapsed-12        1.50 (   0.00%)       1.50 (  -0.04%)
Amean     elapsed-18        1.50 (   0.00%)       1.49 (   0.18%)
Amean     elapsed-24        1.51 (   0.00%)       1.50 (   0.58%)

# faults
Hmean     faults/cpu-1    1975802.9271 (   0.00%)  1921547.2378 *  -2.75%*
Hmean     faults/cpu-3    1468850.1816 (   0.00%)  1444987.8129 *  -1.62%*
Hmean     faults/cpu-5     921763.7641 (   0.00%)   920183.2144 (  -0.17%)
Hmean     faults/cpu-7     686678.6801 (   0.00%)   681824.6507 (  -0.71%)
Hmean     faults/cpu-12    367972.3092 (   0.00%)   367485.9867 (  -0.13%)
Hmean     faults/cpu-18    248991.2074 (   0.00%)   249470.5429 (   0.19%)
Hmean     faults/cpu-24    184426.2136 (   0.00%)   184968.3682 (   0.29%)
Hmean     faults/sec-1    1975110.1317 (   0.00%)  1920789.4502 *  -2.75%*
Hmean     faults/sec-3    4400629.2802 (   0.00%)  4327578.7473 *  -1.66%*
Hmean     faults/sec-5    4594228.5709 (   0.00%)  4589647.3446 (  -0.10%)
Hmean     faults/sec-7    4524451.1451 (   0.00%)  4502414.9629 (  -0.49%)
Hmean     faults/sec-12   4391814.1854 (   0.00%)  4392709.6380 (   0.02%)
Hmean     faults/sec-18   4406851.3807 (   0.00%)  4409517.3005 (   0.06%)
Hmean     faults/sec-24   4376471.9465 (   0.00%)  4395938.4025 (   0.44%)

[kernbench, 2MB-32kB-16kB THP]

Amean     user-24       16236.12 (   0.00%)   16402.41 (  -1.02%)
Amean     syst-24        1289.28 (   0.00%)    1329.39 *  -3.11%*
Amean     elsp-24         763.75 (   0.00%)     775.03 (  -1.48%)

[kernbench, 32kB-16kB THP]

Amean     user-24       16334.94 (   0.00%)   16314.30 (   0.13%)
Amean     syst-24        1565.50 (   0.00%)    1609.74 *  -2.83%*
Amean     elsp-24         779.84 (   0.00%)     780.55 (  -0.09%)

[kernbench, 16kB]

Amean     user-24       16205.47 (   0.00%)   16282.73 *  -0.48%*
Amean     syst-24        1784.03 (   0.00%)    1837.13 *  -2.98%*
Amean     elsp-24         787.37 (   0.00%)     791.00 *  -0.46%*

I suspected that making zone->lock in free_pcppages_bulk() a little
finer-grained in my implementation had a negative impact, so I changed
it back to coarse-grained locking and ran the experiments again, but it
was still not a clear win on kernbench and pft.

So... yeah. The results for this idea are not that encouraging. Based on
the results, my theories are:

1) The overhead of preparing compound pages is too small to show a
   measurable difference (prep_compound_page() accounts for less than
   0.5% of the total profile on pft).

2) By the time the kernel prepares compound pages, a large chunk of the
   CPU cache has already been dirtied (e.g. when allocating a 2MB anon
   THP, the kernel touches the whole 2MB of memory), so dirtying 1.56%
   more cache lines (the ~32kB of struct pages relative to the 2MB
   allocation) does not have a huge impact on performance.

--
Cheers,
Harry