Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists

Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> · Mon, 15 Apr 2024 20:17:59 +0800

On 2024/4/15 18:52, David Hildenbrand wrote:
On 15.04.24 10:59, Kefeng Wang wrote:

On 2024/4/15 16:18, Barry Song wrote:
On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:

Both file pages and anonymous pages now support large folios, so high-order
pages other than PMD_ORDER will also be allocated frequently, which could
increase zone lock contention. Allowing high-order pages on the PCP lists
could reduce that contention, but as commit 44042b449872
("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
pointed out, it may not win in all scenarios. Add a new sysfs control to
enable or disable storing pages of specified high orders on the PCP lists.
Orders in the range (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on
the PCP lists by default.

This is precisely something Baolin and I have discussed and intended
to implement[1], but unfortunately, we haven't had the time to do so.

Indeed, same thing. Recently, we have been working on unixbench/lmbench
optimization. I tested multi-size THP for anonymous memory by hard-coding
PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]. It shows some improvement, but
not in all cases, and the results are not very stable, so I re-implemented
it so that orders can be enabled dynamically according to user requirements.

I'm wondering, though, if this is really a suitable candidate for a 
sysctl toggle. Can anybody really come up with an educated guess for 
these values?

I'm not sure sysctl is the right place for this, but anon mTHP is enabled
through sysctl as well. We could trace __alloc_pages() and collect per-order
statistics to decide which high orders to enable on PCP.
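
As a rough illustration of the per-order statistics idea, here is a minimal
sketch that builds an order histogram from text trace output. It assumes the
input comes from the kmem:mm_page_alloc tracepoint and that each event line
carries an "order=N" field; the helper names and the 5% threshold are
hypothetical and only for illustration, they are not part of the patch set.

```python
import re
from collections import Counter

# Assumption: each allocation event is printed as "... mm_page_alloc: ... order=N ...".
# Adjust the pattern if the trace output on your kernel looks different.
ORDER_RE = re.compile(r"\border=(\d+)\b")


def order_histogram(lines):
    """Count allocation events per order from an iterable of trace lines."""
    counts = Counter()
    for line in lines:
        # Skip other events such as mm_page_free or mm_page_alloc_zone_locked.
        if "mm_page_alloc:" not in line:
            continue
        match = ORDER_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def candidate_orders(counts, costly_order=3, pmd_order=9, min_share=0.05):
    """Return orders strictly between costly_order and pmd_order whose share
    of all allocation events is at least min_share (hypothetical threshold)."""
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(
        order
        for order, n in counts.items()
        if costly_order < order < pmd_order and n / total >= min_share
    )
```

Feeding the lines of a captured trace file to order_histogram() and passing
the result to candidate_orders() gives a first guess at which orders are hot
enough to be worth enabling; the threshold would have to be tuned per workload.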

Especially reading "Benchmarks Score shows a little improvement(0.28%)" 
and "it may not win in all the scenes", to me it mostly sounds like 
"minimal impact" -- so who cares?

Even though the lock contention is eliminated, the performance improvement
is very limited (it may even be just fluctuation). It is not a good test
case for showing an improvement; it only shows the zone lock issue. We need
to find a better test case, maybe some test on Android (heavy use of 64K, no
PMD THP), or maybe LKP could give some help?

I will try to find other test cases to show the benefit.

How much is the cost vs. benefit of just having one sane system 
configuration?

For arm64 with 4K pages, that is five more high orders (4~8) and five more
pcplists. For high orders, we assume most of the pages are movable, but that
may not be true, so enabling it by default may cause more fragmentation, see
5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations").
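
To make the "five more high orders (4~8)" arithmetic concrete, here is a
small sketch that lists the orders strictly between PAGE_ALLOC_COSTLY_ORDER
and PMD_ORDER for a given base page size. It assumes 8-byte page table
entries (as on arm64), so PMD_ORDER = PAGE_SHIFT - 3; the function names are
hypothetical.

```python
PAGE_ALLOC_COSTLY_ORDER = 3  # value used by the page allocator in mainline


def pmd_order(page_shift, pte_size_shift=3):
    """PMD_ORDER for a base page of 2**page_shift bytes.

    Assumption: one page-table page holds 2**(page_shift - pte_size_shift)
    entries, so a PMD entry maps that many base pages.
    """
    return page_shift - pte_size_shift


def extra_pcp_orders(page_shift):
    """Orders in the open range (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER)."""
    return list(range(PAGE_ALLOC_COSTLY_ORDER + 1, pmd_order(page_shift)))


if __name__ == "__main__":
    # 4K, 16K and 64K base pages
    for shift in (12, 14, 16):
        orders = extra_pcp_orders(shift)
        print(f"PAGE_SHIFT={shift}: PMD_ORDER={pmd_order(shift)}, "
              f"extra orders={orders} ({len(orders)} orders)")
```

With 4K pages this gives orders 4 to 8, matching the five orders mentioned
above; larger base page sizes would add more orders under the same assumption.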