Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64

On 09/04/2024 11:47, Ryan Roberts wrote:
> Thanks for the CC, Zi! I must admit I'm not great at following the list...
> 
> 
> On 08/04/2024 19:56, Zi Yan wrote:
>> On 8 Apr 2024, at 12:30, Matthew Wilcox wrote:
>>
>>> On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote:
>>>> On Mon, 1 Apr 2024, Jonathan Cameron wrote:
>>>>
>>>>> Sounds like useful data, but is it a suitable topic for LSF-MM?
>>>>> What open questions etc is it raising?
> 
> I'm happy to see others looking at mTHP, and would be very keen to be involved
> in any discussion. Unfortunately I won't be able to make it to LSFMM this year -
> my wife is expecting a baby the same week. I'll register for online, but even
> joining that is looking unlikely.

[...]

Hi Yang Shi,

I finally got around to watching the video of your presentation; thanks for
doing the work to benchmark this on your system.

I just wanted to raise a couple of points: first on your results, and second on
your conclusions...

Results
=======

As I'm sure you have seen, I've done some benchmarking with mTHP and contpte,
also on an Ampere Altra system. Although my system has 2 NUMA nodes (80 CPUs per
node), I've deliberately disabled one of the nodes to avoid noise from cross
socket IO. So the HW should look and behave approximately the same as yours.
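
If you want to reproduce a similar single-node setup, one option (not
necessarily exactly how I did it) is to bind the whole run to node 0, e.g. with
numactl, or to offline the other node's CPUs:

$ numactl --cpunodebind=0 --membind=0 make -s -j80 Image
# or, assuming node 1's CPUs are 80-159 (check /sys/devices/system/node/node1/cpulist):
$ for c in $(seq 80 159); do echo 0 > /sys/devices/system/cpu/cpu$c/online; done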

We have one overlapping benchmark - kernel compilation - and our results are not
a million miles apart. You can see my results for 4KPS at [1] (and you can take
16KPS and 64KPS results for reference from [2]).

page size   | Ryan   | Yang Shi
------------|--------|---------
16K (4KPS)  |  -6.1% |  -5%
16K (16KPS) |  -9.2% | -15%
64K (64KPS) | -11.4% | -16%

For 4KPS, my "mTHP + contpte" line is equivalent to what you have tested. I'm
seeing -6% vs your -5%. But the 16KPS and 64KPS results diverge more, and I'm
not sure why; perhaps you have an idea? From my side, I've run these benchmarks
many times with successive kernels and revised patches, and the numbers are
always similar for me. I repeat multiple times across multiple reboots, and I
also disable kaslr and (user) aslr to avoid any unwanted noise/skew.
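
(For reference, the standard knobs for that are nothing exotic: "nokaslr" on
the kernel command line, plus disabling userspace ASLR via procfs.)

# kernel ASLR: boot with "nokaslr" appended to the command line, then:
$ echo 0 > /proc/sys/kernel/randomize_va_space   # userspace ASLR off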

The actual test is essentially:

$ make defconfig && time make -s -j80 Image

I'd also be interested in how you are measuring memory. I've measured both peak
and mean memory (by putting the workload in a cgroup) and see almost double the
memory increase that you report for 16KPS. Our measurements for other configs match.
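
To make the comparison concrete, here is a minimal sketch of the cgroup v2
approach (assuming cgroup2 is mounted at /sys/fs/cgroup with the memory
controller enabled; "kbuild" is just an arbitrary group name, and memory.peak
needs a reasonably recent kernel):

$ mkdir /sys/fs/cgroup/kbuild
$ echo $$ > /sys/fs/cgroup/kbuild/cgroup.procs    # move this shell into the group
$ make defconfig && time make -s -j80 Image
$ cat /sys/fs/cgroup/kbuild/memory.peak           # peak usage, in bytes

Mean memory can be approximated by sampling memory.current periodically over
the run.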

But I also want to raise a more general point: we are not done with the
optimizations yet. contpte can also improve iTLB performance, but this requires
a change to the page cache so that text is stored in (at least) 64K folios. The
iTLB is typically under a lot of pressure, and this can help reduce it. This
change is not in mainline yet (and I still need to figure out how to make the
patch acceptable), but is worth another ~1.5% for the 4KPS case. I suspect this
will also move the needle on the other benchmarks you ran. See [3] - I'd
appreciate any thoughts you have on how to get something like this accepted.

[1] https://lore.kernel.org/all/20240215103205.2607016-1-ryan.roberts@xxxxxxx/
[2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@xxxxxxx/
[3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@xxxxxxx/

Conclusions
===========

I think people in the room already said most of what I want to say:
unfortunately, there is a trade-off between performance and memory consumption.
And it is not always practical to dole out the biggest THP we can allocate; lots
of partially used 2M chunks would lead to a lot of wasted memory. So we need a
way to let user space configure the kernel for their desired mTHP sizes.
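
For context, that is what the per-size sysfs controls are meant to provide.
Roughly, on a 4K base page kernel (paths from memory, so please double-check
against Documentation/admin-guide/mm/transhuge.rst):

$ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
$ echo never  > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
# the PMD-size entry can also be left as "inherit" to follow the existing
# top-level transparent_hugepage/enabled control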

In the long term, it would be great to support an "auto" mode, and the current
interfaces leave the door open to that. Perhaps your suggestion to start out
with 64K and collapse to higher orders is one tool that could take us in that
direction. But 64K is arm64-specific; AMD wants 32K, for example. So you still
need some mechanism to determine the preferred size (and the community wasn't
keen on having the arch tell us that).

It may actually turn out that we need a more complex interface to allow a (set
of) mTHP order(s) to be enabled for a specific VMA. We previously concluded that
if/when the time comes, process_madvise() should give us what we need. That
would allow better integration with user space.

Your suggestion about splitting higher orders to 64K at swap out is interesting;
that might help with some swap fragmentation issues we are currently grappling
with. But ultimately splitting a folio is expensive and we want to avoid that
cost as much as possible. I'd prefer to continue down the route that Chris Li is
taking us so that we can do a better job of allocating swap in the first place.

Thanks,
Ryan




