On Tue, Jun 25, 2024 at 4:12 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 09/04/2024 11:47, Ryan Roberts wrote:
> > Thanks for the CC, Zi! I must admit I'm not great at following the list...
> >
> >
> > On 08/04/2024 19:56, Zi Yan wrote:
> >> On 8 Apr 2024, at 12:30, Matthew Wilcox wrote:
> >>
> >>> On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote:
> >>>> On Mon, 1 Apr 2024, Jonathan Cameron wrote:
> >>>>
> >>>>> Sounds like useful data, but is it a suitable topic for LSF-MM?
> >>>>> What open questions etc is it raising?
> >
> > I'm happy to see others looking at mTHP, and would be very keen to be involved
> > in any discussion. Unfortunately I won't be able to make it to LSFMM this year -
> > my wife is expecting a baby the same week. I'll register for online, but even
> > joining that is looking unlikely.
>
> [...]
>
> Hi Yang Shi,
>
> I finally got around to watching the video of your presentation; Thanks for
> doing the work to benchmark this on your system.
>
> I just wanted to raise a couple of points, first on your results and secondly on
> your conclusions...

Thanks for following up. Sorry for the late reply; I just came back from a
2-week vacation and am still suffering from jet lag...

>
> Results
> =======
>
> As I'm sure you have seen, I've done some benchmarking with mTHP and contpte,
> also on an Ampere Altra system. Although my system has 2 NUMA nodes (80 CPUs per
> node), I've deliberately disabled one of the nodes to avoid noise from cross
> socket IO. So the HW should look and behave approximately the same as yours.

I used a 1-socket system, but with 128 cores per node. I used taskset to bind
the kernel build tasks to cores 10-89.

>
> We have one overlapping benchmark - kernel compilation - and our results are not
> a million miles apart. You can see my results for 4KPS at [1] (and you can take
> 16KPS and 64KPS results for reference from [2]).
>
> page size   | Ryan   | Yang Shi
> ------------|--------|---------
> 16K (4KPS)  | -6.1%  | -5%
> 16K (16KPS) | -9.2%  | -15%
> 64K (64KPS) | -11.4% | -16%
>
> For 4KPS, my "mTHP + contpte" line is equivalent to what you have tested. I'm
> seeing -6% vs your -5%. But the 16KPS and 64KPS results diverge more. I'm not
> sure why these results diverge so much, perhaps you have an idea? From my side,
> I've run these benchmarks many many times with successive kernels and revised
> patches etc, and the numbers are always similar for me. I repeat multiple times
> across multiple reboots and also disable kaslr and (user) aslr to avoid any
> unwanted noise/skew.
>
> The actual test is essentially:
>
> $ make defconfig && time make -s -j80 Image

I'm not sure whether the config makes a difference. I used the default Fedora
config, and I'm running my tests on Fedora 39 with gcc (GCC) 13.2.1 20230918. I
saw you were using Ubuntu 22.04; not sure whether that is a factor.

And in the discussion Matthew said he didn't see any number close to ours (I
can't remember exactly what he said, but that was the gist). I'm not sure which
number Matthew meant; perhaps he meant yours?

>
> I'd also be interested in how you are measuring memory. I've measured both peak
> and mean memory (by putting the workload in a cgroup) and see almost double the
> memory increase that you report for 16KPS.

Our measurements for other configs match. I also used memory.peak to measure
the memory consumption. I didn't try different configs. I just noticed that
more cores may incur more memory consumption; it is more noticeable with 64KPS.
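For reference, a minimal sketch of how the peak measurement can be scripted
with cgroup v2 (assuming cgroup v2 is mounted at /sys/fs/cgroup and the memory
controller is delegated to the child group; the cgroup name and core range are
only examples):

  $ sudo mkdir /sys/fs/cgroup/kbuild
  $ echo $$ | sudo tee /sys/fs/cgroup/kbuild/cgroup.procs  # move this shell (and its children) into the group
  $ taskset -c 10-89 make -s -j80 Image                    # bind the build to a fixed core range
  $ cat /sys/fs/cgroup/kbuild/memory.peak                  # peak memory usage of the group, in bytes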
>
> But I also want to raise a more general point; We are not done with the
> optimizations yet. contpte can also improve performance for iTLB, but this
> requires a change to the page cache to store text in (at least) 64K folios.
> Typically the iTLB is under a lot of pressure and this can help reduce it. This
> change is not in mainline yet (and I still need to figure out how to make the
> patch acceptable), but is worth another ~1.5% for the 4KPS case. I suspect this
> will also move the needle on the other benchmarks you ran. See [3] - I'd
> appreciate any thoughts you have on how to get something like this accepted.

AFAIK, the improvement from reduced iTLB pressure really depends on the
workload. IIRC, MySQL is more sensitive to it. We did some tests with
CONFIG_READ_ONLY_THP_FOR_FS enabled for MySQL and saw a decent improvement, but
I really don't remember the exact number.

>
> [1] https://lore.kernel.org/all/20240215103205.2607016-1-ryan.roberts@xxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@xxxxxxx/
> [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@xxxxxxx/
>
> Conclusions
> ===========
>
> I think people in the room already said most of what I want to say;
> Unfortunately there is a trade-off between performance and memory consumption.
> And it is not always practical to dole out the biggest THP we can allocate; lots
> of partially used 2M chunks would lead to a lot of wasted memory. So we need a
> way to let user space configure the kernel for their desired mTHP sizes.
>
> In the long term, it would be great to support an "auto" mode, and the current
> interfaces leave the door open to that. Perhaps your suggestion to start out
> with 64K and collapse to higher orders is one tool that could take us in that
> direction. But 64K is arm64-specific. AMD wants 32K. So you still need some
> mechanism to determine that (and the community wasn't keen on having the arch
> tell us that).
>
> It may actually turn out that we need a more complex interface to allow a (set
> of) mTHP order(s) to be enabled for a specific VMA. We previously concluded that
> if/when the time comes, then madvise_process() should give us what we need. That
> would allow better integration with user space.

The internal fragmentation (memory waste) of 2M THP is a chronic problem.
Medium-sized THP can help tackle it, but the performance may not be as good as
with 2M THP. So after the discussion I was actually thinking that we may need
two policies depending on the workload, since there seems to be no single
policy that works for everyone: one for maximum TLB utilization, the other for
conserving memory. For example, workloads that don't care too much about memory
waste can choose to allocate THP from the biggest suitable order, for example
2M for some VM workloads. On the other end of the spectrum, we can start
allocating from a smaller order and then collapse to a larger order. The system
can have a default policy, and users can change it via some interface, for
example madvise(). Anyway, this is just off the top of my head; I haven't
invested too much time in this aspect yet.

I don't think 64K vs 32K is a problem. The two 32K chunks in the same 64K chunk
are properly aligned, and 64K is not a very high order, so starting from 64K
for everyone should not be a problem. I don't see why we have to care about
this.

With all the means mentioned above, we may be able to achieve a full "auto"
mode in the future.
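For illustration, with the existing per-size sysfs knobs a policy split like
the above could be expressed roughly as follows on a 4K page size kernel (run
as root; the chosen values are just an example, not a recommendation):

  $ echo always  > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled    # use 64K mTHP wherever possible
  $ echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled  # 2M follows the top-level knob
  $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled                   # top-level knob that "inherit" sizes follow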
Actually, another problem with the current interface is that we may end up with
the same behavior from different settings. For example, setting "inherit" for
all orders while the top-level knob is "always" may behave the same as setting
all orders and the top-level knob to "always". This may result in some
confusion and violate the conventions for sysfs interfaces.

>
> Your suggestion about splitting higher orders to 64K at swap out is interesting;
> that might help with some swap fragmentation issues we are currently grappling
> with. But ultimately splitting a folio is expensive and we want to avoid that
> cost as much as possible. I'd prefer to continue down the route that Chris Li is
> taking us so that we can do a better job of allocating swap in the first place.

I think I meant splitting to 64K when we have to split; I don't mean we split
to 64K all the time. If we run into swap fragmentation, splitting to a smaller
order may help reduce premature OOMs, and the cost of splitting may be worth
it. Just like what we do on other paths, for example page demotion, migration,
etc., we split the large folio if there is not enough memory. I may not have
articulated this well in the slides and the discussion, sorry for the
confusion. If we have a better way to tackle swap fragmentation without
splitting, that is definitely preferable.

>
> Thanks,
> Ryan
>