Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks

Jakub Raček <jracek@xxxxxxxxxx> · Thu, 7 Jun 2018 13:19:26 +0200

Hi,

On 06/07/2018 01:07 PM, Michal Hocko wrote:
[CCing Mel and MM mailing list]

On Wed 06-06-18 14:27:32, Jakub Racek wrote:
Hi,

There is a huge performance regression on the 2 and 4 NUMA node systems on
stream benchmark with 4.17 kernel compared to 4.16 kernel. Stream, Linpack
and NAS parallel benchmarks show upto 50% performance drop.

When running for example 20 stream processes in parallel, we see the following behavior:

* all processes are started at NODE #1
* memory is also allocated on NODE #1
* roughly half of the processes are moved to the NODE #0 very quickly. *
however, memory is not moved to NODE #0 and stays allocated on NODE #1

As the result, half of the processes are running on NODE#0 with memory being
still allocated on NODE#1. This leads to non-local memory accesses
on the high Remote-To-Local Memory Access Ratio on the numatop charts.

So it seems that 4.17 is not doing a good job to move the memory to the right NUMA
node after the process has been moved.

----8<----

The above is an excerpt from performance testing on 4.16 and 4.17 kernels.

For now I'm merely making sure the problem is reported.

Do you have numa balancing enabled?

Yes. The relevant settings are:

kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256

--
Best regards,
Jakub Racek
FMK