Re: [linux-next:master] [mm] 1111d46b5c: stress-ng.pthread.ops_per_sec -84.3% regression

Yin Fengwei <fengwei.yin@xxxxxxxxx> · Thu, 21 Dec 2023 08:58:42 +0800

On 2023/12/21 08:26, Yang Shi wrote:
On Wed, Dec 20, 2023 at 12:09 PM Yang Shi <shy828301@xxxxxxxxx> wrote:

On Wed, Dec 20, 2023 at 12:34 AM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:

On 2023/12/20 13:27, Yang Shi wrote:
On Tue, Dec 19, 2023 at 7:41 AM kernel test robot <oliver.sang@xxxxxxxxx> wrote:

Hello,

for this commit, we reported
"[mm]  96db82a66d:  will-it-scale.per_process_ops -95.3% regression"
in Aug, 2022 when it's in linux-next/master
https://lore.kernel.org/all/YwIoiIYo4qsYBcgd@xsang-OptiPlex-9020/

later, we reported
"[mm] f35b5d7d67: will-it-scale.per_process_ops -95.5% regression"
in Oct, 2022 when it's in linus/master
https://lore.kernel.org/all/202210181535.7144dd15-yujie.liu@xxxxxxxxx/

and the commit was reverted finally by
commit 0ba09b1733878afe838fe35c310715fda3d46428
Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date:   Sun Dec 4 12:51:59 2022 -0800

now we noticed it goes into linux-next/master again.

we are not sure if there is an agreement that the benefit of this commit
has already overweight performance drop in some mirco benchmark.

we also noticed from https://lore.kernel.org/all/20231214223423.1133074-1-yang@xxxxxxxxxxxxxxxxxxxxxx/
that
"This patch was applied to v6.1, but was reverted due to a regression
report.  However it turned out the regression was not due to this patch.
I ping'ed Andrew to reapply this patch, Andrew may forget it.  This
patch helps promote THP, so I rebased it onto the latest mm-unstable."

IIRC, Huang Ying's analysis showed the regression for will-it-scale
micro benchmark is fine, it was actually reverted due to kernel build
regression with LLVM reported by Nathan Chancellor. Then the
regression was resolved by commit
81e506bec9be1eceaf5a2c654e28ba5176ef48d8 ("mm/thp: check and bail out
if page in deferred queue already"). And this patch did improve kernel
build with GCC by ~3% if I remember correctly.

however, unfortunately, in our latest tests, we still observed below regression
upon this commit. just FYI.

kernel test robot noticed a -84.3% regression of stress-ng.pthread.ops_per_sec on:

Interesting, wasn't the same regression seen last time? And I'm a
little bit confused about how pthread got regressed. I didn't see the
pthread benchmark do any intensive memory alloc/free operations. Do
the pthread APIs do any intensive memory operations? I saw the
benchmark does allocate memory for thread stack, but it should be just
8K per thread, so it should not trigger what this patch does. With
1024 threads, the thread stacks may get merged into one single VMA (8M
total), but it may do so even though the patch is not applied.
stress-ng.pthread test code is strange here:

https://github.com/ColinIanKing/stress-ng/blob/master/stress-pthread.c#L573

Even it allocates its own stack, but that attr is not passed
to pthread_create. So it's still glibc to allocate stack for
pthread which is 8M size. This is why this patch can impact
the stress-ng.pthread testing.

Aha, nice catch, I overlooked that.

My understanding is this is different regression (if it's a valid
regression). The previous hotspot was in:
     deferred_split_huge_page
        deferred_split_huge_page
           deferred_split_huge_page
              spin_lock

while this time, the hotspot is in (pmd_lock from do_madvise I suppose):
     - 55.02% zap_pmd_range.isra.0
        - 53.42% __split_huge_pmd
           - 51.74% _raw_spin_lock
              - 51.73% native_queued_spin_lock_slowpath
                 + 3.03% asm_sysvec_call_function
           - 1.67% __split_huge_pmd_locked
              - 0.87% pmdp_invalidate
                 + 0.86% flush_tlb_mm_range
        - 1.60% zap_pte_range
           - 1.04% page_remove_rmap
                0.55% __mod_lruvec_page_state

commit: 1111d46b5cbad57486e7a3fab75888accac2f072 ("mm: align larger anonymous mappings on THP boundaries")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

testcase: stress-ng
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 128G memory
parameters:

          nr_threads: 1
          disk: 1HDD
          testtime: 60s
          fs: ext4
          class: os
          test: pthread
          cpufreq_governor: performance

In addition to that, the commit also has significant impact on the following tests:

+------------------+-----------------------------------------------------------------------------------------------+
| testcase: change | stream: stream.triad_bandwidth_MBps -12.1% regression                                         |
| test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory   |
| test parameters  | array_size=50000000                                                                           |
|                  | cpufreq_governor=performance                                                                  |
|                  | iterations=10x                                                                                |
|                  | loop=100                                                                                      |
|                  | nr_threads=25%                                                                                |
|                  | omp=true                                                                                      |
+------------------+-----------------------------------------------------------------------------------------------+
| testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.Integer.mb_s -3.5% regression       |
| test machine     | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory    |
| test parameters  | cpufreq_governor=performance                                                                  |
|                  | option_a=Average                                                                              |
|                  | option_b=Integer                                                                              |
|                  | test=ramspeed-1.4.3                                                                           |
+------------------+-----------------------------------------------------------------------------------------------+
| testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.FloatingPoint.mb_s -3.0% regression |
| test machine     | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory    |
| test parameters  | cpufreq_governor=performance                                                                  |
|                  | option_a=Average                                                                              |
|                  | option_b=Floating Point                                                                       |
|                  | test=ramspeed-1.4.3                                                                           |
+------------------+-----------------------------------------------------------------------------------------------+

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
| Closes: https://lore.kernel.org/oe-lkp/202312192310.56367035-oliver.sang@xxxxxxxxx

Details are as below:
-------------------------------------------------------------------------------------------------->

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231219/202312192310.56367035-oliver.sang@xxxxxxxxx

=========================================================================================
class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
    os/gcc-12/performance/1HDD/ext4/x86_64-rhel-8.3/1/debian-11.1-x86_64-20220510.cgz/lkp-csl-d02/pthread/stress-ng/60s

commit:
    30749e6fbb ("mm/memory: replace kmap() with kmap_local_page()")
    1111d46b5c ("mm: align larger anonymous mappings on THP boundaries")

30749e6fbb3d391a 1111d46b5cbad57486e7a3fab75
---------------- ---------------------------
           %stddev     %change         %stddev
               \          |                \
    13405796           -65.5%    4620124        cpuidle..usage
        8.00            +8.2%       8.66 ą  2%  iostat.cpu.system
        1.61           -60.6%       0.63        iostat.cpu.user
      597.50 ą 14%     -64.3%     213.50 ą 14%  perf-c2c.DRAM.local
        1882 ą 14%     -74.7%     476.83 ą  7%  perf-c2c.HITM.local
     3768436           -12.9%    3283395        vmstat.memory.cache
      355105           -75.7%      86344 ą  3%  vmstat.system.cs
      385435           -20.7%     305714 ą  3%  vmstat.system.in
        1.13            -0.2        0.88        mpstat.cpu.all.irq%
        0.29            -0.2        0.10 ą  2%  mpstat.cpu.all.soft%
        6.76 ą  2%      +1.1        7.88 ą  2%  mpstat.cpu.all.sys%
        1.62            -1.0        0.62 ą  2%  mpstat.cpu.all.usr%
     2234397           -84.3%     350161 ą  5%  stress-ng.pthread.ops
       37237           -84.3%       5834 ą  5%  stress-ng.pthread.ops_per_sec
      294706 ą  2%     -68.0%      94191 ą  6%  stress-ng.time.involuntary_context_switches
       41442 ą  2%   +5023.4%    2123284        stress-ng.time.maximum_resident_set_size
     4466457           -83.9%     717053 ą  5%  stress-ng.time.minor_page_faults

The larger RSS and fewer page faults are expected.

      243.33           +13.5%     276.17 ą  3%  stress-ng.time.percent_of_cpu_this_job_got
      131.64           +27.7%     168.11 ą  3%  stress-ng.time.system_time
       19.73           -82.1%       3.53 ą  4%  stress-ng.time.user_time

Much less user time. And it seems to match the drop of the pthread metric.

     7715609           -80.2%    1530125 ą  4%  stress-ng.time.voluntary_context_switches
       76728           -80.8%      14724 ą  4%  perf-stat.i.minor-faults
     5600408           -61.4%    2160997 ą  5%  perf-stat.i.node-loads
     8873996           +52.1%   13499744 ą  5%  perf-stat.i.node-stores
      112409           -81.9%      20305 ą  4%  perf-stat.i.page-faults
        2.55           +89.6%       4.83        perf-stat.overall.MPKI

Much more TLB misses.

        1.51            -0.4        1.13        perf-stat.overall.branch-miss-rate%
       19.26           +24.5       43.71        perf-stat.overall.cache-miss-rate%
        1.70           +56.4%       2.65        perf-stat.overall.cpi
      665.84           -17.5%     549.51 ą  2%  perf-stat.overall.cycles-between-cache-misses
        0.12 ą  4%      -0.1        0.04        perf-stat.overall.dTLB-load-miss-rate%
        0.08 ą  2%      -0.0        0.03        perf-stat.overall.dTLB-store-miss-rate%
       59.16            +0.9       60.04        perf-stat.overall.iTLB-load-miss-rate%
        1278           +86.1%       2379 ą  2%  perf-stat.overall.instructions-per-iTLB-miss
        0.59           -36.1%       0.38        perf-stat.overall.ipc

Worse IPC and CPI.

   2.078e+09           -48.3%  1.074e+09 ą  4%  perf-stat.ps.branch-instructions
    31292687           -61.2%   12133349 ą  2%  perf-stat.ps.branch-misses
    26057291            -5.9%   24512034 ą  4%  perf-stat.ps.cache-misses
   1.353e+08           -58.6%   56072195 ą  4%  perf-stat.ps.cache-references
      365254           -75.8%      88464 ą  3%  perf-stat.ps.context-switches
   1.735e+10           -22.4%  1.346e+10 ą  2%  perf-stat.ps.cpu-cycles
       60838           -79.1%      12727 ą  6%  perf-stat.ps.cpu-migrations
     3056601 ą  4%     -81.5%     565354 ą  4%  perf-stat.ps.dTLB-load-misses
   2.636e+09           -50.7%    1.3e+09 ą  4%  perf-stat.ps.dTLB-loads
     1155253 ą  2%     -83.0%     196581 ą  5%  perf-stat.ps.dTLB-store-misses
   1.473e+09           -57.4%  6.268e+08 ą  3%  perf-stat.ps.dTLB-stores
     7997726           -73.3%    2131477 ą  3%  perf-stat.ps.iTLB-load-misses
     5521346           -74.3%    1418623 ą  2%  perf-stat.ps.iTLB-loads
   1.023e+10           -50.4%  5.073e+09 ą  4%  perf-stat.ps.instructions
       75671           -80.9%      14479 ą  4%  perf-stat.ps.minor-faults
     5549722           -61.4%    2141750 ą  4%  perf-stat.ps.node-loads
     8769156           +51.6%   13296579 ą  5%  perf-stat.ps.node-stores
      110795           -82.0%      19977 ą  4%  perf-stat.ps.page-faults
   6.482e+11           -50.7%  3.197e+11 ą  4%  perf-stat.total.instructions
        0.00 ą 37%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.__cond_resched.__kmem_cache_alloc_node.__kmalloc_node.memcg_alloc_slab_cgroups.allocate_slab
        0.01 ą 18%   +8373.1%       0.73 ą 49%  perf-sched.sch_delay.avg.ms.__cond_resched.down_read.do_madvise.__x64_sys_madvise.do_syscall_64
        0.01 ą 16%   +4600.0%       0.38 ą 24%  perf-sched.sch_delay.avg.ms.__cond_resched.down_read.exit_mm.do_exit.__x64_sys_exit

More time spent in madvise and munmap. but I'm not sure whether this
is caused by tearing down the address space when exiting the test. If
so it should not count in the regression.
It's not for the whole address space tearing down. It's for pthread
stack tearing down when pthread exit (can be treated as address space
tearing down? I suppose so).

https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L384
https://github.com/lattera/glibc/blob/master/nptl/pthread_create.c#L576

It explains the problem. The madvise() does have some extra overhead
for handling THP (splitting pmd, deferred split queue, etc).

Another thing is whether it's worthy to make stack use THP? It may be
useful for some apps which need large stack size?

Kernel actually doesn't apply THP to stack (see
vma_is_temporary_stack()). But kernel can't know whether the VMA is
stack or not by checking VM_GROWSDOWN | VM_GROWSUP flags. So if glibc
doesn't set the proper flags to tell kernel the area is stack, kernel
just treats it as normal anonymous area. So glibc should set up stack
properly IMHO.

If I read the code correctly, nptl allocates stack by the below code:

mem = __mmap (NULL, size, (guardsize == 0) ? prot : PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

See https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L563

The MAP_STACK is used, but it is a no-op on Linux. So the alternative
is to make MAP_STACK useful on Linux instead of changing glibc. But
the blast radius seems much wider.
Yes. MAP_STACK is also mentioned in manpage of mmap. I did test to
filter out of the MAP_STACK mapping based on this patch. The regression
in stress-ng.pthread was gone. I suppose this is kind of safe because
the madvise call is only applied to glibc allocated stack.

But what I am not sure was whether it's worthy to do such kind of change
as the regression only is seen obviously in micro-benchmark. No evidence
showed the other regressionsin this report is related with madvise. At
least from the perf statstics. Need to check more on stream/ramspeed. 
Thanks.

Regards
Yin, Fengwei

Regards
Yin, Fengwei