Re: [linux-next:master] [mm] 1111d46b5c: stress-ng.pthread.ops_per_sec -84.3% regression

On 2024/1/4 09:32, Yang Shi wrote:
On Thu, Dec 21, 2023 at 5:13 PM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:



On 12/22/2023 2:11 AM, Yang Shi wrote:
On Thu, Dec 21, 2023 at 5:40 AM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:



On 12/21/2023 8:58 AM, Yin Fengwei wrote:
But what I am not sure about is whether it's worth making such a change,
as the regression is only clearly visible in a micro-benchmark. No evidence
shows that the other regressions in this report are related to madvise, at
least from the perf statistics. Need to check more on stream/ramspeed.
Thanks.

With the debugging patch (which filters stack mappings out of THP
alignment), the stream regression can be reduced to around 2%:

commit:
     30749e6fbb3d391a7939ac347e9612afe8c26e94
     1111d46b5cbad57486e7a3fab75888accac2f072
     89f60532d82b9ecd39303a74589f76e4758f176f  -> 1111d46b5cbad with debugging patch

30749e6fbb3d391a 1111d46b5cbad57486e7a3fab75 89f60532d82b9ecd39303a74589
---------------- --------------------------- ---------------------------
      350993           -15.6%     296081 ±  2%      -1.5%     345689           stream.add_bandwidth_MBps
      349830           -16.1%     293492 ±  2%      -2.3%     341860 ±  2%     stream.add_bandwidth_MBps_harmonicMean
      333973           -20.5%     265439 ±  3%      -1.7%     328403           stream.copy_bandwidth_MBps
      332930           -21.7%     260548 ±  3%      -2.5%     324711 ±  2%     stream.copy_bandwidth_MBps_harmonicMean
      302788           -16.2%     253817 ±  2%      -1.4%     298421           stream.scale_bandwidth_MBps
      302157           -17.1%     250577 ±  2%      -2.0%     296054           stream.scale_bandwidth_MBps_harmonicMean
      339047           -12.1%     298061            -1.4%     334206           stream.triad_bandwidth_MBps
      338186           -12.4%     296218            -2.0%     331469           stream.triad_bandwidth_MBps_harmonicMean


The regression of ramspeed is still there.

Thanks for the debugging patch and the test. If no one objects to
honoring MAP_STACK, I'm going to come up with a more formal patch.
Even though thp_get_unmapped_area() is not called for MAP_STACK, the
stack area may theoretically still be allocated at a 2M-aligned address.
And it may be worse with multi-sized THP, for example at 1M alignment.
Right. Filtering out MAP_STACK can't guarantee that no THP is used for the
stack; it just reduces the chance of using THP for the stack.

Can you please help test the below patch?
I can't access the testing box now. Oliver will help to test your patch.


Regards
Yin, Fengwei


diff --git a/include/linux/mman.h b/include/linux/mman.h
index 40d94411d492..dc7048824be8 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
         return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
                _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
                _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
+              _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
                arch_calc_vm_flag_bits(flags);
  }
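
For reference, _calc_vm_trans() just translates a flag bit from the mmap()
flags word into the corresponding bit in the VMA flags, so with this hunk a
mapping created with MAP_STACK ends up with VM_NOHUGEPAGE set and the THP
paths skip it. A minimal userspace sketch of that bit-translation idea (my
own illustration with made-up flag values, not the kernel macro itself):

#include <stdio.h>

/* Illustrative values only; the real MAP_STACK/VM_NOHUGEPAGE bits differ. */
#define DEMO_MAP_STACK      0x00020000UL
#define DEMO_VM_NOHUGEPAGE  0x40000000UL

/* Return `to` if bit `from` is set in `flags`, otherwise 0. */
static unsigned long trans_bit(unsigned long flags, unsigned long from,
			       unsigned long to)
{
	return (flags & from) ? to : 0;
}

int main(void)
{
	unsigned long mmap_flags = DEMO_MAP_STACK;
	unsigned long vm_flags = trans_bit(mmap_flags, DEMO_MAP_STACK,
					   DEMO_VM_NOHUGEPAGE);

	printf("nohugepage set: %s\n",
	       (vm_flags & DEMO_VM_NOHUGEPAGE) ? "yes" : "no");
	return 0;
}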

But I can't reproduce the pthread regression on my aarch64 VM. It might
be due to the guard stack (the 64K guard stack is 2M-aligned and the 8M
stack sits right next to it, starting at 2M + 64K, so the stack itself is
not 2M-aligned). But I can see that the stack area is no longer THP
eligible with this patch. See:

fffd18e10000-fffd19610000 rw-p 00000000 00:00 0
Size:               8192 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  12 kB
Pss:                  12 kB
Pss_Dirty:            12 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        12 kB
Referenced:           12 kB
Anonymous:            12 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
VmFlags: rd wr mr mw me ac nh

The "nh" flag is set.



Do you have any instructions on how to run ramspeed? Anyway, I may not
have time to debug it until after the holidays.
0Day leverages phoronix-test-suite to run ramspeed, so I don't have a
direct answer here.

I suppose we could check the configuration of ramspeed in phoronix-test-
suite to understand which build options and command-line options are used
to run ramspeed:
https://openbenchmarking.org/test/pts/ramspeed

Downloaded the test suite. It looks like phoronix just runs test 3 (int)
and test 6 (float). They basically do 4 sub-tests to benchmark memory
bandwidth:

  * copy
  * scale copy
  * add copy
  * triad copy

The source buffer is initialized (page faults are triggered), but the
destination area is not, so the page fault + page clear time is accounted
into the result. Clearing a huge page may take a little more time. But I
didn't see a noticeable regression on my aarch64 VM either. Anyway, I
suppose such a test should be run with THP off.
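
For context, the four sub-tests above are the classic STREAM-style kernels.
A rough sketch of what is being timed (my own illustration, not ramspeed's
actual source; the array size and scalar are arbitrary) is below. Because
only the source array is initialized, the first write to each destination
page pays for the fault and page clearing, which ends up in the measured
bandwidth:

#include <stdlib.h>

#define N (16UL << 20)	/* 16M doubles per array (~128MB), arbitrary size */

int main(void)
{
	double *a = malloc(N * sizeof(double));
	double *b = malloc(N * sizeof(double));
	double *c = malloc(N * sizeof(double));
	double *d = malloc(N * sizeof(double));
	const double s = 3.0;
	size_t i;

	if (!a || !b || !c || !d)
		return 1;

	for (i = 0; i < N; i++)		/* initialize the source only */
		a[i] = 1.0;

	for (i = 0; i < N; i++)		/* copy  */
		b[i] = a[i];
	for (i = 0; i < N; i++)		/* scale */
		c[i] = s * a[i];
	for (i = 0; i < N; i++)		/* add   */
		d[i] = a[i] + b[i];
	for (i = 0; i < N; i++)		/* triad */
		d[i] = a[i] + s * b[i];

	return 0;
}

With THP enabled, that first write to an untouched destination page can
mean clearing a 2M page rather than a 4K one, which is the extra cost
being discussed above.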



Regards
Yin, Fengwei



