Re: [linux-next:master] [mm] 1111d46b5c: stress-ng.pthread.ops_per_sec -84.3% regression

On Thu, Dec 21, 2023 at 5:13 PM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
>
>
> On 12/22/2023 2:11 AM, Yang Shi wrote:
> > On Thu, Dec 21, 2023 at 5:40 AM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 12/21/2023 8:58 AM, Yin Fengwei wrote:
> >>> But what I am not sure about is whether it's worth making such a
> >>> change, as the regression is only obvious in a micro-benchmark. No
> >>> evidence shows the other regressions in this report are related to
> >>> madvise, at least from the perf statistics. Need to check more on
> >>> stream/ramspeed. Thanks.
> >>
> >> With the debugging patch (which filters stack mappings out of THP
> >> alignment), the stream regression can be brought back to around 2%:
> >>
> >> commit:
> >>     30749e6fbb3d391a7939ac347e9612afe8c26e94
> >>     1111d46b5cbad57486e7a3fab75888accac2f072
> >>     89f60532d82b9ecd39303a74589f76e4758f176f  -> 1111d46b5cbad with debugging patch
> >>
> >> 30749e6fbb3d391a 1111d46b5cbad57486e7a3fab75 89f60532d82b9ecd39303a74589
> >> ---------------- --------------------------- ---------------------------
> >>     350993           -15.6%     296081 ±  2%      -1.5%     345689        stream.add_bandwidth_MBps
> >>     349830           -16.1%     293492 ±  2%      -2.3%     341860 ±  2%  stream.add_bandwidth_MBps_harmonicMean
> >>     333973           -20.5%     265439 ±  3%      -1.7%     328403        stream.copy_bandwidth_MBps
> >>     332930           -21.7%     260548 ±  3%      -2.5%     324711 ±  2%  stream.copy_bandwidth_MBps_harmonicMean
> >>     302788           -16.2%     253817 ±  2%      -1.4%     298421        stream.scale_bandwidth_MBps
> >>     302157           -17.1%     250577 ±  2%      -2.0%     296054        stream.scale_bandwidth_MBps_harmonicMean
> >>     339047           -12.1%     298061            -1.4%     334206        stream.triad_bandwidth_MBps
> >>     338186           -12.4%     296218            -2.0%     331469        stream.triad_bandwidth_MBps_harmonicMean
> >>
> >>
> >> The regression of ramspeed is still there.
> >
> > Thanks for the debugging patch and the test. If no one has an
> > objection to honoring MAP_STACK, I'm going to come up with a more
> > formal patch. Even though thp_get_unmapped_area() is not called for
> > MAP_STACK, the stack area may still theoretically be allocated at a
> > 2M-aligned address. And it may be worse with multi-sized THP, e.g.
> > for 1M.
> Right. Filtering out MAP_STACK can't make sure there is no THP for the
> stack; it just reduces the possibility of using THP for the stack.

Can you please help test the below patch?

diff --git a/include/linux/mman.h b/include/linux/mman.h
index 40d94411d492..dc7048824be8 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
        return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
               _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
               _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
+              _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
               arch_calc_vm_flag_bits(flags);
 }
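
For readers unfamiliar with the helper: a rough sketch of what the added
_calc_vm_trans() line effectively does (a simplification for illustration,
not the actual generic bit-translation macro from include/linux/mman.h).
The effect is that any mapping created with MAP_STACK carries
VM_NOHUGEPAGE from the start:

/*
 * Simplified illustration only -- the real _calc_vm_trans() is a
 * generic MAP_* -> VM_* bit-translation macro.  The net effect of the
 * added line is roughly:
 */
static inline unsigned long stack_flag_to_vm_flag(unsigned long map_flags)
{
	/* MAP_STACK requested by userspace -> mark the vma VM_NOHUGEPAGE */
	return (map_flags & MAP_STACK) ? VM_NOHUGEPAGE : 0;
}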

But I can't reproduce the pthread regression on my aarch64 VM. That
might be due to the stack guard (the 64K guard is 2M-aligned, and the
8M stack sits right next to it, starting at 2M + 64K). But I can see
that the stack area is no longer THP eligible with this patch. See:

fffd18e10000-fffd19610000 rw-p 00000000 00:00 0
Size:               8192 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  12 kB
Pss:                  12 kB
Pss_Dirty:            12 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        12 kB
Referenced:           12 kB
Anonymous:            12 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
VmFlags: rd wr mr mw me ac nh

The "nh" flag is set.

>
> >
> > Do you have any instructions regarding how to run ramspeed? Anyway I
> > may not have time to debug it until after the holidays.
> 0Day leverages phoronix-test-suite to run ramspeed. So I don't have
> a direct answer here.
>
> I suppose we could check the configuration of ramspeed in phoronix-test-
> suite to understand what the build options and command options are to run
> ramspeed:
> https://openbenchmarking.org/test/pts/ramspeed

Downloaded the test suite. It looks like Phoronix just runs test 3 (int)
and test 6 (float). They basically do 4 sub-tests to benchmark memory
bandwidth:

 * copy
 * scale copy
 * add copy
 * triad copy

The source buffer is initialized (so its page faults are triggered up
front), but the destination area is not. So the page fault + page clear
time is accounted to the result, and clearing huge pages may take a
little more time. But I didn't see a noticeable regression on my
aarch64 VM either. Anyway I suppose such a test should be run with THP
off.
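
A rough illustration of that pattern (not the actual ramspeed/pts code;
the buffer size, element type and the single copy pass are arbitrary):
the source is touched before timing starts, while the destination first
gets written inside the timed loop, so the fault and clear of the
(possibly huge) destination pages is charged to the measured bandwidth:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (256UL << 20)	/* illustrative size only */

int main(void)
{
	double *src = malloc(BUF_SIZE);
	double *dst = malloc(BUF_SIZE);	/* never touched before the timed copy */
	size_t n = BUF_SIZE / sizeof(double);
	struct timespec t0, t1;

	if (!src || !dst) {
		perror("malloc");
		return 1;
	}

	/* Source pages fault here, outside the measurement... */
	memset(src, 1, BUF_SIZE);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* ...destination pages fault (and get cleared, possibly as huge
	 * pages) inside the timed region. */
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i];
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("copy: %.1f MB/s\n", (BUF_SIZE / 1e6) / secs);
	free(src);
	free(dst);
	return 0;
}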

>
>
> Regards
> Yin, Fengwei
>
> >
> >>
> >>
> >> Regards
> >> Yin, Fengwei




