Re: [PATCH] [RFC PATCH v2]mm/slub: Optimize slub memory usage

On Thu, Jul 20, 2023 at 12:01 PM Oliver Sang <oliver.sang@xxxxxxxxx> wrote:
>
> hi, Hyeonggon Yoo,
>
> On Tue, Jul 18, 2023 at 03:43:16PM +0900, Hyeonggon Yoo wrote:
> > On Mon, Jul 17, 2023 at 10:41 PM kernel test robot
> > <oliver.sang@xxxxxxxxx> wrote:
> > >
> > >
> > >
> > > Hello,
> > >
> > > kernel test robot noticed a -12.5% regression of hackbench.throughput on:
> > >
> > >
> > > commit: a0fd217e6d6fbd23e91f8796787b621e7d576088 ("[PATCH] [RFC PATCH v2]mm/slub: Optimize slub memory usage")
> > > url: https://github.com/intel-lab-lkp/linux/commits/Jay-Patel/mm-slub-Optimize-slub-memory-usage/20230628-180050
> > > base: git://git.kernel.org/cgit/linux/kernel/git/vbabka/slab.git for-next
> > > patch link: https://lore.kernel.org/all/20230628095740.589893-1-jaypatel@xxxxxxxxxxxxx/
> > > patch subject: [PATCH] [RFC PATCH v2]mm/slub: Optimize slub memory usage
> > >
> > > testcase: hackbench
> > > test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> > > parameters:
> > >
> > >         nr_threads: 100%
> > >         iterations: 4
> > >         mode: process
> > >         ipc: socket
> > >         cpufreq_governor: performance
> > >
> > >
> > >
> > >
> > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > the same patch/commit), kindly add following tags
> > > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > > | Closes: https://lore.kernel.org/oe-lkp/202307172140.3b34825a-oliver.sang@xxxxxxxxx
> > >
> > >
> > > Details are as below:
> > > -------------------------------------------------------------------------------------------------->
> > >
> > >
> > > To reproduce:
> > >
> > >         git clone https://github.com/intel/lkp-tests.git
> > >         cd lkp-tests
> > >         sudo bin/lkp install job.yaml           # job file is attached in this email
> > >         bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
> > >         sudo bin/lkp run generated-yaml-file
> > >
> > >         # if come across any failure that blocks the test,
> > >         # please remove ~/.lkp and /lkp dir to run from a clean state.
> > >
> > > =========================================================================================
> > > compiler/cpufreq_governor/ipc/iterations/kconfig/mode/nr_threads/rootfs/tbox_group/testcase:
> > >   gcc-12/performance/socket/4/x86_64-rhel-8.3/process/100%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp2/hackbench
> > >
> > > commit:
> > >   7bc162d5cc ("Merge branches 'slab/for-6.5/prandom', 'slab/for-6.5/slab_no_merge' and 'slab/for-6.5/slab-deprecate' into slab/for-next")
> > >   a0fd217e6d ("mm/slub: Optimize slub memory usage")
> > >
> > > 7bc162d5cc4de5c3 a0fd217e6d6fbd23e91f8796787
> > > ---------------- ---------------------------
> > >          %stddev     %change         %stddev
> > >              \          |                \
> > >     222503 ± 86%    +108.7%     464342 ± 58%  numa-meminfo.node1.Active
> > >     222459 ± 86%    +108.7%     464294 ± 58%  numa-meminfo.node1.Active(anon)
> > >      55573 ± 85%    +108.0%     115619 ± 58%  numa-vmstat.node1.nr_active_anon
> > >      55573 ± 85%    +108.0%     115618 ± 58%  numa-vmstat.node1.nr_zone_active_anon
> >
> > I'm quite baffled while reading this.
> > How did changing the slab order calculation double the number of active anon pages?
> > I doubt the two experiments were performed under the same settings.
>
> let me introduce our test process.
>
> we make sure the tests for the commit and its parent run in exactly the
> same environment, with the kernel being the only difference, and we also
> make sure the configs used to build the commit and its parent are identical.
>
> we run the tests for each commit at least 6 times to make sure the data
> is stable.
>
> for this case, for example, we rebuilt the commit's and its parent's
> kernels; the config is attached FYI.

Hello Oliver,

Thank you for confirming that the testing environment is fine,
and I'm sorry; I didn't mean to imply that your tests were bad.

It was more like "oh, the data totally doesn't make sense to me",
and I blamed the tests rather than my own poor understanding of the data ;)

Anyway, as the data shows a repeatable regression,
let's think more about possible scenarios:

I can't stop thinking that the patch must have affected the system's
reclamation behavior in some way.
(I think more active anon pages with a similar total number of anon
pages implies the kernel scanned more pages.)

It might be because kswapd was woken up more frequently (possible if
skbs were allocated with GFP_ATOMIC), but the data provided is not
enough to support this argument.
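
To illustrate the mechanism I have in mind, here is a tiny userspace
sketch (simplified; the GFP bit values and the helper below are made up
for illustration, the real logic lives in __alloc_pages_slowpath() in
mm/page_alloc.c): an allocation without __GFP_DIRECT_RECLAIM cannot
reclaim synchronously, so under watermark pressure its only recourse is
waking kswapd, pushing the scanning work into the background:

	#include <stdio.h>

	/* Stand-ins for the real GFP bits; the values are illustrative. */
	#define SKETCH_KSWAPD_RECLAIM  (1u << 0)  /* may wake kswapd */
	#define SKETCH_DIRECT_RECLAIM  (1u << 1)  /* may block and reclaim */
	#define SKETCH_GFP_ATOMIC      SKETCH_KSWAPD_RECLAIM

	/* Condensed shape of the page allocator slowpath: kswapd is
	 * woken first; only callers that may block can fall back to
	 * direct reclaim. */
	static void slowpath_sketch(unsigned int gfp)
	{
		if (gfp & SKETCH_KSWAPD_RECLAIM)
			printf("wake kswapd -> background scanning/aging\n");
		if (gfp & SKETCH_DIRECT_RECLAIM)
			printf("direct reclaim in process context\n");
		else
			printf("cannot block -> just retry the freelists\n");
	}

	int main(void)
	{
		slowpath_sketch(SKETCH_GFP_ATOMIC);  /* e.g. skb data */
		return 0;
	}

If this is what happened, more GFP_ATOMIC slab allocations missing the
watermarks would show up exactly as more kswapd wakeups and more
background page scanning.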

>       2.43 ±  7%      +4.5        6.90 ± 11%  perf-profile.children.cycles-pp.get_partial_node
>       3.23 ±  5%      +4.5        7.77 ±  9%  perf-profile.children.cycles-pp.___slab_alloc
>       7.51 ±  2%      +4.6       12.11 ±  5%  perf-profile.children.cycles-pp.kmalloc_reserve
>       6.94 ±  2%      +4.7       11.62 ±  6%  perf-profile.children.cycles-pp.__kmalloc_node_track_caller
>       6.46 ±  2%      +4.8       11.22 ±  6%  perf-profile.children.cycles-pp.__kmem_cache_alloc_node
>       8.48 ±  4%      +7.9       16.42 ±  8%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
>       6.12 ±  6%      +8.6       14.74 ±  9%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath

And this increase in cycles spent in the SLUB slowpath implies that the
actual number of objects available in the per-cpu partial lists has
decreased, possibly because of inaccuracy in the sizing heuristic
(namely, its assumptions that the slabs cached per cpu are half-filled
and that their order is s->oo)?
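
For reference, a minimal userspace sketch of that heuristic, paraphrased
from set_cpu_partial()/slub_set_cpu_partial() in mm/slub.c (the exact
code differs by kernel version, and the kmalloc-512 numbers below are
illustrative):

	#include <stdio.h>

	#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

	/* The per-cache object budget is converted into a cap on the
	 * number of per-cpu partial slabs by assuming each cached slab
	 * is half-full and holds oo_objects(s->oo) objects, i.e. the
	 * object count of a default-order slab. */
	static unsigned int cpu_partial_slabs(unsigned int nr_objects,
					      unsigned int objs_per_slab)
	{
		return DIV_ROUND_UP(nr_objects * 2, objs_per_slab);
	}

	int main(void)
	{
		/* e.g. kmalloc-512 with its budget of 52 objects (the
		 * "size >= 256" bucket in set_cpu_partial()): */
		printf("order-2 slabs, 32 objs each: cap = %u slabs\n",
		       cpu_partial_slabs(52, 32));  /* -> 4 slabs */
		printf("order-0 slabs,  8 objs each: cap = %u slabs\n",
		       cpu_partial_slabs(52, 8));   /* -> 13 slabs */
		return 0;
	}

If the cached slabs are emptier than half-full, or their real order is
lower than s->oo, that cap translates to fewer objects than budgeted, so
more allocations fall through to get_partial_node() and contend on
n->list_lock, which would match the _raw_spin_lock_irqsave /
native_queued_spin_lock_slowpath growth above.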

Any thoughts, Vlastimil or Jay?

>
> we then retested on this test machine:
> 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
>
> we noticed the regression still exists (a detailed comparison is attached
> as hackbench-a0fd217e6d-ICL-Gold-6338):
>
> =========================================================================================
> compiler/cpufreq_governor/ipc/iterations/kconfig/mode/nr_threads/rootfs/tbox_group/testcase:
>   gcc-12/performance/socket/4/x86_64-rhel-8.3/process/100%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp2/hackbench
>
> 7bc162d5cc4de5c3 a0fd217e6d6fbd23e91f8796787
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>     479042           -12.5%     419357        hackbench.throughput
>
> the raw data is as below:
>
> for 7bc162d5cc:
>   "hackbench.throughput": [
>     480199.7631014502,
>     478713.21886768367,
>     480692.1967633392,
>     476795.9313413859,
>     478545.2225235285,
>     479309.7938967886
>   ],
>
> for a0fd217e6d:
>   "hackbench.throughput": [
>     422654.2688081149,
>     419017.82222470525,
>     416817.183983105,
>     423286.39557524625,
>     414307.41610274825,
>     420062.1692010417
>   ],
>
>
> we also reran the tests on another test machine:
> 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
>
> and still found a regression
> (details in the attached hackbench-a0fd217e6d-ICL-Platinum-8358):
>
> =========================================================================================
> compiler/cpufreq_governor/ipc/iterations/kconfig/mode/nr_threads/rootfs/tbox_group/testcase:
>   gcc-12/performance/socket/4/x86_64-rhel-8.3/process/100%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp6/hackbench
>
> 7bc162d5cc4de5c3 a0fd217e6d6fbd23e91f8796787
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>     455347            -5.9%     428458        hackbench.throughput
>
>
> >
> > >    1377834 ±  2%     -10.7%    1230013        sched_debug.cpu.nr_switches.avg
> > >    1218144 ±  2%     -13.3%    1055659 ±  2%  sched_debug.cpu.nr_switches.min
> > >    3047631 ±  2%     -13.2%    2646560        vmstat.system.cs
> > >     561797           -13.8%     484137        vmstat.system.in
> > >     280976 ± 66%    +122.6%     625459 ± 52%  meminfo.Active
> > >     280881 ± 66%    +122.6%     625365 ± 52%  meminfo.Active(anon)
> > >     743351 ±  4%      -9.7%     671534 ±  6%  meminfo.AnonPages
> > >       1.36            -0.1        1.21        mpstat.cpu.all.irq%
> > >       0.04 ą  4%      -0.0        0.03 ą  4%  mpstat.cpu.all.soft%
> > >       5.38            -0.8        4.58        mpstat.cpu.all.usr%
> > >       0.26           -11.9%       0.23        turbostat.IPC
> > >     160.93           -19.3      141.61        turbostat.PKG_%
> > >      60.48            -8.9%      55.10        turbostat.RAMWatt
> > >      70049 ± 68%    +124.5%     157279 ± 52%  proc-vmstat.nr_active_anon
> > >     185963 ±  4%      -9.8%     167802 ±  6%  proc-vmstat.nr_anon_pages
> > >      37302            -1.2%      36837        proc-vmstat.nr_slab_reclaimable
> > >      70049 ± 68%    +124.5%     157279 ± 52%  proc-vmstat.nr_zone_active_anon
> > >    1101451           +12.0%    1233638        proc-vmstat.unevictable_pgs_scanned
> > >     477310           -12.5%     417480        hackbench.throughput
> > >     464064           -12.0%     408333        hackbench.throughput_avg
> > >     477310           -12.5%     417480        hackbench.throughput_best
> > >     435294            -9.5%     394098        hackbench.throughput_worst
> > >     131.28           +13.4%     148.89        hackbench.time.elapsed_time
> > >     131.28           +13.4%     148.89        hackbench.time.elapsed_time.max
> > >   90404617            -5.2%   85662614 ±  2%  hackbench.time.involuntary_context_switches
> > >      15342           +15.0%      17642        hackbench.time.system_time
> > >     866.32            -3.2%     838.32        hackbench.time.user_time
> > >  4.581e+10           -11.2%  4.069e+10        perf-stat.i.branch-instructions
> > >       0.45            +0.1        0.56        perf-stat.i.branch-miss-rate%
> > >  2.024e+08           +11.8%  2.263e+08        perf-stat.i.branch-misses
> > >      21.49            -1.1       20.42        perf-stat.i.cache-miss-rate%
> > >  4.202e+08           -16.6%  3.505e+08        perf-stat.i.cache-misses
> > >  1.935e+09           -11.5%  1.711e+09        perf-stat.i.cache-references
> > >    3115707 ±  2%     -13.9%    2681887        perf-stat.i.context-switches
> > >       1.31           +13.2%       1.48        perf-stat.i.cpi
> > >     375155 ±  3%     -16.3%     314001 ±  2%  perf-stat.i.cpu-migrations
> > >  6.727e+10           -11.2%  5.972e+10        perf-stat.i.dTLB-loads
> > >  4.169e+10           -12.2%  3.661e+10        perf-stat.i.dTLB-stores
> > >  2.465e+11           -11.4%  2.185e+11        perf-stat.i.instructions
> > >       0.77           -11.8%       0.68        perf-stat.i.ipc
> > >     818.18 ±  5%     +61.8%       1323 ±  2%  perf-stat.i.metric.K/sec
> > >       1225           -11.6%       1083        perf-stat.i.metric.M/sec
> > >      11341 ±  4%     -12.6%       9916 ±  4%  perf-stat.i.minor-faults
> > >   1.27e+08           -13.2%  1.102e+08        perf-stat.i.node-load-misses
> > >    3376198           -15.4%    2855906        perf-stat.i.node-loads
> > >   72756698           -22.9%   56082330        perf-stat.i.node-store-misses
> > >    4118986 ±  2%     -19.3%    3322276        perf-stat.i.node-stores
> > >      11432 ±  3%     -12.6%       9991 ±  4%  perf-stat.i.page-faults
> > >       0.44            +0.1        0.56        perf-stat.overall.branch-miss-rate%
> > >      21.76            -1.3       20.49        perf-stat.overall.cache-miss-rate%
> > >       1.29           +13.5%       1.47        perf-stat.overall.cpi
> > >     755.39           +21.1%     914.82        perf-stat.overall.cycles-between-cache-misses
> > >       0.77           -11.9%       0.68        perf-stat.overall.ipc
> > >  4.546e+10           -11.0%  4.046e+10        perf-stat.ps.branch-instructions
> > >  2.006e+08           +12.0%  2.246e+08        perf-stat.ps.branch-misses
> > >  4.183e+08           -16.8%   3.48e+08        perf-stat.ps.cache-misses
> > >  1.923e+09           -11.7%  1.699e+09        perf-stat.ps.cache-references
> > >    3073921 ±  2%     -13.9%    2647497        perf-stat.ps.context-switches
> > >     367849 ±  3%     -16.1%     308496 ±  2%  perf-stat.ps.cpu-migrations
> > >  6.683e+10           -11.2%  5.938e+10        perf-stat.ps.dTLB-loads
> > >  4.144e+10           -12.2%  3.639e+10        perf-stat.ps.dTLB-stores
> > >  2.447e+11           -11.2%  2.172e+11        perf-stat.ps.instructions
> > >      10654 ±  4%     -11.5%       9428 ±  4%  perf-stat.ps.minor-faults
> > >  1.266e+08           -13.5%  1.095e+08        perf-stat.ps.node-load-misses
> > >    3361116           -15.6%    2836863        perf-stat.ps.node-loads
> > >   72294146           -23.1%   55573600        perf-stat.ps.node-store-misses
> > >    4043240 ±  2%     -19.4%    3258771        perf-stat.ps.node-stores
> > >      10734 ±  4%     -11.6%       9494 ±  4%  perf-stat.ps.page-faults
> >
> > <...>
> >
> > >
> > > Disclaimer:
> > > Results have been estimated based on internal Intel analysis and are provided
> > > for informational purposes only. Any difference in system hardware or software
> > > design or configuration may affect actual performance.
> > >
> > >
> > > --
> > > 0-DAY CI Kernel Test Service
> > > https://github.com/intel/lkp-tests/wiki
> > >
> > >
> >




