Re: [PATCH v3 0/8] make slab shrink lockless

On 2023/2/28 03:20, Kirill Tkhai wrote:
On 27.02.2023 18:08, Mike Rapoport wrote:
Hi,

On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:


On 2023/2/27 03:51, Andrew Morton wrote:
On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:

Hi all,

This patch series aims to make slab shrink lockless.

What an awesome changelog.

2. Survey
=========

Especially this part.

Looking through all the prior efforts and at this patchset I am not
immediately seeing any statements about the overall effect upon
real-world workloads.  For a good example, does this patchset
measurably improve throughput or energy consumption on your servers?

Hi Andrew,

I re-tested with the following physical machines:

Architecture:        x86_64
CPU(s):              96
On-line CPU(s) list: 0-95
Model name:          Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz

I found that the reason for the hotspot I described in the cover letter is
wrong. The down_read_trylock() hotspot is not caused by trylock failures,
but simply by the atomic operation (cmpxchg) itself, and this leads to a
significant reduction in IPC (instructions per cycle).

...
Then we can use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

   32.31%  [kernel]           [k] down_read_trylock
   19.40%  [kernel]           [k] pv_native_safe_halt
   16.24%  [kernel]           [k] up_read
   15.70%  [kernel]           [k] shrink_slab
    4.69%  [kernel]           [k] _find_next_bit
    2.62%  [kernel]           [k] shrink_node
    1.78%  [kernel]           [k] shrink_lruvec
    0.76%  [kernel]           [k] do_shrink_slab

2) After applying this patchset:

   27.83%  [kernel]           [k] _find_next_bit
   16.97%  [kernel]           [k] shrink_slab
   15.82%  [kernel]           [k] pv_native_safe_halt
    9.58%  [kernel]           [k] shrink_node
    8.31%  [kernel]           [k] shrink_lruvec
    5.64%  [kernel]           [k] do_shrink_slab
    3.88%  [kernel]           [k] mem_cgroup_iter

2. At the same time, we use the following perf command to capture IPC
information:

perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10

1) Before applying this patchset:

  Performance counter stats for 'system wide' (5 runs):

       454187219766      cycles                    test                    ( +-  1.84% )
        78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )

         10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )

2) After applying this patchset:

  Performance counter stats for 'system wide' (5 runs):

       841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
       527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)

           10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )

We can see that IPC drops significantly when down_read_trylock() is
called at high frequency. After switching to SRCU, the IPC returns to
a normal level.
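For reference, the reader-side change looks roughly like the sketch below
(simplified from shrink_slab(); error handling, the memcg path and the
nr_to_scan setup are omitted, and the shrinker_srcu naming follows this
series). down_read_trylock() is an atomic RMW on the shared shrinker_rwsem
word, while srcu_read_lock() only updates a per-CPU counter, which is why
the cache-line bouncing and the IPC drop disappear:

        /* Before: every shrink_slab() call does an atomic RMW on the
         * shared shrinker_rwsem word. */
        if (!down_read_trylock(&shrinker_rwsem))
                goto out;
        list_for_each_entry(shrinker, &shrinker_list, list)
                freed += do_shrink_slab(&sc, shrinker, priority);
        up_read(&shrinker_rwsem);

        /* After (this series): srcu_read_lock() only increments a per-CPU
         * counter, so readers no longer share a cache line and never block
         * on writers. */
        idx = srcu_read_lock(&shrinker_srcu);
        list_for_each_entry_srcu(shrinker, &shrinker_list, list,
                                 srcu_read_lock_held(&shrinker_srcu))
                freed += do_shrink_slab(&sc, shrinker, priority);
        srcu_read_unlock(&shrinker_srcu, idx);

Unregister paths then need synchronize_srcu() to wait out in-flight
readers before a shrinker can be freed.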

The results you present do show an improvement in IPC for an artificial
test script. But it would be more interesting to see how real-world
workloads benefit from your changes.

One of the real workloads from my experience is the start of an overcommitted node
with many containers starting after a node crash (or many containers resuming
after a reboot for a kernel update). In these cases memory pressure is huge, and
the node goes through long rounds of reclaim.

Thanks a lot for providing this real workload! :)


This patchset makes prealloc_memcg_shrinker() independent of do_shrink_slab(),
so prealloc_memcg_shrinker() no longer has to wait until shrink_slab_memcg()
finishes its current bit iteration, sees rwsem_is_contended(), and breaks out
of the iteration.

Also, it's important to mention that we currently have this strange behavior:

prealloc_memcg_shrinker()
   down_write(&shrinker_rwsem)
   idr_alloc()
     reclaim
       for each child memcg
         shrink_slab_memcg()
           down_read_trylock(&shrinker_rwsem) -> fail

All the slab reclaim in this scenario is just parasitic work; it only wastes
CPU time, which does not look like a good design.
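For context, the bit iteration and rwsem_is_contended() check mentioned
above are roughly the following loop in the current shrink_slab_memcg()
(a simplified sketch; the shrinker_info lookup and error handling are
omitted):

        /* Simplified sketch of the pre-series shrink_slab_memcg() loop:
         * as soon as someone (e.g. prealloc_memcg_shrinker()) takes
         * shrinker_rwsem for write, the whole per-memcg bitmap walk is
         * abandoned, so the reclaim work done so far is largely wasted. */
        for_each_set_bit(i, info->map, shrinker_nr_max) {
                struct shrinker *shrinker = idr_find(&shrinker_idr, i);

                if (!shrinker)
                        continue;

                freed += do_shrink_slab(&sc, shrinker, priority);

                if (rwsem_is_contended(&shrinker_rwsem)) {
                        freed = freed ? : 1;    /* report progress and bail out */
                        break;
                }
        }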

Kirill

--
Thanks,
Qi



