Re: [PATCH v3 0/8] make slab shrink lockless

Kirill Tkhai <tkhai@xxxxx> · Mon, 27 Feb 2023 22:47:37 +0300

On 27.02.2023 22:32, Roman Gushchin wrote:
> On Mon, Feb 27, 2023 at 10:20:59PM +0300, Kirill Tkhai wrote:
>> On 27.02.2023 18:08, Mike Rapoport wrote:
>>> Hi,
>>>
>>> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2023/2/27 03:51, Andrew Morton wrote:
>>>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This patch series aims to make slab shrink lockless.
>>>>>
>>>>> What an awesome changelog.
>>>>>
>>>>>> 2. Survey
>>>>>> =========
>>>>>
>>>>> Especially this part.
>>>>>
>>>>> Looking through all the prior efforts and at this patchset I am not
>>>>> immediately seeing any statements about the overall effect upon
>>>>> real-world workloads.  For a good example, does this patchset
>>>>> measurably improve throughput or energy consumption on your servers?
>>>>
>>>> Hi Andrew,
>>>>
>>>> I re-tested with the following physical machines:
>>>>
>>>> Architecture:        x86_64
>>>> CPU(s):              96
>>>> On-line CPU(s) list: 0-95
>>>> Model name:          Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>>>
>>>> I found that the reason for the hotspot I described in cover letter is
>>>> wrong. The reason for the down_read_trylock() hotspot is not because of
>>>> the failure to trylock, but simply because of the atomic operation
>>>> (cmpxchg). And this will lead to a significant reduction in IPC (insn
>>>> per cycle).
>>>
>>> ... 
>>>  
>>>> Then we can use the following perf command to view hotspots:
>>>>
>>>> perf top -U -F 999
>>>>
>>>> 1) Before applying this patchset:
>>>>
>>>>   32.31%  [kernel]           [k] down_read_trylock
>>>>   19.40%  [kernel]           [k] pv_native_safe_halt
>>>>   16.24%  [kernel]           [k] up_read
>>>>   15.70%  [kernel]           [k] shrink_slab
>>>>    4.69%  [kernel]           [k] _find_next_bit
>>>>    2.62%  [kernel]           [k] shrink_node
>>>>    1.78%  [kernel]           [k] shrink_lruvec
>>>>    0.76%  [kernel]           [k] do_shrink_slab
>>>>
>>>> 2) After applying this patchset:
>>>>
>>>>   27.83%  [kernel]           [k] _find_next_bit
>>>>   16.97%  [kernel]           [k] shrink_slab
>>>>   15.82%  [kernel]           [k] pv_native_safe_halt
>>>>    9.58%  [kernel]           [k] shrink_node
>>>>    8.31%  [kernel]           [k] shrink_lruvec
>>>>    5.64%  [kernel]           [k] do_shrink_slab
>>>>    3.88%  [kernel]           [k] mem_cgroup_iter
>>>>
>>>> 2. At the same time, we use the following perf command to capture IPC
>>>> information:
>>>>
>>>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>>>
>>>> 1) Before applying this patchset:
>>>>
>>>>  Performance counter stats for 'system wide' (5 runs):
>>>>
>>>>       454187219766      cycles                    test                    (
>>>> +-  1.84% )
>>>>        78896433101      instructions              test #    0.17  insn per
>>>> cycle           ( +-  0.44% )
>>>>
>>>>         10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
>>>>
>>>> 2) After applying this patchset:
>>>>
>>>>  Performance counter stats for 'system wide' (5 runs):
>>>>
>>>>       841954709443      cycles                    test                    (
>>>> +- 15.80% )  (98.69%)
>>>>       527258677936      instructions              test #    0.63  insn per
>>>> cycle           ( +- 15.11% )  (98.68%)
>>>>
>>>>           10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
>>>>
>>>> We can see that IPC drops very seriously when calling
>>>> down_read_trylock() at high frequency. After using SRCU,
>>>> the IPC is at a normal level.
>>>
>>> The results you present do show improvement in IPC for an artificial test
>>> script. But more interesting would be to see how a real world workloads
>>> benefit from your changes.
>>
>> One of the real workloads from my experience is start of an overcommitted node
>> containing many starting containers after node crash (or many resuming containers
>> after reboot for kernel update). In these cases memory pressure is huge, and
>> the node goes round in long reclaim.
>>
>> This patch patchset makes prealloc_memcg_shrinker() independent of do_shrink_slab(),
>> so prealloc_memcg_shrinker() won't have to wait till shrink_slab_memcg() completes its
>> current bit iteration, sees rwsem_is_contended() and the iteration breaks.
>>
>> Also, it's important to mention that currently we have the strange behavior:
>>
>> prealloc_memcg_shrinker()
>>   down_write(&shrinker_rwsem)
>>   idr_alloc()
>>     reclaim
>>       for each child memcg
>>         shrink_slab_memcg()
>>           down_read_trylock(&shrinker_rwsem) -> fail
> 
> But this can happen only if we get -ENOMEM in idr_alloc()?
> Doesn't seem to be a very hot path.

There is not only idr_alloc(), but expand_shrinker_info() too. The last is more heavier.
But despite that, yes, it's not a hot path.

The memory pressure on overcommited node start I described above is a regular situation.
There are lots of register_shrinker() contending with reclaim.