On 27.02.2023 22:32, Roman Gushchin wrote: > On Mon, Feb 27, 2023 at 10:20:59PM +0300, Kirill Tkhai wrote: >> On 27.02.2023 18:08, Mike Rapoport wrote: >>> Hi, >>> >>> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote: >>>> >>>> >>>> On 2023/2/27 03:51, Andrew Morton wrote: >>>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> This patch series aims to make slab shrink lockless. >>>>> >>>>> What an awesome changelog. >>>>> >>>>>> 2. Survey >>>>>> ========= >>>>> >>>>> Especially this part. >>>>> >>>>> Looking through all the prior efforts and at this patchset I am not >>>>> immediately seeing any statements about the overall effect upon >>>>> real-world workloads. For a good example, does this patchset >>>>> measurably improve throughput or energy consumption on your servers? >>>> >>>> Hi Andrew, >>>> >>>> I re-tested with the following physical machines: >>>> >>>> Architecture: x86_64 >>>> CPU(s): 96 >>>> On-line CPU(s) list: 0-95 >>>> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz >>>> >>>> I found that the reason for the hotspot I described in cover letter is >>>> wrong. The reason for the down_read_trylock() hotspot is not because of >>>> the failure to trylock, but simply because of the atomic operation >>>> (cmpxchg). And this will lead to a significant reduction in IPC (insn >>>> per cycle). >>> >>> ... >>> >>>> Then we can use the following perf command to view hotspots: >>>> >>>> perf top -U -F 999 >>>> >>>> 1) Before applying this patchset: >>>> >>>> 32.31% [kernel] [k] down_read_trylock >>>> 19.40% [kernel] [k] pv_native_safe_halt >>>> 16.24% [kernel] [k] up_read >>>> 15.70% [kernel] [k] shrink_slab >>>> 4.69% [kernel] [k] _find_next_bit >>>> 2.62% [kernel] [k] shrink_node >>>> 1.78% [kernel] [k] shrink_lruvec >>>> 0.76% [kernel] [k] do_shrink_slab >>>> >>>> 2) After applying this patchset: >>>> >>>> 27.83% [kernel] [k] _find_next_bit >>>> 16.97% [kernel] [k] shrink_slab >>>> 15.82% [kernel] [k] pv_native_safe_halt >>>> 9.58% [kernel] [k] shrink_node >>>> 8.31% [kernel] [k] shrink_lruvec >>>> 5.64% [kernel] [k] do_shrink_slab >>>> 3.88% [kernel] [k] mem_cgroup_iter >>>> >>>> 2. At the same time, we use the following perf command to capture IPC >>>> information: >>>> >>>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10 >>>> >>>> 1) Before applying this patchset: >>>> >>>> Performance counter stats for 'system wide' (5 runs): >>>> >>>> 454187219766 cycles test ( >>>> +- 1.84% ) >>>> 78896433101 instructions test # 0.17 insn per >>>> cycle ( +- 0.44% ) >>>> >>>> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% ) >>>> >>>> 2) After applying this patchset: >>>> >>>> Performance counter stats for 'system wide' (5 runs): >>>> >>>> 841954709443 cycles test ( >>>> +- 15.80% ) (98.69%) >>>> 527258677936 instructions test # 0.63 insn per >>>> cycle ( +- 15.11% ) (98.68%) >>>> >>>> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% ) >>>> >>>> We can see that IPC drops very seriously when calling >>>> down_read_trylock() at high frequency. After using SRCU, >>>> the IPC is at a normal level. >>> >>> The results you present do show improvement in IPC for an artificial test >>> script. But more interesting would be to see how a real world workloads >>> benefit from your changes. >> >> One of the real workloads from my experience is start of an overcommitted node >> containing many starting containers after node crash (or many resuming containers >> after reboot for kernel update). In these cases memory pressure is huge, and >> the node goes round in long reclaim. >> >> This patch patchset makes prealloc_memcg_shrinker() independent of do_shrink_slab(), >> so prealloc_memcg_shrinker() won't have to wait till shrink_slab_memcg() completes its >> current bit iteration, sees rwsem_is_contended() and the iteration breaks. >> >> Also, it's important to mention that currently we have the strange behavior: >> >> prealloc_memcg_shrinker() >> down_write(&shrinker_rwsem) >> idr_alloc() >> reclaim >> for each child memcg >> shrink_slab_memcg() >> down_read_trylock(&shrinker_rwsem) -> fail > > But this can happen only if we get -ENOMEM in idr_alloc()? > Doesn't seem to be a very hot path. There is not only idr_alloc(), but expand_shrinker_info() too. The last is more heavier. But despite that, yes, it's not a hot path. The memory pressure on overcommited node start I described above is a regular situation. There are lots of register_shrinker() contending with reclaim.