Hi,

On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
> 
> 
> On 2023/2/27 03:51, Andrew Morton wrote:
> > On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
> > 
> > > Hi all,
> > > 
> > > This patch series aims to make slab shrink lockless.
> > 
> > What an awesome changelog.
> > 
> > > 2. Survey
> > > =========
> > 
> > Especially this part.
> > 
> > Looking through all the prior efforts and at this patchset I am not
> > immediately seeing any statements about the overall effect upon
> > real-world workloads.  For a good example, does this patchset
> > measurably improve throughput or energy consumption on your servers?
> 
> Hi Andrew,
> 
> I re-tested with the following physical machine:
> 
> Architecture:        x86_64
> CPU(s):              96
> On-line CPU(s) list: 0-95
> Model name:          Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
> 
> I found that the explanation for the hotspot I gave in the cover
> letter is wrong. The down_read_trylock() hotspot is caused not by
> trylock failures, but simply by the atomic operation (cmpxchg)
> itself, which leads to a significant drop in IPC (instructions per
> cycle).

...

> Then we can use the following perf command to view hotspots:
> 
> perf top -U -F 999
> 
> 1) Before applying this patchset:
> 
>   32.31%  [kernel]  [k] down_read_trylock
>   19.40%  [kernel]  [k] pv_native_safe_halt
>   16.24%  [kernel]  [k] up_read
>   15.70%  [kernel]  [k] shrink_slab
>    4.69%  [kernel]  [k] _find_next_bit
>    2.62%  [kernel]  [k] shrink_node
>    1.78%  [kernel]  [k] shrink_lruvec
>    0.76%  [kernel]  [k] do_shrink_slab
> 
> 2) After applying this patchset:
> 
>   27.83%  [kernel]  [k] _find_next_bit
>   16.97%  [kernel]  [k] shrink_slab
>   15.82%  [kernel]  [k] pv_native_safe_halt
>    9.58%  [kernel]  [k] shrink_node
>    8.31%  [kernel]  [k] shrink_lruvec
>    5.64%  [kernel]  [k] do_shrink_slab
>    3.88%  [kernel]  [k] mem_cgroup_iter
> 
> 2. At the same time, we use the following perf command to capture IPC
> information:
> 
> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
> 
> 1) Before applying this patchset:
> 
>  Performance counter stats for 'system wide' (5 runs):
> 
>     454187219766      cycles          test                             ( +-  1.84% )
>      78896433101      instructions    test   #  0.17 insn per cycle    ( +-  0.44% )
> 
>       10.0020430 +- 0.0000366 seconds time elapsed  ( +- 0.00% )
> 
> 2) After applying this patchset:
> 
>  Performance counter stats for 'system wide' (5 runs):
> 
>     841954709443      cycles          test                             ( +- 15.80% )  (98.69%)
>     527258677936      instructions    test   #  0.63 insn per cycle    ( +- 15.11% )  (98.68%)
> 
>         10.01064 +- 0.00831 seconds time elapsed  ( +- 0.08% )
> 
> We can see that IPC drops sharply when down_read_trylock() is called
> at high frequency. After switching to SRCU, IPC is back at a normal
> level.

The results you present do show an improvement in IPC for an
artificial test script. But it would be more interesting to see how
real-world workloads benefit from your changes.

> Thanks,
> Qi
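For readers following the thread, the change being measured above
boils down to replacing the shrinker_rwsem read side in shrink_slab()
with an SRCU read-side critical section. Below is a minimal sketch of
the idea, not the literal patch: the shrinker_srcu name and the
simplified function body are assumptions made for illustration.

	DEFINE_SRCU(shrinker_srcu);

	unsigned long shrink_slab(gfp_t gfp_mask, int nid,
				  struct mem_cgroup *memcg, int priority)
	{
		unsigned long freed = 0;
		struct shrinker *shrinker;
		int idx;

		/*
		 * Before: every reclaimer did
		 *
		 *	if (!down_read_trylock(&shrinker_rwsem))
		 *		goto out;
		 *	...
		 *	up_read(&shrinker_rwsem);
		 *
		 * so concurrent reclaimers all cmpxchg the same rwsem
		 * word, bouncing its cache line between CPUs.
		 */

		/* After: the SRCU read lock is only a per-CPU counter
		 * increment, with no shared atomic on the fast path. */
		idx = srcu_read_lock(&shrinker_srcu);

		list_for_each_entry_srcu(shrinker, &shrinker_list, list,
					 srcu_read_lock_held(&shrinker_srcu)) {
			struct shrink_control sc = {
				.gfp_mask = gfp_mask,
				.nid = nid,
				.memcg = memcg,
			};

			freed += do_shrink_slab(&sc, shrinker, priority);
		}

		srcu_read_unlock(&shrinker_srcu, idx);
		return freed;
	}

The write side (shrinker unregistration) would then wait for readers
with synchronize_srcu() instead of taking the rwsem for write, which
is what removes the cmpxchg from the reclaim fast path and explains
the IPC numbers quoted above.

--
Sincerely yours,
Mike.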