On 3/7/23 07:56, Qi Zheng wrote: > Like global slab shrink, this commit also uses SRCU to make > memcg slab shrink lockless. > > We can reproduce the down_read_trylock() hotspot through the > following script: > > ``` > > DIR="/root/shrinker/memcg/mnt" > > do_create() > { > mkdir -p /sys/fs/cgroup/memory/test > mkdir -p /sys/fs/cgroup/perf_event/test > echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes > for i in `seq 0 $1`; > do > mkdir -p /sys/fs/cgroup/memory/test/$i; > echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs; > echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs; > mkdir -p $DIR/$i; > done > } > > do_mount() > { > for i in `seq $1 $2`; > do > mount -t tmpfs $i $DIR/$i; > done > } > > do_touch() > { > for i in `seq $1 $2`; > do > echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs; > echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs; > dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 & > done > } > > case "$1" in > touch) > do_touch $2 $3 > ;; > test) > do_create 4000 > do_mount 0 4000 > do_touch 0 3000 > ;; > *) > exit 1 > ;; > esac > ``` > > Save the above script, then run test and touch commands. > Then we can use the following perf command to view hotspots: > > perf top -U -F 999 > > 1) Before applying this patchset: > > 32.31% [kernel] [k] down_read_trylock > 19.40% [kernel] [k] pv_native_safe_halt > 16.24% [kernel] [k] up_read > 15.70% [kernel] [k] shrink_slab > 4.69% [kernel] [k] _find_next_bit > 2.62% [kernel] [k] shrink_node > 1.78% [kernel] [k] shrink_lruvec > 0.76% [kernel] [k] do_shrink_slab > > 2) After applying this patchset: > > 27.83% [kernel] [k] _find_next_bit > 16.97% [kernel] [k] shrink_slab > 15.82% [kernel] [k] pv_native_safe_halt > 9.58% [kernel] [k] shrink_node > 8.31% [kernel] [k] shrink_lruvec > 5.64% [kernel] [k] do_shrink_slab > 3.88% [kernel] [k] mem_cgroup_iter > > At the same time, we use the following perf command to capture > IPC information: > > perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10 > > 1) Before applying this patchset: > > Performance counter stats for 'system wide' (5 runs): > > 454187219766 cycles test ( +- 1.84% ) > 78896433101 instructions test # 0.17 insn per cycle ( +- 0.44% ) > > 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% ) > > 2) After applying this patchset: > > Performance counter stats for 'system wide' (5 runs): > > 841954709443 cycles test ( +- 15.80% ) (98.69%) > 527258677936 instructions test # 0.63 insn per cycle ( +- 15.11% ) (98.68%) > > 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% ) > > We can see that IPC drops very seriously when calling > down_read_trylock() at high frequency. After using SRCU, > the IPC is at a normal level. The interpretation looks somewhat weird to me. I'd say the workload is stalled a lot as it fails the trylock (there might be some optimistic spinning perhaps) and then goes to sleep. See how "pv_native_safe_halt" is also more prominent in before. And because of that sleeping, there's less instructions executed in the same amount of cycles (as it's a system wide collection, otherwise it wouldn't be collecting the sleeping processes). > > Signed-off-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> Other than that: Acked-by: Vlastimil Babka <Vbabka@xxxxxxx> A small thing below: > --- > mm/vmscan.c | 46 +++++++++++++++++++++++++++------------------- > 1 file changed, 27 insertions(+), 19 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 8515ac40bcaf..1de9bc3e5aa2 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -57,6 +57,7 @@ > #include <linux/khugepaged.h> > #include <linux/rculist_nulls.h> > #include <linux/random.h> > +#include <linux/srcu.h> I guess this should have been in patch 2/8 already? It may work accidentaly because some other header pulls it transitively...