The patch titled
     Subject: mm: shrinker: make memcg slab shrink lockless
has been added to the -mm mm-unstable branch.  Its filename is
     mm-shrinker-make-memcg-slab-shrink-lockless.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-shrinker-make-memcg-slab-shrink-lockless.patch

This patch will later appear in the mm-unstable branch at
     git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when
    testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
Subject: mm: shrinker: make memcg slab shrink lockless
Date: Mon, 11 Sep 2023 17:44:42 +0800

Like the global slab shrink, this commit also uses the refcount+RCU method
to make memcg slab shrink lockless.

Use the following script to run a slab shrink stress test:

```
DIR="/root/shrinker/memcg/mnt"

do_create()
{
	mkdir -p /sys/fs/cgroup/memory/test
	echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
	for i in `seq 0 $1`;
	do
		mkdir -p /sys/fs/cgroup/memory/test/$i;
		echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
		mkdir -p $DIR/$i;
	done
}

do_mount()
{
	for i in `seq $1 $2`;
	do
		mount -t tmpfs $i $DIR/$i;
	done
}

do_touch()
{
	for i in `seq $1 $2`;
	do
		echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
		dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
	done
}

case "$1" in
  touch)
	do_touch $2 $3
	;;
  test)
	do_create 4000
	do_mount 0 4000
	do_touch 0 3000
	;;
  *)
	exit 1
	;;
esac
```

Save the above script, then run the test and touch commands.

Then we can use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  33.15%  [kernel]          [k] down_read_trylock
  25.38%  [kernel]          [k] shrink_slab
  21.75%  [kernel]          [k] up_read
   4.45%  [kernel]          [k] _find_next_bit
   2.27%  [kernel]          [k] do_shrink_slab
   1.80%  [kernel]          [k] intel_idle_irq
   1.79%  [kernel]          [k] shrink_lruvec
   0.67%  [kernel]          [k] xas_descend
   0.41%  [kernel]          [k] mem_cgroup_iter
   0.40%  [kernel]          [k] shrink_node
   0.38%  [kernel]          [k] list_lru_count_one

2) After applying this patchset:

  64.56%  [kernel]          [k] shrink_slab
  12.18%  [kernel]          [k] do_shrink_slab
   3.30%  [kernel]          [k] __rcu_read_unlock
   2.61%  [kernel]          [k] shrink_lruvec
   2.49%  [kernel]          [k] __rcu_read_lock
   1.93%  [kernel]          [k] intel_idle_irq
   0.89%  [kernel]          [k] shrink_node
   0.81%  [kernel]          [k] mem_cgroup_iter
   0.77%  [kernel]          [k] mem_cgroup_calculate_protection
   0.66%  [kernel]          [k] list_lru_count_one

We can see that the first perf hotspot becomes shrink_slab, which is what
we expect.
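For reference, the lifetime rule that shrinker_try_get()/shrinker_put()
implement can be sketched as below.  This is an illustrative sketch only,
not part of the patch: the shrinker_sketch structure and the sketch_*()
helpers are invented names that mirror the real mm/shrinker.c code.

```c
#include <linux/completion.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>

struct shrinker_sketch {
	refcount_t refcount;		/* pins the object across sleepable work */
	struct completion done;		/* signalled when the last ref is dropped */
};

/* Succeeds only while the object is still registered (refcount > 0). */
static bool sketch_try_get(struct shrinker_sketch *s)
{
	return refcount_inc_not_zero(&s->refcount);
}

static void sketch_put(struct shrinker_sketch *s)
{
	if (refcount_dec_and_test(&s->refcount))
		complete(&s->done);	/* wake the waiter in the free path */
}

static void sketch_reader(struct shrinker_sketch *s)
{
	/*
	 * RCU only guarantees that s is not freed while we look it up;
	 * the refcount then keeps it alive after rcu_read_unlock(), so
	 * the shrink work below is free to sleep.
	 */
	rcu_read_lock();
	if (!sketch_try_get(s)) {
		rcu_read_unlock();
		return;
	}
	rcu_read_unlock();

	/* ... sleepable work, e.g. do_shrink_slab() ... */

	sketch_put(s);
}
```

Unlike shrinker_rwsem, the acquire/release pair here only touches
per-shrinker state, which is consistent with down_read_trylock()/up_read()
disappearing from the profile above.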
Link: https://lkml.kernel.org/r/20230911094444.68966-44-zhengqi.arch@xxxxxxxxxxxxx
Signed-off-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
Cc: Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx>
Cc: Alasdair Kergon <agk@xxxxxxxxxx>
Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@xxxxxxxxxxxxx>
Cc: Andreas Dilger <adilger.kernel@xxxxxxxxx>
Cc: Andreas Gruenbacher <agruenba@xxxxxxxxxx>
Cc: Anna Schumaker <anna@xxxxxxxxxx>
Cc: Arnd Bergmann <arnd@xxxxxxxx>
Cc: Bob Peterson <rpeterso@xxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Carlos Llamas <cmllamas@xxxxxxxxxx>
Cc: Chandan Babu R <chandan.babu@xxxxxxxxxx>
Cc: Chao Yu <chao@xxxxxxxxxx>
Cc: Chris Mason <clm@xxxxxx>
Cc: Christian Brauner <brauner@xxxxxxxxxx>
Cc: Christian Koenig <christian.koenig@xxxxxxx>
Cc: Chuck Lever <cel@xxxxxxxxxx>
Cc: Coly Li <colyli@xxxxxxx>
Cc: Dai Ngo <Dai.Ngo@xxxxxxxxxx>
Cc: Daniel Vetter <daniel@xxxxxxxx>
Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
Cc: "Darrick J. Wong" <djwong@xxxxxxxxxx>
Cc: Dave Chinner <david@xxxxxxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: David Airlie <airlied@xxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: David Sterba <dsterba@xxxxxxxx>
Cc: Dmitry Baryshkov <dmitry.baryshkov@xxxxxxxxxx>
Cc: Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx>
Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
Cc: Huang Rui <ray.huang@xxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Jaegeuk Kim <jaegeuk@xxxxxxxxxx>
Cc: Jani Nikula <jani.nikula@xxxxxxxxxxxxxxx>
Cc: Jan Kara <jack@xxxxxxx>
Cc: Jason Wang <jasowang@xxxxxxxxxx>
Cc: Jeff Layton <jlayton@xxxxxxxxxx>
Cc: Jeffle Xu <jefflexu@xxxxxxxxxxxxxxxxx>
Cc: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>
Cc: Josef Bacik <josef@xxxxxxxxxxxxxx>
Cc: Juergen Gross <jgross@xxxxxxxx>
Cc: Kent Overstreet <kent.overstreet@xxxxxxxxx>
Cc: Kirill Tkhai <tkhai@xxxxx>
Cc: Marijn Suijten <marijn.suijten@xxxxxxxxxxxxxx>
Cc: "Michael S. Tsirkin" <mst@xxxxxxxxxx>
Cc: Mike Snitzer <snitzer@xxxxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: Muchun Song <muchun.song@xxxxxxxxx>
Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
Cc: Nadav Amit <namit@xxxxxxxxxx>
Cc: Neil Brown <neilb@xxxxxxx>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>
Cc: Olga Kornievskaia <kolga@xxxxxxxxxx>
Cc: Paul E. McKenney <paulmck@xxxxxxxxxx>
Cc: Richard Weinberger <richard@xxxxxx>
Cc: Rob Clark <robdclark@xxxxxxxxx>
Cc: Rob Herring <robh@xxxxxxxxxx>
Cc: Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx>
Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
Cc: Sean Paul <sean@xxxxxxxxxx>
Cc: Sergey Senozhatsky <senozhatsky@xxxxxxxxxxxx>
Cc: Song Liu <song@xxxxxxxxxx>
Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>
Cc: Steven Price <steven.price@xxxxxxx>
Cc: "Theodore Ts'o" <tytso@xxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Tomeu Vizoso <tomeu.vizoso@xxxxxxxxxxxxx>
Cc: Tom Talpey <tom@xxxxxxxxxx>
Cc: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx>
Cc: Yue Hu <huyue2@xxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/shrinker.c |   85 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 66 insertions(+), 19 deletions(-)

--- a/mm/shrinker.c~mm-shrinker-make-memcg-slab-shrink-lockless
+++ a/mm/shrinker.c
@@ -218,7 +218,6 @@ static int shrinker_memcg_alloc(struct s
 		return -ENOSYS;
 
 	down_write(&shrinker_rwsem);
-	/* This may call shrinker, so it must use down_read_trylock() */
 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
 	if (id < 0)
 		goto unlock;
@@ -252,10 +251,15 @@ static long xchg_nr_deferred_memcg(int n
 {
 	struct shrinker_info *info;
 	struct shrinker_info_unit *unit;
+	long nr_deferred;
 
-	info = shrinker_info_protected(memcg, nid);
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	unit = info->unit[shrinker_id_to_index(shrinker->id)];
-	return atomic_long_xchg(&unit->nr_deferred[shrinker_id_to_offset(shrinker->id)], 0);
+	nr_deferred = atomic_long_xchg(&unit->nr_deferred[shrinker_id_to_offset(shrinker->id)], 0);
+	rcu_read_unlock();
+
+	return nr_deferred;
 }
 
 static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
@@ -263,10 +267,16 @@ static long add_nr_deferred_memcg(long n
 {
 	struct shrinker_info *info;
 	struct shrinker_info_unit *unit;
+	long nr_deferred;
 
-	info = shrinker_info_protected(memcg, nid);
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	unit = info->unit[shrinker_id_to_index(shrinker->id)];
-	return atomic_long_add_return(nr, &unit->nr_deferred[shrinker_id_to_offset(shrinker->id)]);
+	nr_deferred =
+		atomic_long_add_return(nr, &unit->nr_deferred[shrinker_id_to_offset(shrinker->id)]);
+	rcu_read_unlock();
+
+	return nr_deferred;
 }
 
 void reparent_shrinker_deferred(struct mem_cgroup *memcg)
@@ -463,18 +473,54 @@ static unsigned long shrink_slab_memcg(g
 	if (!mem_cgroup_online(memcg))
 		return 0;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 0;
-
-	info = shrinker_info_protected(memcg, nid);
+	/*
+	 * lockless algorithm of memcg shrink.
+	 *
+	 * The shrinker_info may be freed asynchronously via RCU in the
+	 * expand_one_shrinker_info(), so the rcu_read_lock() needs to be used
+	 * to ensure the existence of the shrinker_info.
+	 *
+	 * The shrinker_info_unit is never freed unless its corresponding memcg
+	 * is destroyed. Here we already hold the refcount of memcg, so the
+	 * memcg will not be destroyed, and of course shrinker_info_unit will
+	 * not be freed.
+	 *
+	 * So in the memcg shrink:
+	 *  step 1: use rcu_read_lock() to guarantee existence of the
+	 *          shrinker_info.
+	 *  step 2: after getting shrinker_info_unit we can safely release the
+	 *          RCU lock.
+	 *  step 3: traverse the bitmap and calculate shrinker_id
+	 *  step 4: use rcu_read_lock() to guarantee existence of the shrinker.
+	 *  step 5: use shrinker_id to find the shrinker, then use
+	 *          shrinker_try_get() to guarantee existence of the shrinker,
+	 *          then we can release the RCU lock to do do_shrink_slab() that
+	 *          may sleep.
+	 *  step 6: do shrinker_put() paired with step 5 to put the refcount,
+	 *          if the refcount reaches 0, then wake up the waiter in
+	 *          shrinker_free() by calling complete().
+	 *          Note: here is different from the global shrink, we don't
+	 *                need to acquire the RCU lock to guarantee existence of
+	 *                the shrinker, because we don't need to use this
+	 *                shrinker to traverse the next shrinker in the bitmap.
+	 *  step 7: we have already exited the read-side of rcu critical section
+	 *          before calling do_shrink_slab(), the shrinker_info may be
+	 *          released in expand_one_shrinker_info(), so go back to step 1
+	 *          to reacquire the shrinker_info.
+	 */
+again:
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	if (unlikely(!info))
 		goto unlock;
 
-	for (; index < shrinker_id_to_index(info->map_nr_max); index++) {
+	if (index < shrinker_id_to_index(info->map_nr_max)) {
 		struct shrinker_info_unit *unit;
 
 		unit = info->unit[index];
 
+		rcu_read_unlock();
+
 		for_each_set_bit(offset, unit->map, SHRINKER_UNIT_BITS) {
 			struct shrink_control sc = {
 				.gfp_mask = gfp_mask,
@@ -484,12 +530,14 @@ static unsigned long shrink_slab_memcg(g
 			struct shrinker *shrinker;
 			int shrinker_id = calc_shrinker_id(index, offset);
 
+			rcu_read_lock();
 			shrinker = idr_find(&shrinker_idr, shrinker_id);
-			if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
-				if (!shrinker)
-					clear_bit(offset, unit->map);
+			if (unlikely(!shrinker || !shrinker_try_get(shrinker))) {
+				clear_bit(offset, unit->map);
+				rcu_read_unlock();
 				continue;
 			}
+			rcu_read_unlock();
 
 			/* Call non-slab shrinkers even though kmem is disabled */
 			if (!memcg_kmem_online() &&
@@ -522,15 +570,14 @@ static unsigned long shrink_slab_memcg(g
 					set_shrinker_bit(memcg, nid, shrinker_id);
 			}
 			freed += ret;
-
-			if (rwsem_is_contended(&shrinker_rwsem)) {
-				freed = freed ? : 1;
-				goto unlock;
-			}
+			shrinker_put(shrinker);
 		}
+
+		index++;
+		goto again;
 	}
 unlock:
-	up_read(&shrinker_rwsem);
+	rcu_read_unlock();
 	return freed;
 }
 
 #else /* !CONFIG_MEMCG */
_

Patches currently in -mm which might be from zhengqi.arch@xxxxxxxxxxxxx are

mm-move-some-shrinker-related-function-declarations-to-mm-internalh.patch
mm-vmscan-move-shrinker-related-code-into-a-separate-file.patch
mm-shrinker-remove-redundant-shrinker_rwsem-in-debugfs-operations.patch
drm-ttm-introduce-pool_shrink_rwsem.patch
mm-shrinker-add-infrastructure-for-dynamically-allocating-shrinker.patch
kvm-mmu-dynamically-allocate-the-x86-mmu-shrinker.patch
binder-dynamically-allocate-the-android-binder-shrinker.patch
drm-ttm-dynamically-allocate-the-drm-ttm_pool-shrinker.patch
xenbus-backend-dynamically-allocate-the-xen-backend-shrinker.patch
erofs-dynamically-allocate-the-erofs-shrinker.patch
f2fs-dynamically-allocate-the-f2fs-shrinker.patch
gfs2-dynamically-allocate-the-gfs2-glock-shrinker.patch
gfs2-dynamically-allocate-the-gfs2-qd-shrinker.patch
nfsv42-dynamically-allocate-the-nfs-xattr-shrinkers.patch
nfs-dynamically-allocate-the-nfs-acl-shrinker.patch
nfsd-dynamically-allocate-the-nfsd-filecache-shrinker.patch
quota-dynamically-allocate-the-dquota-cache-shrinker.patch
ubifs-dynamically-allocate-the-ubifs-slab-shrinker.patch
rcu-dynamically-allocate-the-rcu-lazy-shrinker.patch
rcu-dynamically-allocate-the-rcu-kfree-shrinker.patch
mm-thp-dynamically-allocate-the-thp-related-shrinkers.patch
sunrpc-dynamically-allocate-the-sunrpc_cred-shrinker.patch
mm-workingset-dynamically-allocate-the-mm-shadow-shrinker.patch
drm-i915-dynamically-allocate-the-i915_gem_mm-shrinker.patch
drm-msm-dynamically-allocate-the-drm-msm_gem-shrinker.patch
drm-panfrost-dynamically-allocate-the-drm-panfrost-shrinker.patch
dm-dynamically-allocate-the-dm-bufio-shrinker.patch
dm-zoned-dynamically-allocate-the-dm-zoned-meta-shrinker.patch
md-raid5-dynamically-allocate-the-md-raid5-shrinker.patch
bcache-dynamically-allocate-the-md-bcache-shrinker.patch
vmw_balloon-dynamically-allocate-the-vmw-balloon-shrinker.patch
virtio_balloon-dynamically-allocate-the-virtio-balloon-shrinker.patch
mbcache-dynamically-allocate-the-mbcache-shrinker.patch
ext4-dynamically-allocate-the-ext4-es-shrinker.patch
jbd2ext4-dynamically-allocate-the-jbd2-journal-shrinker.patch
nfsd-dynamically-allocate-the-nfsd-client-shrinker.patch
nfsd-dynamically-allocate-the-nfsd-reply-shrinker.patch
xfs-dynamically-allocate-the-xfs-buf-shrinker.patch
xfs-dynamically-allocate-the-xfs-inodegc-shrinker.patch
xfs-dynamically-allocate-the-xfs-qm-shrinker.patch
zsmalloc-dynamically-allocate-the-mm-zspool-shrinker.patch
fs-super-dynamically-allocate-the-s_shrink.patch
mm-shrinker-remove-old-apis.patch
mm-shrinker-add-a-secondary-array-for-shrinker_info-map-nr_deferred.patch
mm-shrinker-rename-preallocunregister_memcg_shrinker-to-shrinker_memcg_allocremove.patch
mm-shrinker-make-global-slab-shrink-lockless.patch
mm-shrinker-make-memcg-slab-shrink-lockless.patch
mm-shrinker-hold-write-lock-to-reparent-shrinker-nr_deferred.patch
mm-shrinker-convert-shrinker_rwsem-to-mutex.patch
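As a footnote to the step 6 comment in the patch above, the teardown side
that pairs with the lockless readers can be sketched as follows.  Again
this is illustrative only, reusing the invented sketch_*() helpers from
the sketch earlier in this mail; the real logic lives in shrinker_free().
It assumes the refcount was set to 1 and the completion initialised when
the object was registered.

```c
static void sketch_unregister(struct shrinker_sketch *s)
{
	/*
	 * Unpublish the object first (e.g. idr_remove()) so that new
	 * lookups can no longer find it; sketch_try_get() still guards
	 * against lookups that raced with the removal.
	 */

	/* Drop the initial reference taken at registration time. */
	sketch_put(s);

	/*
	 * Any reader that won sketch_try_get() still holds a reference;
	 * wait until the last sketch_put() calls complete().
	 */
	wait_for_completion(&s->done);

	/*
	 * No reader holds a reference now, but one may still be inside
	 * an RCU read-side section that looked the object up, so the
	 * actual freeing must be deferred by RCU (e.g. kfree_rcu()).
	 */
}
```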