Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless

Sultan Alsawaf <sultan@xxxxxxxxxxxxxxx> · Fri, 24 Feb 2023 00:20:12 -0800

On Fri, Feb 24, 2023 at 12:00:21PM +0800, Qi Zheng wrote:
> 
> 
> On 2023/2/24 02:24, Sultan Alsawaf wrote:
> > On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
> > > The shrinker_rwsem is a global lock in shrinkers subsystem,
> > > it is easy to cause blocking in the following cases:
> > > 
> > > a. the write lock of shrinker_rwsem was held for too long.
> > >     For example, there are many memcgs in the system, which
> > >     causes some paths to hold locks and traverse it for too
> > >     long. (e.g. expand_shrinker_info())
> > > b. the read lock of shrinker_rwsem was held for too long,
> > >     and a writer came at this time. Then this writer will be
> > >     forced to wait and block all subsequent readers.
> > >     For example:
> > >     - be scheduled when the read lock of shrinker_rwsem is
> > >       held in do_shrink_slab()
> > >     - some shrinker are blocked for too long. Like the case
> > >       mentioned in the patchset[1].
> > > 
> > > Therefore, many times in history ([2],[3],[4],[5]), some
> > > people wanted to replace shrinker_rwsem reader with SRCU,
> > > but they all gave up because SRCU was not unconditionally
> > > enabled.
> > > 
> > > But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
> > > the SRCU is unconditionally enabled. So it's time to use
> > > SRCU to protect readers who previously held shrinker_rwsem.
> > > 
> > > [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@xxxxxxxxxxxxx/
> > > [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@xxxxxxxxxxxx/
> > > [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@xxxxxxxxxxxxxxxxxxx/
> > > [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
> > > [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@xxxxxxxxxxxxxxx/
> > > 
> > > Signed-off-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
> > > ---
> > >   mm/vmscan.c | 27 +++++++++++----------------
> > >   1 file changed, 11 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 9f895ca6216c..02987a6f95d1 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
> > >   LIST_HEAD(shrinker_list);
> > >   DECLARE_RWSEM(shrinker_rwsem);
> > > +DEFINE_SRCU(shrinker_srcu);
> > >   #ifdef CONFIG_MEMCG
> > >   static int shrinker_nr_max;
> > > @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
> > >   void register_shrinker_prepared(struct shrinker *shrinker)
> > >   {
> > >   	down_write(&shrinker_rwsem);
> > > -	list_add_tail(&shrinker->list, &shrinker_list);
> > > +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
> > >   	shrinker->flags |= SHRINKER_REGISTERED;
> > >   	shrinker_debugfs_add(shrinker);
> > >   	up_write(&shrinker_rwsem);
> > > @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
> > >   		return;
> > >   	down_write(&shrinker_rwsem);
> > > -	list_del(&shrinker->list);
> > > +	list_del_rcu(&shrinker->list);
> > >   	shrinker->flags &= ~SHRINKER_REGISTERED;
> > >   	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > >   		unregister_memcg_shrinker(shrinker);
> > >   	debugfs_entry = shrinker_debugfs_remove(shrinker);
> > >   	up_write(&shrinker_rwsem);
> > > +	synchronize_srcu(&shrinker_srcu);
> > > +
> > >   	debugfs_remove_recursive(debugfs_entry);
> > >   	kfree(shrinker->nr_deferred);
> > > @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
> > >   {
> > >   	down_write(&shrinker_rwsem);
> > >   	up_write(&shrinker_rwsem);
> > > +	synchronize_srcu(&shrinker_srcu);
> > >   }
> > >   EXPORT_SYMBOL(synchronize_shrinkers);
> > > @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > >   {
> > >   	unsigned long ret, freed = 0;
> > >   	struct shrinker *shrinker;
> > > +	int srcu_idx;
> > >   	/*
> > >   	 * The root memcg might be allocated even though memcg is disabled
> > > @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > >   	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> > >   		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> > > -	if (!down_read_trylock(&shrinker_rwsem))
> > > -		goto out;
> > > +	srcu_idx = srcu_read_lock(&shrinker_srcu);
> > > -	list_for_each_entry(shrinker, &shrinker_list, list) {
> > > +	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
> > > +				 srcu_read_lock_held(&shrinker_srcu)) {
> > >   		struct shrink_control sc = {
> > >   			.gfp_mask = gfp_mask,
> > >   			.nid = nid,
> > > @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > >   		if (ret == SHRINK_EMPTY)
> > >   			ret = 0;
> > >   		freed += ret;
> > > -		/*
> > > -		 * Bail out if someone want to register a new shrinker to
> > > -		 * prevent the registration from being stalled for long periods
> > > -		 * by parallel ongoing shrinking.
> > > -		 */
> > > -		if (rwsem_is_contended(&shrinker_rwsem)) {
> > > -			freed = freed ? : 1;
> > > -			break;
> > > -		}
> > >   	}
> > > -	up_read(&shrinker_rwsem);
> > > -out:
> > > +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
> > >   	cond_resched();
> > >   	return freed;
> > >   }
> > > -- 
> > > 2.20.1
> > > 
> > > 
> > 
> > Hi Qi,
> > 
> > A different problem I realized after my old attempt to use SRCU was that the
> > unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
> > call. Both register_shrinker() *and* unregister_shrinker() are called frequently
> > these days, and SRCU is too unfair to the unregister path IMO.
> 
> Hi Sultan,
> 
> IIUC, for unregister_shrinker(), the wait time is hardly longer with
> SRCU than with shrinker_rwsem before.

The wait time can be quite different because with shrinker_rwsem, the
rwsem_is_contended() bailout would cause unregister_shrinker() to wait for only
one random shrinker to finish at worst rather than waiting for *all* shrinkers
to finish.

> And I just did a simple test. After using the script in cover letter to
> increase the shrink_slab hotspot, I did umount 1k times at the same
> time, and then I used bpftrace to measure the time consumption of
> unregister_shrinker() as follows:
> 
> bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; }
> kretprobe:unregister_shrinker /@start[tid]/ { @ns[comm] = hist(nsecs -
> @start[tid]); delete(@start[tid]); }'
> 
> @ns[umount]:
> [16K, 32K)             3 |      |
> [32K, 64K)            66 |@@@@@@@@@@      |
> [64K, 128K)           32 |@@@@@      |
> [128K, 256K)          22 |@@@      |
> [256K, 512K)          48 |@@@@@@@      |
> [512K, 1M)            19 |@@@      |
> [1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@      |
> [2M, 4M)             313
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
> |
> [8M, 16M)             55 |@@@@@@@@@
> 
> I see that the highest time-consuming of unregister_shrinker() is between
> 8ms and 16ms, which feels tolerable?

If you've got a fast x86 machine then I'd say that's a bit slow. :)

This depends a lot on which shrinkers are active on your system and how much
work each one does upon running. If a driver's shrinker doesn't have much to do
because there's nothing it can shrink further, then it'll run fast. Conversely,
if a driver is stressed in a way that constantly creates a lot of potential work
for its shrinker, then the shrinker will run longer.

Since shrinkers are allowed to sleep, the delays can really add up when waiting
for all of them to finish running. In the past, I recall observing delays of
100ms+ in unregister_shrinker() on slower arm64 hardware when I stress tested
the SRCU approach.

If your GPU driver has a shrinker (such as i915), I suggest testing again under
heavy GPU load. The GPU shrinkers can be pretty heavy IIRC.

Thanks,
Sultan

> Thanks,
> Qi
> 
> > 
> > Although I never got around to submitting it, I made a non-SRCU solution [1]
> > that uses fine-grained locking instead, which is fair to both the register path
> > and unregister path. (The patch I've linked is a version of this adapted to an
> > older 4.14 kernel FYI, but it can be reworked for the current kernel.)
> > 
> > What do you think about the fine-grained locking approach?
> > 
> > Thanks,
> > Sultan
> > 
> > [1] https://github.com/kerneltoast/android_kernel_google_floral/commit/012378f3173a82d2333d3ae7326691544301e76a
> > 
> 
> -- 
> Thanks,
> Qi