Re: [PATCH] mm: slowly shrink slabs with a relatively small number of objects

Roman Gushchin <guro@xxxxxx> · Tue, 4 Sep 2018 08:34:49 -0700

On Tue, Sep 04, 2018 at 09:00:05AM +0200, Michal Hocko wrote:
> On Mon 03-09-18 13:28:06, Roman Gushchin wrote:
> > On Mon, Sep 03, 2018 at 08:29:56PM +0200, Michal Hocko wrote:
> > > On Fri 31-08-18 14:31:41, Roman Gushchin wrote:
> > > > On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> > > > > On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
> > > > > 
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index fa2c150ab7b9..c910cf6bf606 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct
> > > > > > shrink_control *shrinkctl,
> > > > > >  	delta = freeable >> priority;
> > > > > >  	delta *= 4;
> > > > > >  	do_div(delta, shrinker->seeks);
> > > > > > +
> > > > > > +	if (delta == 0 && freeable > 0)
> > > > > > +		delta = min(freeable, batch_size);
> > > > > > +
> > > > > >  	total_scan += delta;
> > > > > >  	if (total_scan < 0) {
> > > > > > >  		pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
> > > > > 
> > > > > I agree that we need to shrink slabs with fewer than
> > > > > 4096 objects, but do we want to put more pressure on
> > > > > a slab the moment it drops below 4096 than we applied
> > > > > when it had just over 4096 objects on it?
> > > > > 
> > > > > With this patch, a slab with 5000 objects on it will
> > > > > get 1 item scanned, while a slab with 4000 objects on
> > > > > it will see shrinker->batch or SHRINK_BATCH objects
> > > > > scanned every time.
> > > > > 
> > > > > I don't know if this would cause any issues, just
> > > > > something to ponder.
> > > > 
> > > > Hm, fair enough. So, basically we can always do
> > > > 
> > > >     delta = max(delta, min(freeable, batch_size));
> > > > 
> > > > Does it look better?
> > > 
> > > Why don't you use the same heuristic we use for the normal LRU reclaim?
> > 
> > Because we do reparent kmem lru lists on offlining.
> > Take a look at memcg_offline_kmem().
> 
> Then I must be missing something. Why are we growing the number of dead
> cgroups then?

We do reparent LRU lists, but not objects. Objects (or, more precisely, pages)
are still holding a reference to the memcg.

Let's say we have systemd periodically restarting some service in system.slice.
After the service's memcg is removed, all of its accounted objects are placed
on system.slice's LRU. But under light or moderate memory pressure we won't
scan that LRU at all unless it gets really big. If there are "only" a couple
thousand objects on it, we don't scan it, and we can easily end up with
several hundred pinned dying cgroups.
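
To make the arithmetic discussed above concrete, here is a minimal user-space
sketch of the three ways of computing delta: the existing code, the patch as
posted, and the max() variant. It is not the kernel code itself. The helper
names (delta_base, delta_patch, delta_max, min_u64, max_u64) are made up for
illustration, and it assumes the usual defaults DEF_PRIORITY = 12,
DEFAULT_SEEKS = 2 and SHRINK_BATCH = 128.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed kernel defaults, redefined here for a user-space illustration. */
#define SHRINK_BATCH  128
#define DEF_PRIORITY  12
#define DEFAULT_SEEKS 2

/* Hypothetical helpers standing in for the kernel's min()/max() macros. */
static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* Existing behaviour: delta = ((freeable >> priority) * 4) / seeks. */
static uint64_t delta_base(uint64_t freeable, int priority, unsigned int seeks)
{
	uint64_t delta;

	assert(seeks > 0);	/* guard the division below */
	delta = freeable >> priority;
	delta *= 4;
	return delta / seeks;
}

/* Patch as posted: only bump delta when it would otherwise be zero. */
static uint64_t delta_patch(uint64_t freeable, int priority,
			    unsigned int seeks, uint64_t batch_size)
{
	uint64_t delta = delta_base(freeable, priority, seeks);

	if (delta == 0 && freeable > 0)
		delta = min_u64(freeable, batch_size);
	return delta;
}

/* Follow-up suggestion: delta = max(delta, min(freeable, batch_size)). */
static uint64_t delta_max(uint64_t freeable, int priority,
			  unsigned int seeks, uint64_t batch_size)
{
	uint64_t delta = delta_base(freeable, priority, seeks);

	return max_u64(delta, min_u64(freeable, batch_size));
}
```

Under these assumptions, at DEF_PRIORITY a list with 4000 objects gets a delta
of 0 from the existing code and 128 from either variant. A list with 5000
objects gets 2 from the existing code and from the posted patch (2 rather than
1 because DEFAULT_SEEKS is 2), but 128 from the max() variant, which removes
the jump at the 4096 boundary that Rik pointed out.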