On Wed, Jun 05, 2013 at 01:20:23AM -0700, Tejun Heo wrote:
> Hello, Michal.
>
> On Wed, Jun 05, 2013 at 09:30:23AM +0200, Michal Hocko wrote:
> > > I don't really get that. As long as the amount is bound and the
> > > overhead negligible / acceptable, why does it matter how long the
> > > pinning persists?
> >
> > Because the amount is not bound either. Just create a hierarchy and
> > trigger the hard limit, and if you are careful enough you can always
> > keep some of the children in the cached pointer (with a css reference,
> > if you will) and then release the hierarchy. You can do that
> > repeatedly and leak a considerable amount of memory.
>
> It's still bound, no? Each live memcg can only keep a limited number
> of cgroups cached, right?

It is bounded by the number of memcgs. Each one can hold up to 12
(DEF_PRIORITY) references.

> > > We aren't talking about something gigantic or
> >
> > mem_cgroup is 888B now (depending on configuration). So I wouldn't
> > call it negligible.
>
> Do you think that the number can actually grow harmful? Would you be
> kind enough to share some calculations with me?

5k cgroups * say 10 priority levels * ~1k per struct mem_cgroup may pin
~50M of dead struct mem_cgroup, plus whatever else the css pins.

> > > In the off chance that this is a real problem, which I strongly
> > > doubt, as I wrote to Johannes, we can implement an extremely dumb
> > > cleanup routine rather than this weak reference beast.
> >
> > That was my first version (https://lkml.org/lkml/2013/1/3/298) and
> > Johannes didn't like it. To be honest I do not care _much_ which way
> > we go, but we definitely cannot pin those objects forever.
>
> I'll get to the barrier thread, but really complex barrier dancing
> like that is only justifiable in extremely hot paths that a lot of
> people pay attention to. It doesn't belong inside memcg proper. If
> the cached amount is an actual concern, let's please implement a
> simple cleanup thing.
> All we need is a single delayed_work which scans the tree
> periodically.
>
> Johannes, what do you think?

While I see your concerns about complexity (and this certainly is not
the most straightforward code), I really can't get too excited about
asynchronous garbage collection, even worse when it's time-based. It
would probably start out with less code, but two releases later we
would have added all the stuff required to get the interaction right
and to fix unpredictable reclaim disruption that hits when the reaper
coincides just right with heavy reclaim once a week, etc.

I just don't think that asynchronous models are simpler than state
machines. They are harder to reason about and harder to debug.

Now, there are separate things that add complexity to our current code:
the weak pointers, the lockless iterator, and the fact that all of it
is jam-packed into one monolithic iterator function. I can see why you
are not happy. But that does not mean we have to get rid of everything
wholesale.

You hate the barriers, so let's add a lock to access the iterator.
That path is not too hot in most cases. On the other hand, the weak
pointer is not too esoteric a pattern, and we can neatly abstract it
into two functions: one that takes an iterator and returns a verified
css reference or NULL, and one to invalidate pointers when called from
the memcg destruction code.

These two things should greatly simplify mem_cgroup_iter() while not
completely abandoning all our optimizations.

What do you think?