On Wed, Jun 05, 2013 at 01:20:23AM -0700, Tejun Heo wrote:
> Hello, Michal.
>
> On Wed, Jun 05, 2013 at 09:30:23AM +0200, Michal Hocko wrote:
> > > I don't really get that. As long as the amount is bound and the
> > > overhead negligible / acceptable, why does it matter how long the
> > > pinning persists?
> >
> > Because the amount is not bound either. Just create a hierarchy and
> > trigger the hard limit, and if you are careful enough you can always
> > keep some of the children in the cached pointer (with a css reference,
> > if you will) and then release the hierarchy. You can do that
> > repeatedly and leak a considerable amount of memory.
>
> It's still bound, no? Each live memcg can only keep a limited number
> of cgroups cached, right?

It is bounded by the number of memcgs. Each one can hold up to 12
(DEF_PRIORITY) references.

> > > We aren't talking about something gigantic or
> >
> > mem_cgroup is 888B now (depending on configuration). So I wouldn't
> > call it negligible.
>
> Do you think that the number can actually grow harmful? Would you be
> kind enough to share some calculations with me?

5k cgroups * say 10 priority levels * ~1k per struct mem_cgroup may pin
~50M of dead struct mem_cgroup, plus whatever else the css pins.

> > > In the off chance that this is a real problem, which I strongly
> > > doubt, as I wrote to Johannes, we can implement an extremely dumb
> > > cleanup routine rather than this weak reference beast.
> >
> > That was my first version (https://lkml.org/lkml/2013/1/3/298) and
> > Johannes didn't like it. To be honest I do not care _much_ which way
> > we go, but we definitely cannot pin those objects forever.
>
> I'll get to the barrier thread, but really complex barrier dancing
> like that is only justifiable in extremely hot paths that a lot of
> people pay attention to. It doesn't belong inside memcg proper. If
> the cached amount is an actual concern, let's please implement a
> simple cleanup thing.
> All we need is a single delayed_work which scans the tree
> periodically.
>
> Johannes, what do you think?

While I see your concerns about complexity (and this certainly is not
the most straightforward code), I really can't get too excited about
asynchronous garbage collection, even worse when it's time-based. It
would probably start out with less code, but two releases later we
would have added all the stuff required to get the interaction right
and to fix unpredictable reclaim disruption that hits when the reaper
coincides just right with heavy reclaim once a week, etc.

I just don't think that asynchronous models are simpler than state
machines. They are harder to reason about and harder to debug.

Now, there are separate things that add complexity to our current code:
the weak pointers, the lockless iterator, and the fact that all of it
is jam-packed into one monolithic iterator function. I can see why you
are not happy. But that does not mean we have to get rid of everything
wholesale.

You hate the barriers, so let's add a lock to access the iterator.
That path is not too hot in most cases. On the other hand, the weak
pointer is not too esoteric a pattern, and we can neatly abstract it
into two functions: one that takes an iterator and returns a verified
css reference or NULL, and one to invalidate pointers when called from
the memcg destruction code.

These two things should greatly simplify mem_cgroup_iter() while not
completely abandoning all our optimizations.

What do you think?