On Thu, Aug 16 2012, Glauber Costa wrote:

> A lot of the initialization we do in mem_cgroup_create() is done with
> softirqs enabled. This includes grabbing a css id, which holds
> &ss->id_lock->rlock, and the per-zone trees, which hold
> rtpz->lock->rlock. All of those signal to the lockdep mechanism that
> those locks can be used in SOFTIRQ-ON-W context. This means that the
> freeing of the memcg structure must happen in a compatible context,
> otherwise we'll get a deadlock.
>
> The reference counting mechanism we use allows the memcg structure to
> be freed later and outlive the actual memcg destruction from the
> filesystem. However, we have little, if any, means to guarantee in
> which context the last memcg_put will happen. The best we can do is
> test it and try to make sure no invalid context releases are
> happening. But as we add more code to memcg, the possible interactions
> grow in number and expose more ways to get context conflicts.
>
> Greg Thelen reported a bug with that patchset applied that would
> trigger if a task held a reference to a memcg through its kmem
> counter. This would mean that killing that task would eventually get
> us to __mem_cgroup_free() after dropping the last kernel page
> reference, in an invalid IN-SOFTIRQ-W context.
>
> Besides that, he raised the quite valid concern that keeping the full
> memcg around for an unbounded period of time can eventually exhaust
> the css_id space and pin a lot of memory that is no longer needed. For
> instance, O(nr_cpus) worth of per-cpu data for the stats is kept
> around, and we don't expect to use it after the memcg is gone.
>
> Both of those problems can be avoided by freeing as much as we can in
> mem_cgroup_destroy(), and leaving only the memcg structure and the
> static branches to be removed later. That freeing runs in a
> predictable context, getting rid of the softirq problem, and it also
> reduces pressure on both the css_id space and the total amount of
> dangling memory.

Thank you for the patch. I think it is a step in the right direction, but
I suspect a problem (noted below).

> I consider this safe because all the page_cgroup references to user
> pages are reparented to the immediate parent, so late uncharges won't
> trigger the common uncharge paths with a destroyed memcg.
>
> Although we don't migrate kernel pages to the parent, we also don't
> call the common uncharge paths for those pages, rather uncharging the
> res_counters directly. So we are safe on this side of the wall as
> well.
>
> Signed-off-by: Glauber Costa <glommer@xxxxxxxxxxxxx>
> Reported-by: Greg Thelen <gthelen@xxxxxxxxxx>
> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> CC: Michal Hocko <mhocko@xxxxxxx>
> CC: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
>  mm/memcontrol.c | 25 ++++++++++++-------------
>  1 file changed, 12 insertions(+), 13 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9a82965..78cb394 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5169,18 +5169,9 @@ static void free_work(struct work_struct *work)
>  	vfree(memcg);
>  }
>
> -static void free_rcu(struct rcu_head *rcu_head)
> -{
> -	struct mem_cgroup *memcg;
> -
> -	memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
> -	INIT_WORK(&memcg->work_freeing, free_work);
> -	schedule_work(&memcg->work_freeing);
> -}
> -
>  /*
> - * At destroying mem_cgroup, references from swap_cgroup can remain.
> - * (scanning all at force_empty is too costly...)
> + * At destroying mem_cgroup, references from swap_cgroup and other places can
> + * remain. (scanning all at force_empty is too costly...)
>   *
>   * Instead of clearing all references at force_empty, we remember
>   * the number of reference from swap_cgroup and free mem_cgroup when
> @@ -5188,6 +5179,14 @@ static void free_rcu(struct rcu_head *rcu_head)
>   *
>   * Removal of cgroup itself succeeds regardless of refs from swap.
>   */
> +static void free_rcu(struct rcu_head *rcu_head)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
> +	INIT_WORK(&memcg->work_freeing, free_work);
> +	schedule_work(&memcg->work_freeing);
> +}
>
>  static void __mem_cgroup_free(struct mem_cgroup *memcg)
>  {
> @@ -5200,7 +5199,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>  		free_mem_cgroup_per_zone_info(memcg, node);
>
>  	free_percpu(memcg->stat);
> -	call_rcu(&memcg->rcu_freeing, free_rcu);
>  }
>
>  static void mem_cgroup_get(struct mem_cgroup *memcg)
> @@ -5212,7 +5210,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
>  {
>  	if (atomic_sub_and_test(count, &memcg->refcnt)) {
>  		struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> -		__mem_cgroup_free(memcg);
> +		call_rcu(&memcg->rcu_freeing, free_rcu);
>  		if (parent)
>  			mem_cgroup_put(parent);
>  	}
> @@ -5377,6 +5375,7 @@ static void mem_cgroup_destroy(struct cgroup *cont)
>
>  	kmem_cgroup_destroy(memcg);
>
> +	__mem_cgroup_free(memcg);

I suspect that this will free the css_id before all swap entries have
dropped their references with mem_cgroup_uncharge_swap(). I think we only
want to call __mem_cgroup_free() once all non-kernel-page references have
been released, which would include mem_cgroup_destroy() and any pending
calls to mem_cgroup_uncharge_swap(). I'm not sure, but maybe a second
refcount, or some optimization using the kmem_accounted bitmask, could
handle this efficiently (see the rough sketch after the quoted patch).

>  	mem_cgroup_put(memcg);
>  }
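To make the second-refcount idea a bit more concrete, here is a very
rough, untested sketch. The field and helper names (free_refs,
mem_cgroup_get_free_ref(), mem_cgroup_put_free_ref()) are invented for
illustration and are not existing memcg API; the point is only that the
memcg would hold one reference for the cgroup itself plus one per swap
entry still charged to it, and __mem_cgroup_free() would run only when
the last of those references is dropped.

/*
 * Illustrative sketch only: free_refs and the helpers below are
 * hypothetical.  The memcg keeps one reference for the cgroup itself
 * plus one per charged swap entry, so the css_id and the per-zone and
 * per-cpu data stay valid until every pending
 * mem_cgroup_uncharge_swap() has run.
 */
struct mem_cgroup {
	/* ... existing fields ... */
	atomic_t free_refs;		/* cgroup itself + swap entries */
};

static void mem_cgroup_get_free_ref(struct mem_cgroup *memcg)
{
	atomic_inc(&memcg->free_refs);	/* taken per charged swap entry */
}

static void mem_cgroup_put_free_ref(struct mem_cgroup *memcg)
{
	/* last dropper (destroy or uncharge_swap) does the real freeing */
	if (atomic_dec_and_test(&memcg->free_refs))
		__mem_cgroup_free(memcg);
}

With something like this, mem_cgroup_destroy() would drop the cgroup's
own reference instead of calling __mem_cgroup_free() directly, and
mem_cgroup_uncharge_swap() would drop one reference per entry, so
whichever runs last does the freeing, still in process context.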