On 08/17/2012 08:51 AM, Greg Thelen wrote:
> On Thu, Aug 16 2012, Glauber Costa wrote:
>
>> A lot of the initialization we do in mem_cgroup_create() is done with
>> softirqs enabled. This includes grabbing a css id, which holds
>> &ss->id_lock->rlock, and the per-zone trees, which hold
>> rtpz->lock->rlock. All of those signal to the lockdep mechanism that
>> those locks can be used in SOFTIRQ-ON-W context. This means that the
>> freeing of the memcg structure must happen in a compatible context,
>> otherwise we'll get a deadlock.
>>
>> The reference counting mechanism we use allows the memcg structure to
>> be freed later and outlive the actual memcg destruction from the
>> filesystem. However, we have little, if any, means to guarantee in
>> which context the last memcg_put will happen. The best we can do is
>> test it and try to make sure no invalid context releases are
>> happening. But as we add more code to memcg, the possible interactions
>> grow in number and expose more ways to get context conflicts.
>>
>> Greg Thelen reported a bug with that patchset applied that would
>> trigger if a task held a reference to a memcg through its kmem
>> counter. Killing that task would eventually get us to
>> __mem_cgroup_free() after dropping the last kernel page reference, in
>> an invalid IN-SOFTIRQ-W context.
>>
>> Besides that, he raised the quite valid concern that keeping the full
>> memcg around for an unbounded period of time can eventually exhaust
>> the css_id space and pin a lot of unneeded memory. For instance,
>> O(nr_cpus) worth of per-cpu data for the stats is kept around, and we
>> don't expect to use it after the memcg is gone.
>>
>> Both problems can be avoided by freeing as much as we can in
>> mem_cgroup_destroy(), leaving only the memcg structure and the static
>> branches to be removed later. That freeing runs in a predictable
>> context, getting rid of the softirq problem, and also reduces pressure
>> both on the css_id space and on total dangling memory.
>
> Thank you for the patch. I think it is a step in the right direction,
> but I suspect a problem (noted below).
>
>> I consider this safe because all the page_cgroup references to user
>> pages are reparented to the immediate parent, so late uncharges won't
>> trigger the common uncharge paths with a destroyed memcg.
>>
>> Although we don't migrate kernel pages to the parent, we also don't
>> call the common uncharge paths for those pages, rather uncharging the
>> res_counters directly. So we are safe on this side of the wall as
>> well.
>>
>> Signed-off-by: Glauber Costa <glommer@xxxxxxxxxxxxx>
>> Reported-by: Greg Thelen <gthelen@xxxxxxxxxx>
>> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>> CC: Michal Hocko <mhocko@xxxxxxx>
>> CC: Johannes Weiner <hannes@xxxxxxxxxxx>
>> ---
>>  mm/memcontrol.c | 25 ++++++++++++-------------
>>  1 file changed, 12 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 9a82965..78cb394 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5169,18 +5169,9 @@ static void free_work(struct work_struct *work)
>>  	vfree(memcg);
>>  }
>>
>> -static void free_rcu(struct rcu_head *rcu_head)
>> -{
>> -	struct mem_cgroup *memcg;
>> -
>> -	memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
>> -	INIT_WORK(&memcg->work_freeing, free_work);
>> -	schedule_work(&memcg->work_freeing);
>> -}
>> -
>>  /*
>> - * At destroying mem_cgroup, references from swap_cgroup can remain.
>> - * (scanning all at force_empty is too costly...)
>> + * At destroying mem_cgroup, references from swap_cgroup and other places can
>> + * remain. (scanning all at force_empty is too costly...)
>>   *
>>   * Instead of clearing all references at force_empty, we remember
>>   * the number of reference from swap_cgroup and free mem_cgroup when
>> @@ -5188,6 +5179,14 @@ static void free_rcu(struct rcu_head *rcu_head)
>>   *
>>   * Removal of cgroup itself succeeds regardless of refs from swap.
>>   */
>> +static void free_rcu(struct rcu_head *rcu_head)
>> +{
>> +	struct mem_cgroup *memcg;
>> +
>> +	memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
>> +	INIT_WORK(&memcg->work_freeing, free_work);
>> +	schedule_work(&memcg->work_freeing);
>> +}
>>
>>  static void __mem_cgroup_free(struct mem_cgroup *memcg)
>>  {
>> @@ -5200,7 +5199,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>>  		free_mem_cgroup_per_zone_info(memcg, node);
>>
>>  	free_percpu(memcg->stat);
>> -	call_rcu(&memcg->rcu_freeing, free_rcu);
>>  }
>>
>>  static void mem_cgroup_get(struct mem_cgroup *memcg)
>> @@ -5212,7 +5210,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
>>  {
>>  	if (atomic_sub_and_test(count, &memcg->refcnt)) {
>>  		struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>> -		__mem_cgroup_free(memcg);
>> +		call_rcu(&memcg->rcu_freeing, free_rcu);
>>  		if (parent)
>>  			mem_cgroup_put(parent);
>>  	}
>> @@ -5377,6 +5375,7 @@ static void mem_cgroup_destroy(struct cgroup *cont)
>>
>>  	kmem_cgroup_destroy(memcg);
>>
>> +	__mem_cgroup_free(memcg);
>
> I suspect that this will free the css_id before all swap entries have
> dropped their references with mem_cgroup_uncharge_swap()? I think we
> only want to call __mem_cgroup_free() once all non-kernel-page
> references have been released. This would include mem_cgroup_destroy()
> and any pending calls to mem_cgroup_uncharge_swap(). I'm not sure, but
> maybe a second refcount or some optimization with the kmem_accounted
> bitmask can handle this efficiently.
>

Can we demonstrate that?

I agree there might be a potential problem, and that is why I sent this
separately. But the impression I got after testing and reading the code
was that the memcg information in pc->mem_cgroup is updated to point to
the parent. This means that any later call to uncharge or uncharge_swap
would just uncharge from the parent memcg, and we'd have no problem.
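For readers following along, here is a minimal, self-contained sketch of
the call_rcu() -> workqueue deferral that the patch keeps for the final
free of the memcg structure itself. This is an illustration only, not the
mm/memcontrol.c code: the names demo_memcg, demo_free_work, demo_free_rcu
and demo_memcg_put are made up for the example.

	/*
	 * Hypothetical example of the deferred-free pattern discussed above.
	 * Not the real kernel symbols.
	 */
	#include <linux/atomic.h>
	#include <linux/kernel.h>
	#include <linux/rcupdate.h>
	#include <linux/vmalloc.h>
	#include <linux/workqueue.h>

	struct demo_memcg {
		atomic_t refcnt;
		struct rcu_head rcu_freeing;
		struct work_struct work_freeing;
	};

	/* Workqueue callback: process context, so a sleeping vfree() is fine. */
	static void demo_free_work(struct work_struct *work)
	{
		struct demo_memcg *m = container_of(work, struct demo_memcg,
						    work_freeing);
		vfree(m);
	}

	/* RCU callback: runs from softirq, so only bounce to a workqueue here. */
	static void demo_free_rcu(struct rcu_head *rcu_head)
	{
		struct demo_memcg *m = container_of(rcu_head, struct demo_memcg,
						    rcu_freeing);
		INIT_WORK(&m->work_freeing, demo_free_work);
		schedule_work(&m->work_freeing);
	}

	/*
	 * The last put can come from an unpredictable context.  With the patch
	 * it only defers the final free of the structure; the css id, per-zone
	 * info and percpu stats were already released in mem_cgroup_destroy(),
	 * which runs in a known, softirq-enabled process context.
	 */
	static void demo_memcg_put(struct demo_memcg *m)
	{
		if (atomic_dec_and_test(&m->refcnt))
			call_rcu(&m->rcu_freeing, demo_free_rcu);
	}

The point of the two-stage bounce is that RCU callbacks execute in softirq
context while vfree() may sleep, so the actual free has to be pushed back
into process context via a work item; the same context concern is why the
heavier __mem_cgroup_free() work (which touches locks lockdep has marked
SOFTIRQ-ON-W) is moved into mem_cgroup_destroy() by this patch.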