A lot of the initialization we do in mem_cgroup_create() is done with softirqs enabled. This include grabbing a css id, which holds &ss->id_lock->rlock, and the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This means that the freeing of memcg structure must happen in a compatible context, otherwise we'll get a deadlock. The reference counting mechanism we use allows the memcg structure to be freed later and outlive the actual memcg destruction from the filesystem. However, we have little, if any, means to guarantee in which context the last memcg_put will happen. The best we can do is test it and try to make sure no invalid context releases are happening. But as we add more code to memcg, the possible interactions grow in number and expose more ways to get context conflicts. Greg Thelen reported a bug with that patchset applied that would trigger if a task would hold a reference to a memcg through its kmem counter. This would mean that killing that task would eventually get us to __mem_cgroup_free() after dropping the last kernel page reference, in an invalid IN-SOFTIRQ-W. Besides that, he raised the quite valid concern that keeping the full memcg around for an unbounded period of time can eventually exhaust the css_id space, and pin a lot of not needed memory. For instance, a O(nr_cpus) percpu data for the stats is kept around, and we don't expect to use it after the memcg is gone. Both those problems can be avoided by freeing as much as we can in mem_cgroup_destroy(), and leaving only the memcg structure and the static branches to be removed later. That freeing run on a predictable context, getting rid of the softirq problem, and also reduces pressure both on the css_id space and total dangling memory. I consider this safe because all the page_cgroup references to user pages are reparented to the imediate parent, so late uncharges won't trigger the common uncharge paths with a destroyed memcg. Although we don't migrate kernel pages to parent, we also don't call the common uncharge paths for those pages, rather uncharging the res_counters directly. So we are safe on this side of the wall as well. Signed-off-by: Glauber Costa <glommer@xxxxxxxxxxxxx> Reported-by: Greg Thelen <gthelen@xxxxxxxxxx> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> CC: Michal Hocko <mhocko@xxxxxxx> CC: Johannes Weiner <hannes@xxxxxxxxxxx> --- mm/memcontrol.c | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9a82965..78cb394 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5169,18 +5169,9 @@ static void free_work(struct work_struct *work) vfree(memcg); } -static void free_rcu(struct rcu_head *rcu_head) -{ - struct mem_cgroup *memcg; - - memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing); - INIT_WORK(&memcg->work_freeing, free_work); - schedule_work(&memcg->work_freeing); -} - /* - * At destroying mem_cgroup, references from swap_cgroup can remain. - * (scanning all at force_empty is too costly...) + * At destroying mem_cgroup, references from swap_cgroup and other places can + * remain. (scanning all at force_empty is too costly...) * * Instead of clearing all references at force_empty, we remember * the number of reference from swap_cgroup and free mem_cgroup when @@ -5188,6 +5179,14 @@ static void free_rcu(struct rcu_head *rcu_head) * * Removal of cgroup itself succeeds regardless of refs from swap. */ +static void free_rcu(struct rcu_head *rcu_head) +{ + struct mem_cgroup *memcg; + + memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing); + INIT_WORK(&memcg->work_freeing, free_work); + schedule_work(&memcg->work_freeing); +} static void __mem_cgroup_free(struct mem_cgroup *memcg) { @@ -5200,7 +5199,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) free_mem_cgroup_per_zone_info(memcg, node); free_percpu(memcg->stat); - call_rcu(&memcg->rcu_freeing, free_rcu); } static void mem_cgroup_get(struct mem_cgroup *memcg) @@ -5212,7 +5210,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count) { if (atomic_sub_and_test(count, &memcg->refcnt)) { struct mem_cgroup *parent = parent_mem_cgroup(memcg); - __mem_cgroup_free(memcg); + call_rcu(&memcg->rcu_freeing, free_rcu); if (parent) mem_cgroup_put(parent); } @@ -5377,6 +5375,7 @@ static void mem_cgroup_destroy(struct cgroup *cont) kmem_cgroup_destroy(memcg); + __mem_cgroup_free(memcg); mem_cgroup_put(memcg); } -- 1.7.11.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>