>>>> kworker/3:5-7605 [003] .... 987.475678: >>>> mem_cgroup_reparent_charges: usage: 1568768 >>>> kworker/3:5-7605 [003] .... 987.478677: >>>> mem_cgroup_reparent_charges: usage: 1568768 >>>> kworker/3:5-7605 [003] .... 987.481675: >>>> mem_cgroup_reparent_charges: usage: 1568768 >>> >>> So it's much more likely this is a memcg bug rather than a cgroup bug. >>> I hope memcg guys could look into it, or you could do a git-bisect if >>> you can reliably reproduce the bug. >> >> I think there is a problem with ref counting and memcg. >> >> The old scheme would wait with the charge reparenting until all >> references were gone for good, whereas the new scheme has only a RCU >> grace period between disabling tryget and offlining the css. >> Unfortunately, memory cgroups don't hold the rcu_read_lock() over both >> the tryget and the res_counter charge that would make it visible to >> offline_css(), which means that there is a possible race condition >> between cgroup teardown and an ongoing charge: >> >> #0: destroy #1: charge >> >> rcu_read_lock() >> css_tryget() >> rcu_read_unlock() >> disable tryget() >> call_rcu() >> offline_css() >> reparent_charges() >> res_counter_charge() >> css_put() >> css_free() >> pc->mem_cgroup = deadcg >> add page to lru >> >> If the res_counter is hierarchical, there is now a leaked charge from >> the dead group in the parent counter with no corresponding page on the >> LRU, which will lead to this endless loop when deleting the parent. > > Hugh, I bet your problem is actually the same thing, where > reparent_charges is looping and the workqueue is not actually stuck, > it's just that waiting for the CPU callbacks is the longest thing the > loop does so it's most likely to show up in the trace. > >> The race window can be seconds if the res_counter hits its limit and >> page reclaim is entered between css_tryget() and the res counter >> charge succeeding. >> >> I thought about calling reparent_charges() again from css_free() at >> first to catch any raced charges. But that won't work if the last >> reference is actually put by the charger because then it'll drop into >> the loop before putting the page on the LRU. > > Actually, it *should* work after all, because the final css_put() > schedules css_free() in a workqueue, which will just block until the > charge finishes and lru-links the page. Not the most elegant > behavior, but hey, neither is livelocking! > > So how about this? > Thanks for the analysis and fix! But we fix the bug by adding a call to mem_cgroup_reparent_charge() in css_free() while we are dead-looping in css_offline()? We won't call css_free() if css_offline() hasn't finished. > --- > From: Johannes Weiner <hannes@xxxxxxxxxxx> > Subject: [patch] mm: memcg: reparent charges during css_free() > > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> > Cc: stable@xxxxxxxxxx # 3.8+ FYI, I don't think Markus and google ever experience this issue before 3.11. > --- > mm/memcontrol.c | 29 ++++++++++++++++++++++++++++- > 1 file changed, 28 insertions(+), 1 deletion(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index cc4f9cbe760e..3dce2b50891c 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6341,7 +6341,34 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) > static void mem_cgroup_css_free(struct cgroup_subsys_state *css) > { > struct mem_cgroup *memcg = mem_cgroup_from_css(css); > - > + /* > + * XXX: css_offline() would be where we should reparent all > + * memory to prepare the cgroup for destruction. However, > + * memcg does not do css_tryget() and res_counter charging > + * under the same RCU lock region, which means that charging > + * could race with offlining, potentially leaking charges and > + * sending out pages with stale cgroup pointers: > + * > + * #0 #1 > + * rcu_read_lock() > + * css_tryget() > + * rcu_read_unlock() > + * disable css_tryget() > + * call_rcu() > + * offline_css() > + * reparent_charges() > + * res_counter_charge() > + * css_put() > + * css_free() > + * pc->mem_cgroup = dead memcg > + * add page to lru > + * > + * We still reparent most charges in offline_css() simply > + * because we don't want all these pages stuck if a long-term > + * reference like a swap entry is holding on to the cgroup > + * past offlining, but make sure we catch any raced charges: > + */ > + mem_cgroup_reparent_charges(memcg); > memcg_destroy_kmem(memcg); > __mem_cgroup_free(memcg); > } > -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html