On Tue, Nov 05, 2013 at 05:01:37PM +0800, Li Zefan wrote: > On 2013/11/4 21:43, Markus Blank-Burian wrote: > >> synchronize_rcu() is a block operation and can keep us waiting for > >> a long period, so instead it's possible that usage never goes down > >> to 0 and we are in a dead loop. > > > > Ok, I didn't think of that. Tracing shows that the function keeps > > looping. The last lines repeat indefinitely. > > > ... > > kworker/3:5-7605 [003] .... 987.475678: > > mem_cgroup_reparent_charges: usage: 1568768 > > kworker/3:5-7605 [003] .... 987.478677: > > mem_cgroup_reparent_charges: usage: 1568768 > > kworker/3:5-7605 [003] .... 987.481675: > > mem_cgroup_reparent_charges: usage: 1568768 > > So it's much more likely this is a memcg bug rather than a cgroup bug. > I hope memcg guys could look into it, or you could do a git-bisect if > you can reliably reproduce the bug. I think there is a problem with ref counting and memcg. The old scheme would wait with the charge reparenting until all references were gone for good, whereas the new scheme has only a RCU grace period between disabling tryget and offlining the css. Unfortunately, memory cgroups don't hold the rcu_read_lock() over both the tryget and the res_counter charge that would make it visible to offline_css(), which means that there is a possible race condition between cgroup teardown and an ongoing charge: #0: destroy #1: charge rcu_read_lock() css_tryget() rcu_read_unlock() disable tryget() call_rcu() offline_css() reparent_charges() res_counter_charge() css_put() css_free() pc->mem_cgroup = deadcg add page to lru If the res_counter is hierarchical, there is now a leaked charge from the dead group in the parent counter with no corresponding page on the LRU, which will lead to this endless loop when deleting the parent. The race window can be seconds if the res_counter hits its limit and page reclaim is entered between css_tryget() and the res counter charge succeeding. I thought about calling reparent_charges() again from css_free() at first to catch any raced charges. But that won't work if the last reference is actually put by the charger because then it'll drop into the loop before putting the page on the LRU. The lifetime management in memory cgroups is a disaster and it's going to require some thought to fix. Even before the cgroups rewrite, swapin accounting was prone to this race condition because a task from a completely different cgroup can start charging a swap-in page against the cgroup that owned the page on swapout, a cgroup that might be exiting and had been found to have no tasks, no child groups, and no outstanding references anymore. -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html