Re: Possible regression with cgroups in 3.11

Johannes Weiner <hannes@xxxxxxxxxxx> · Thu, 7 Nov 2013 19:14:37 -0500

On Thu, Nov 07, 2013 at 06:53:01PM -0500, Johannes Weiner wrote:
> On Tue, Nov 05, 2013 at 05:01:37PM +0800, Li Zefan wrote:
> > On 2013/11/4 21:43, Markus Blank-Burian wrote:
> > >> synchronize_rcu() is a blocking operation and can keep us waiting for
> > >> a long period, so instead it's possible that usage never goes down
> > >> to 0 and we are in a dead loop.
> > > 
> > > Ok, I didn't think of that. Tracing shows that the function keeps
> > > looping. The last lines repeat indefinitely.
> > > 
> > ...
> > >      kworker/3:5-7605  [003] ....   987.475678: mem_cgroup_reparent_charges: usage: 1568768
> > >      kworker/3:5-7605  [003] ....   987.478677: mem_cgroup_reparent_charges: usage: 1568768
> > >      kworker/3:5-7605  [003] ....   987.481675: mem_cgroup_reparent_charges: usage: 1568768
> > 
> > So it's much more likely this is a memcg bug rather than a cgroup bug.
> > I hope memcg guys could look into it, or you could do a git-bisect if
> > you can reliably reproduce the bug.
> 
> I think there is a problem with ref counting and memcg.
> 
> The old scheme would wait with the charge reparenting until all
> references were gone for good, whereas the new scheme has only an RCU
> grace period between disabling tryget and offlining the css.
> Unfortunately, memory cgroups don't hold the rcu_read_lock() over both
> the tryget and the res_counter charge that would make it visible to
> offline_css(), which means that there is a possible race condition
> between cgroup teardown and an ongoing charge:
> 
> #0: destroy                #1: charge
> 
>                            rcu_read_lock()
>                            css_tryget()
>                            rcu_read_unlock()
> disable tryget()
> call_rcu()
>   offline_css()
>     reparent_charges()
>                            res_counter_charge()
>                            css_put()
>                              css_free()
>                            pc->mem_cgroup = deadcg
>                            add page to lru
> 
> If the res_counter is hierarchical, there is now a leaked charge from
> the dead group in the parent counter with no corresponding page on the
> LRU, which will lead to this endless loop when deleting the parent.

Hugh, I bet your problem is actually the same thing: reparent_charges
is looping and the workqueue is not actually stuck.  Waiting for the
CPU callbacks is simply the longest thing the loop does, so it's the
most likely to show up in the trace.

> The race window can be seconds if the res_counter hits its limit and
> page reclaim is entered between css_tryget() and the res counter
> charge succeeding.
> 
> I thought about calling reparent_charges() again from css_free() at
> first to catch any raced charges.  But that won't work if the last
> reference is actually put by the charger because then it'll drop into
> the loop before putting the page on the LRU.

Actually, it *should* work after all, because the final css_put()
schedules css_free() in a workqueue, which will just block until the
charge finishes and lru-links the page.  Not the most elegant
behavior, but hey, neither is livelocking!
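
To make the interleaving concrete, here is a minimal single-threaded
sketch in user-space C.  It is not kernel code: the toy_* names, the
struct layout and the fixed 4096-byte page are all assumptions made for
illustration, and the "race" is simply forced by call order.  It only
models a hierarchical counter plus an LRU page count, and reports the
parent usage that a later reparenting pass could not get rid of.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE 4096L  /* assumed page size, illustration only */

/* Hypothetical stand-in for a memcg: hierarchical counter + LRU. */
struct toy_cg {
	struct toy_cg *parent;
	long usage;      /* bytes charged here, including descendants */
	long lru_pages;  /* pages linked on this group's LRU */
	bool tryget_ok;  /* cleared once teardown has started */
	int refs;
};

static bool toy_tryget(struct toy_cg *cg)
{
	if (!cg->tryget_ok)
		return false;
	cg->refs++;
	return true;
}

/* Hierarchical charge: the group and every ancestor are charged. */
static void toy_charge(struct toy_cg *cg, long bytes)
{
	for (; cg; cg = cg->parent)
		cg->usage += bytes;
}

/*
 * Move every LRU page to the parent.  The parent's counter already
 * includes the page, so only the child's own counter is uncharged.
 * Charges with no page on the LRU cannot be found and stay behind.
 */
static void toy_reparent(struct toy_cg *cg)
{
	assert(cg->parent != NULL);
	while (cg->lru_pages > 0) {
		cg->lru_pages--;
		cg->usage -= TOY_PAGE;
		cg->parent->lru_pages++;
	}
}

/*
 * Replay the interleaving from the diagram and return the usage left
 * in the parent after the parent itself is reparented.  A non-zero
 * result is what a "loop until usage == 0" would spin on forever.
 */
static long toy_parent_residual(bool reparent_on_free)
{
	struct toy_cg root = { .parent = NULL, .tryget_ok = true };
	struct toy_cg parent = { .parent = &root, .tryget_ok = true };
	struct toy_cg child = { .parent = &parent, .tryget_ok = true };

	bool got = toy_tryget(&child);     /* #1: css_tryget() succeeds */
	assert(got);

	child.tryget_ok = false;           /* #0: disable tryget */
	toy_reparent(&child);              /* #0: offline, LRU still empty */

	toy_charge(&child, TOY_PAGE);      /* #1: raced charge */
	child.lru_pages++;                 /* #1: page goes on the dead LRU */
	child.refs--;                      /* #1: final put */

	if (reparent_on_free)              /* free path: catch raced charges */
		toy_reparent(&child);

	toy_reparent(&parent);             /* later: delete the parent */
	return parent.usage;
}
```

In this model toy_parent_residual(false) leaves one page's worth of
usage stranded in the parent, while toy_parent_residual(true) drains it
to zero, which is the effect the second reparenting pass is after.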

So how about this?

---
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Subject: [patch] mm: memcg: reparent charges during css_free()

Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: stable@xxxxxxxxxx # 3.8+
---
 mm/memcontrol.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cc4f9cbe760e..3dce2b50891c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6341,7 +6341,34 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-
+	/*
+	 * XXX: css_offline() would be where we should reparent all
+	 * memory to prepare the cgroup for destruction.  However,
+	 * memcg does not do css_tryget() and res_counter charging
+	 * under the same RCU lock region, which means that charging
+	 * could race with offlining, potentially leaking charges and
+	 * sending out pages with stale cgroup pointers:
+	 *
+	 * #0                        #1
+	 *                           rcu_read_lock()
+	 *                           css_tryget()
+	 *                           rcu_read_unlock()
+	 * disable css_tryget()
+	 * call_rcu()
+	 *   offline_css()
+	 *     reparent_charges()
+	 *                           res_counter_charge()
+	 *                           css_put()
+	 *                             css_free()
+	 *                           pc->mem_cgroup = dead memcg
+	 *                           add page to lru
+	 *
+	 * We still reparent most charges in offline_css() simply
+	 * because we don't want all these pages stuck if a long-term
+	 * reference like a swap entry is holding on to the cgroup
+	 * past offlining, but make sure we catch any raced charges:
+	 */
+	mem_cgroup_reparent_charges(memcg);
 	memcg_destroy_kmem(memcg);
 	__mem_cgroup_free(memcg);
 }
-- 
1.8.4.2
