Re: Possible regression with cgroups in 3.11

Johannes Weiner <hannes@xxxxxxxxxxx> · Thu, 7 Nov 2013 18:53:01 -0500

On Tue, Nov 05, 2013 at 05:01:37PM +0800, Li Zefan wrote:
> On 2013/11/4 21:43, Markus Blank-Burian wrote:
> >> synchronize_rcu() is a block operation and can keep us waiting for
> >> a long period, so instead it's possible that usage never goes down
> >> to 0 and we are in a dead loop.
> > 
> > Ok, I didn't think of that. Tracing shows that the function keeps
> > looping. The last lines repeat indefinitely.
> > 
> ...
> >      kworker/3:5-7605  [003] ....   987.475678:
> > mem_cgroup_reparent_charges: usage: 1568768
> >      kworker/3:5-7605  [003] ....   987.478677:
> > mem_cgroup_reparent_charges: usage: 1568768
> >      kworker/3:5-7605  [003] ....   987.481675:
> > mem_cgroup_reparent_charges: usage: 1568768
> 
> So it's much more likely this is a memcg bug rather than a cgroup bug.
> I hope memcg guys could look into it, or you could do a git-bisect if
> you can reliably reproduce the bug.

I think there is a problem with ref counting and memcg.

The old scheme would wait with the charge reparenting until all
references were gone for good, whereas the new scheme has only a RCU
grace period between disabling tryget and offlining the css.
Unfortunately, memory cgroups don't hold the rcu_read_lock() over both
the tryget and the res_counter charge that would make it visible to
offline_css(), which means that there is a possible race condition
between cgroup teardown and an ongoing charge:

#0: destroy                #1: charge

                           rcu_read_lock()
                           css_tryget()
                           rcu_read_unlock()
disable tryget()
call_rcu()
  offline_css()
    reparent_charges()
                           res_counter_charge()
                           css_put()
                             css_free()
                           pc->mem_cgroup = deadcg
                           add page to lru

If the res_counter is hierarchical, there is now a leaked charge from
the dead group in the parent counter with no corresponding page on the
LRU, which will lead to this endless loop when deleting the parent.

The race window can be seconds if the res_counter hits its limit and
page reclaim is entered between css_tryget() and the res counter
charge succeeding.

I thought about calling reparent_charges() again from css_free() at
first to catch any raced charges.  But that won't work if the last
reference is actually put by the charger because then it'll drop into
the loop before putting the page on the LRU.

The lifetime management in memory cgroups is a disaster and it's going
to require some thought to fix.  Even before the cgroups rewrite,
swapin accounting was prone to this race condition because a task from
a completely different cgroup can start charging a swap-in page
against the cgroup that owned the page on swapout, a cgroup that might
be exiting and had been found to have no tasks, no child groups, and
no outstanding references anymore.
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html