Re: Possible regression with cgroups in 3.11

Li Zefan <lizefan@xxxxxxxxxx> · Fri, 8 Nov 2013 16:36:34 +0800

>>>>      kworker/3:5-7605  [003] ....   987.475678:
>>>> mem_cgroup_reparent_charges: usage: 1568768
>>>>      kworker/3:5-7605  [003] ....   987.478677:
>>>> mem_cgroup_reparent_charges: usage: 1568768
>>>>      kworker/3:5-7605  [003] ....   987.481675:
>>>> mem_cgroup_reparent_charges: usage: 1568768
>>>
>>> So it's much more likely this is a memcg bug rather than a cgroup bug.
>>> I hope memcg guys could look into it, or you could do a git-bisect if
>>> you can reliably reproduce the bug.
>>
>> I think there is a problem with ref counting and memcg.
>>
>> The old scheme would wait with the charge reparenting until all
>> references were gone for good, whereas the new scheme has only a RCU
>> grace period between disabling tryget and offlining the css.
>> Unfortunately, memory cgroups don't hold the rcu_read_lock() over both
>> the tryget and the res_counter charge that would make it visible to
>> offline_css(), which means that there is a possible race condition
>> between cgroup teardown and an ongoing charge:
>>
>> #0: destroy                #1: charge
>>
>>                            rcu_read_lock()
>>                            css_tryget()
>>                            rcu_read_unlock()
>> disable tryget()
>> call_rcu()
>>   offline_css()
>>     reparent_charges()
>>                            res_counter_charge()
>>                            css_put()
>>                              css_free()
>>                            pc->mem_cgroup = deadcg
>>                            add page to lru
>>
>> If the res_counter is hierarchical, there is now a leaked charge from
>> the dead group in the parent counter with no corresponding page on the
>> LRU, which will lead to this endless loop when deleting the parent.
> 
> Hugh, I bet your problem is actually the same thing, where
> reparent_charges is looping and the workqueue is not actually stuck,
> it's just that waiting for the CPU callbacks is the longest thing the
> loop does so it's most likely to show up in the trace.
> 
>> The race window can be seconds if the res_counter hits its limit and
>> page reclaim is entered between css_tryget() and the res counter
>> charge succeeding.
>>
>> I thought about calling reparent_charges() again from css_free() at
>> first to catch any raced charges.  But that won't work if the last
>> reference is actually put by the charger because then it'll drop into
>> the loop before putting the page on the LRU.
> 
> Actually, it *should* work after all, because the final css_put()
> schedules css_free() in a workqueue, which will just block until the
> charge finishes and lru-links the page.  Not the most elegant
> behavior, but hey, neither is livelocking!
> 
> So how about this?
> 

Thanks for the analysis and fix!

But we fix the bug by adding a call to mem_cgroup_reparent_charge() in
css_free() while we are dead-looping in css_offline()? We won't call
css_free() if css_offline() hasn't finished.

> ---
> From: Johannes Weiner <hannes@xxxxxxxxxxx>
> Subject: [patch] mm: memcg: reparent charges during css_free()
> 
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: stable@xxxxxxxxxx # 3.8+

FYI, I don't think Markus and google ever experience this issue
before 3.11.

> ---
>  mm/memcontrol.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cc4f9cbe760e..3dce2b50891c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6341,7 +6341,34 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -
> +	/*
> +	 * XXX: css_offline() would be where we should reparent all
> +	 * memory to prepare the cgroup for destruction.  However,
> +	 * memcg does not do css_tryget() and res_counter charging
> +	 * under the same RCU lock region, which means that charging
> +	 * could race with offlining, potentially leaking charges and
> +	 * sending out pages with stale cgroup pointers:
> +	 *
> +	 * #0                        #1
> +	 *                           rcu_read_lock()
> +	 *                           css_tryget()
> +	 *                           rcu_read_unlock()
> +	 * disable css_tryget()
> +	 * call_rcu()
> +	 *   offline_css()
> +	 *     reparent_charges()
> +	 *                           res_counter_charge()
> +	 *                           css_put()
> +	 *                             css_free()
> +	 *                           pc->mem_cgroup = dead memcg
> +	 *                           add page to lru
> +	 *
> +	 * We still reparent most charges in offline_css() simply
> +	 * because we don't want all these pages stuck if a long-term
> +	 * reference like a swap entry is holding on to the cgroup
> +	 * past offlining, but make sure we catch any raced charges:
> +	 */
> +	mem_cgroup_reparent_charges(memcg);
>  	memcg_destroy_kmem(memcg);
>  	__mem_cgroup_free(memcg);
>  }
> 

--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html