Re: + hugetlb-cgroup-simplify-pre_destroy-callback.patch added to -mm tree

"Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxxxxxxx> · Thu, 19 Jul 2012 17:51:05 +0530

Michal Hocko <mhocko@xxxxxxx> writes:

> On Wed 18-07-12 14:26:36, Andrew Morton wrote:
>> 
>> The patch titled
>>      Subject: hugetlb/cgroup: simplify pre_destroy callback
>> has been added to the -mm tree.  Its filename is
>>      hugetlb-cgroup-simplify-pre_destroy-callback.patch
>> 
>> Before you just go and hit "reply", please:
>>    a) Consider who else should be cc'ed
>>    b) Prefer to cc a suitable mailing list as well
>>    c) Ideally: find the original patch on the mailing list and do a
>>       reply-to-all to that, adding suitable additional cc's
>> 
>> *** Remember to use Documentation/SubmitChecklist when testing your code ***
>> 
>> The -mm tree is included into linux-next and is updated
>> there every 3-4 working days
>> 
>> ------------------------------------------------------
>> From: "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
>> Subject: hugetlb/cgroup: simplify pre_destroy callback
>> 
>> Since we cannot fail in hugetlb_cgroup_move_parent(), we don't really need
>> to check whether cgroup have any change left after that.  Also skip those
>> hstates for which we don't have any charge in this cgroup.
>
> IIUC this depends on a non-existent (cgroup) patch. I guess something
> like the patch at the end should address it. I haven't tested it though
> so it is not signed-off-by yet.
>
>> Based on an earlier patch from Wanpeng Li.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Wanpeng Li <liwanp@xxxxxxxxxxxxxxxxxx>
>> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>> Cc: Michal Hocko <mhocko@xxxxxxx>
>> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> ---
>> 
>>  mm/hugetlb_cgroup.c |   49 ++++++++++++++++++------------------------
>>  1 file changed, 21 insertions(+), 28 deletions(-)
>> 
>> diff -puN mm/hugetlb_cgroup.c~hugetlb-cgroup-simplify-pre_destroy-callback mm/hugetlb_cgroup.c
>> --- a/mm/hugetlb_cgroup.c~hugetlb-cgroup-simplify-pre_destroy-callback
>> +++ a/mm/hugetlb_cgroup.c
>> @@ -65,18 +65,6 @@ static inline struct hugetlb_cgroup *par
>>  	return hugetlb_cgroup_from_cgroup(cg->parent);
>>  }
>>  
>> -static inline bool hugetlb_cgroup_have_usage(struct cgroup *cg)
>> -{
>> -	int idx;
>> -	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cg);
>> -
>> -	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
>> -		if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
>> -			return true;
>> -	}
>> -	return false;
>> -}
>> -
>>  static struct cgroup_subsys_state *hugetlb_cgroup_create(struct cgroup *cgroup)
>>  {
>>  	int idx;
>> @@ -159,24 +147,29 @@ static int hugetlb_cgroup_pre_destroy(st
>>  {
>>  	struct hstate *h;
>>  	struct page *page;
>> -	int ret = 0, idx = 0;
>> +	int ret = 0, idx;
>> +	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
>>  
>> -	do {
>> -		if (cgroup_task_count(cgroup) ||
>> -		    !list_empty(&cgroup->children)) {
>> -			ret = -EBUSY;
>> -			goto out;
>> -		}
>> -		for_each_hstate(h) {
>> -			spin_lock(&hugetlb_lock);
>> -			list_for_each_entry(page, &h->hugepage_activelist, lru)
>> -				hugetlb_cgroup_move_parent(idx, cgroup, page);
>>  
>> -			spin_unlock(&hugetlb_lock);
>> -			idx++;
>> -		}
>> -		cond_resched();
>> -	} while (hugetlb_cgroup_have_usage(cgroup));
>> +	if (cgroup_task_count(cgroup) ||
>> +	    !list_empty(&cgroup->children)) {
>> +		ret = -EBUSY;
>> +		goto out;
>> +	}
>> +
>> +	for_each_hstate(h) {
>> +		/*
>> +		 * if we don't have any charge, skip this hstate
>> +		 */
>> +		idx = hstate_index(h);
>> +		if (res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE) == 0)
>> +			continue;
>> +		spin_lock(&hugetlb_lock);
>> +		list_for_each_entry(page, &h->hugepage_activelist, lru)
>> +			hugetlb_cgroup_move_parent(idx, cgroup, page);
>> +		spin_unlock(&hugetlb_lock);
>> +		VM_BUG_ON(res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE));
>> +	}
>>  out:
>>  	return ret;
>>  }
>> _
>
> ---
> From 621ed1c9dab63bd82205bd5266eb9974f86a0a3f Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@xxxxxxx>
> Date: Thu, 19 Jul 2012 13:23:23 +0200
> Subject: [PATCH] cgroup: keep cgroup_mutex locked for pre_destroy
>
> 3fa59dfb (cgroup: fix potential deadlock in pre_destroy) dropped the
> cgroup_mutex lock while calling pre_destroy callbacks because memory
> controller could deadlock because force_empty triggered reclaim.
> Since "memcg: move charges to root cgroup if use_hierarchy=0" there is
> no reclaim going on from mem_cgroup_force_empty though so we can safely
> keep the cgroup_mutex locked. This has an advantage that no tasks might
> be added during pre_destroy callback and so the handlers don't have to
> consider races when new tasks add new charges. This simplifies the
> implementation.
> ---
>  kernel/cgroup.c |    2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 0f3527d..9dba05d 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -4181,7 +4181,6 @@ again:
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
> -	mutex_unlock(&cgroup_mutex);
>
>  	/*
>  	 * In general, subsystem has no css->refcnt after pre_destroy(). But
> @@ -4204,7 +4203,6 @@ again:
>  		return ret;
>  	}
>
> -	mutex_lock(&cgroup_mutex);
>  	parent = cgrp->parent;
>  	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);

mem_cgroup_force_empty still calls 

lru_add_drain_all 
   ->schedule_on_each_cpu
        -> get_online_cpus
           ->mutex_lock(&cpu_hotplug.lock);

So wont we deadlock ?

-aneesh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>