On Thu, 14 Jul 2011 13:30:09 +0200 Michal Hocko <mhocko@xxxxxxx> wrote:

> What about this? Just compile tested:
> ---
> From 90ab974eb69c61c2e3b94beabe9b6745fa319936 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@xxxxxxx>
> Date: Wed, 13 Jul 2011 13:05:49 +0200
> Subject: [PATCH] memcg: make oom_lock 0 and 1 based rather than counter
>
> 867578cb "memcg: fix oom kill behavior" introduced the oom_lock counter
> which is incremented by mem_cgroup_oom_lock when we are about to handle
> a memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
> oom_lock > 1 to prevent multiple oom kills at the same time.
> The counter is then decremented by mem_cgroup_oom_unlock, called from the
> same function.
>
> This works correctly but it can lead to serious starvation when we
> have many processes triggering OOM.
>
> Consider a process (call it A) which gets the oom_lock (the first one
> that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex). All
> other processes are blocked on the mutex.
> While A releases the mutex and calls mem_cgroup_out_of_memory, the others
> wake up (one after another), increase the counter and fall asleep
> (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it takes the
> mutex again, decreases oom_lock and wakes the other tasks (if releasing
> memory of the killed task hasn't done it yet).
> The main problem here is that everybody still races for the mutex and
> there is no guarantee that we get the counter back to 0 for those that
> got back to mem_cgroup_handle_oom. In the end the whole convoy
> in/decreases the counter but we never get down to 1, which would enable
> killing, so nothing useful is going on.
> The time is basically unbounded because it highly depends on scheduling
> and the ordering on the mutex.
>
> This patch replaces the counter by simple {un}lock semantics. We are
> using only 0 and 1 to distinguish those two states.
> As mem_cgroup_oom_{un}lock works on a subtree of a hierarchy we have
> to make sure that nobody else races with us, which is guaranteed by
> memcg_oom_mutex. All other consumers just read the value atomically for
> a single group, which is sufficient because we set the value atomically.
> mem_cgroup_oom_lock has to be really careful because we might be
> higher in the hierarchy than an already oom-locked subtree of the same
> hierarchy:
>
>           A
>          / \
>         B   \
>        / \   \
>       C   D   E
>
> The B - C - D tree might already be locked. While we want to enable
> locking the E subtree, because those OOM situations cannot influence each
> other, we definitely do not want to allow locking A.
> Therefore we have to refuse the lock if any subtree is already locked and
> clear up the locks for all nodes that have been set up to the failure
> point.
> The unlock path is then very easy because we always unlock only the
> subtree we locked previously.
>
> Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
> ---
>  mm/memcontrol.c |   48 +++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 39 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e013b8e..29f00d0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1803,22 +1803,51 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  /*
>   * Check OOM-Killer is already running under our hierarchy.
>   * If someone is running, return false.
> + * Has to be called with memcg_oom_mutex
>   */
>  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
>  {
> -        int x, lock_count = 0;
> -        struct mem_cgroup *iter;
> +        int x, lock_count = -1;
> +        struct mem_cgroup *iter, *failed = NULL;
> +        bool cond = true;
>
> -        for_each_mem_cgroup_tree(iter, mem) {
> -                x = atomic_inc_return(&iter->oom_lock);
> -                lock_count = max(x, lock_count);
> +        for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> +                x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> +                if (lock_count == -1)
> +                        lock_count = x;
> +                else if (lock_count != x) {
> +                        /*
> +                         * this subtree of our hierarchy is already locked
> +                         * so we cannot give a lock.
> +                         */
> +                        lock_count = 0;
> +                        failed = iter;
> +                        cond = false;
> +                }
>          }

Hm? Assuming B-C-D is locked and a new thread tries a lock on A-B-C-D-E,
for_each_mem_cgroup_tree will find the groups in the order A->B->C->D->E.

Before lock:
        A 0
        B 1
        C 1
        D 1
        E 0

After lock:
        A 1
        B 1
        C 1
        D 1
        E 0

Here, failed = B and cond = false, so the undo routine will unlock A.
Hmm, seems to work in this case (a small userspace model of this walk is
sketched at the end of this mail).

But... then A's oom_lock == 0, and memcg_oom_wakeup() et al. will not be
able to know that "A" is in OOM and wake up the processes in A which are
waiting for oom recovery. Will this work?

==
# cgcreate -g memory:A
# cgset -r memory.use_hierarchy=1 A
# cgset -r memory.oom_control=1 A
# cgset -r memory.limit_in_bytes=100M A
# cgset -r memory.memsw.limit_in_bytes=100M A
# cgcreate -g memory:A/B
# cgset -r memory.oom_control=1 A/B
# cgset -r memory.limit_in_bytes=20M A/B
# cgset -r memory.memsw.limit_in_bytes=20M A/B

Assume malloc XXX is a program allocating XXX Megabytes of memory (a
minimal sketch of such a helper is included at the end of this mail).

# cgexec -g memory:A/B malloc 30 &   # -> this will be blocked by OOM of group B
# cgexec -g memory:A malloc 80 &     # -> this will be blocked by OOM of group A

Here, 2 procs are blocked by OOM. Now relax A's limitation and clear the OOM:

# cgset -r memory.memsw.limit_in_bytes=300M A
# cgset -r memory.limit_in_bytes=300M A

malloc 80 will end.

Thanks,
-Kame

>
> -        if (lock_count == 1)
> -                return true;
> -        return false;
> +        if (!failed)
> +                goto done;
> +
> +        /*
> +         * OK, we failed to lock the whole subtree so we have to clean up
> +         * what we set up to the failing subtree
> +         */
> +        cond = true;
> +        for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> +                if (iter == failed) {
> +                        cond = false;
> +                        continue;
> +                }
> +                atomic_set(&iter->oom_lock, 0);
> +        }
> +done:
> +        return lock_count;
>  }
>
> +/*
> + * Has to be called with memcg_oom_mutex
> + */
>  static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
>  {
>          struct mem_cgroup *iter;
> @@ -1916,7 +1945,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>                  finish_wait(&memcg_oom_waitq, &owait.wait);
>          }
>          mutex_lock(&memcg_oom_mutex);
> -        mem_cgroup_oom_unlock(mem);
> +        if (locked)
> +                mem_cgroup_oom_unlock(mem);
>          memcg_wakeup_oom(mem);
>          mutex_unlock(&memcg_oom_mutex);
>
> --
> 1.7.5.4
>
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
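For reference, a small standalone userspace model of the locking walk
discussed above. This is only a sketch: add_unless() and the arrays are
stand-ins invented for illustration (they mimic
!!atomic_add_unless(&v, 1, 1) and the A-B-C-D-E hierarchy with B-C-D
already locked); none of these names come from the kernel sources.

#include <stdio.h>

/* groups in pre-order, as for_each_mem_cgroup_tree would visit them */
static const char *name[] = { "A", "B", "C", "D", "E" };
static int oom_lock[]     = {  0,   1,   1,   1,   0  }; /* B-C-D locked */

/* userspace stand-in for atomic_add_unless(v, a, u): add a unless the
 * value is u, return non-zero if the add was done */
static int add_unless(int *v, int a, int u)
{
        if (*v == u)
                return 0;
        *v += a;
        return 1;
}

static void dump(const char *what)
{
        int i;

        printf("%s:", what);
        for (i = 0; i < 5; i++)
                printf(" %s %d", name[i], oom_lock[i]);
        printf("\n");
}

int main(void)
{
        int i, x, lock_count = -1, failed = -1;

        dump("Before lock");
        for (i = 0; i < 5; i++) {
                x = !!add_unless(&oom_lock[i], 1, 1);
                if (lock_count == -1)
                        lock_count = x;
                else if (lock_count != x) {
                        lock_count = 0;
                        failed = i;     /* first conflict: B */
                        break;          /* cond = false */
                }
        }
        dump("After lock");             /* A 1 B 1 C 1 D 1 E 0 */

        /* undo: clear everything visited before the failure point */
        for (i = 0; failed >= 0 && i < failed; i++)
                oom_lock[i] = 0;
        dump("After undo");             /* only A gets unlocked again */

        return 0;
}

Running it prints the same Before/After states as the tables above, with
only A cleared by the undo pass.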
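And a minimal sketch of the "malloc XXX" helper assumed in the
reproduction steps (the original mail does not show it): allocate the
requested number of megabytes and touch every page so the memcg is really
charged. With oom_control=1 the task hangs in the page fault once the
limit is hit and only finishes after the limit is relaxed.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
        size_t mb, sz;
        char *buf;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <megabytes>\n", argv[0]);
                return 1;
        }
        mb = strtoul(argv[1], NULL, 10);
        sz = mb << 20;

        buf = malloc(sz);
        if (!buf)
                return 1;
        memset(buf, 1, sz);     /* fault every page in; blocks here on OOM */
        printf("%zu MB allocated\n", mb);
        free(buf);
        return 0;
}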