On Tue, 2 Mar 2010 22:41:42 +0530 Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> [2010-03-02 11:58:34]:
>
> > Brief Summary (for Andrew)
> >
> > - Nishimura reported that my fix (one year ago)
> >   a636b327f731143ccc544b966cfd8de6cb6d72c6
> >   doesn't work well in some extreme situations.
> >
> > - David Rientjes said mem_cgroup_oom_called() is completely
> >   ugly and broken and.....
> >   And he tries to remove it in his patch set.
> >
> > Then, I wrote this as a bugfix on top of mmotm. This patch implements
> > - a per-memcg OOM lock, similar to the per-zone OOM lock
> > - avoiding a return of -ENOMEM via memcg's page fault path.
> >   ENOMEM causes an unnecessary page_fault_out_of_memory().
> >   (Even if memcg hangs, there is no change from current behavior.)
> > - in addition to MEMDIE threads, KILLED processes bypass the memcg charge.
> >
> > I'd be glad if this went into the 2.6.34 timeline (as a bugfix). But I'm
> > afraid this seems too big as a bugfix...
> >
> > My plans for 2.6.35 are
> > - oom-notifier for memcg (based on the memcg threshold notifier)
> > - oom-freezer (disable oom-kill) for memcg
> > - better handling in extreme situations.
> > And now, Andrea Righi is working on dirty_ratio for memcg. We'll have
> > something better in the 2.6.35 kernels.
> >
> > This patch will conflict with David's set. Then, if it conflicts in mmotm,
> > I'll rework.
> >
>
> Hi, Kamezawa-San,
>
> Some review comments below.
>
> > Tested on x86-64. Nishimura-san, could you test ?
> >
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> >
> > In the current page-fault code,
> >
> > 	handle_mm_fault()
> > 		-> ...
> > 		-> mem_cgroup_charge()
> > 		-> map page or handle error.
> > 	-> check return code.
> >
> > If the page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory()
> > is called. But if it's caused by memcg, OOM should already have been
> > invoked.
> > Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6
> >
> > That patch records last_oom_jiffies for memcg's sub-hierarchy and
> > prevents page_fault_out_of_memory from being invoked in the near future.
> >
> > But Nishimura-san reported that the jiffies check is not enough
> > when the system is terribly heavy.
> >
> > This patch changes memcg's oom logic as follows:
> > * If memcg causes an OOM-kill, continue to retry.
> > * Remove the jiffies check which is used now.
>
> I like this very much!
>
> > * Add a memcg-oom-lock which works like the per-zone oom lock.
> > * If current is killed (as a process), bypass the charge.
> >
> > Something more sophisticated can be added, but this patch does the
> > fundamental things.
> > TODO:
> >  - add an oom notifier
> >  - add a per-memcg disable-oom-kill flag and a freezer at oom.
> >  - more chances to wake up the oom waiter (when changing the memory limit etc.)
> >
> > Changelog:
> >  - fixed per-memcg oom lock.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> > ---
> >  include/linux/memcontrol.h |    6 --
> >  mm/memcontrol.c            |  109 +++++++++++++++++++++++++++++++++------------
> >  mm/oom_kill.c              |    8 ---
> >  3 files changed, 82 insertions(+), 41 deletions(-)
> >
> > Index: mmotm-2.6.33-Feb11/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-2.6.33-Feb11.orig/include/linux/memcontrol.h
> > +++ mmotm-2.6.33-Feb11/include/linux/memcontrol.h
> > @@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(v
> >  	return false;
> >  }
> >
> > -extern bool mem_cgroup_oom_called(struct task_struct *task);
> >  void mem_cgroup_update_file_mapped(struct page *page, int val);
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >  						gfp_t gfp_mask, int nid,
> > @@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(v
> >  	return true;
> >  }
> >
> > -static inline bool mem_cgroup_oom_called(struct task_struct *task)
> > -{
> > -	return false;
> > -}
> > -
> >  static inline int
> >  mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
> >  {
> > Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
> > +++ mmotm-2.6.33-Feb11/mm/memcontrol.c
> > @@ -200,7 +200,7 @@ struct mem_cgroup {
> >  	 * Should the accounting and control be hierarchical, per subtree?
> >  	 */
> >  	bool use_hierarchy;
> > -	unsigned long last_oom_jiffies;
> > +	atomic_t oom_lock;
> >  	atomic_t refcnt;
> >
> >  	unsigned int swappiness;
> > @@ -1234,32 +1234,77 @@ static int mem_cgroup_hierarchical_recla
> >  	return total;
> >  }
> >
> > -bool mem_cgroup_oom_called(struct task_struct *task)
> > +static int mem_cgroup_oom_lock_cb(struct mem_cgroup *mem, void *data)
> >  {
> > -	bool ret = false;
> > -	struct mem_cgroup *mem;
> > -	struct mm_struct *mm;
> > +	int *val = (int *)data;
> > +	int x;
> >
> > -	rcu_read_lock();
> > -	mm = task->mm;
> > -	if (!mm)
> > -		mm = &init_mm;
> > -	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> > -	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> > -		ret = true;
> > -	rcu_read_unlock();
> > -	return ret;
> > +	x = atomic_inc_return(&mem->oom_lock);
> > +	if (x > *val)
> > +		*val = x;
>
> Use the max_t function here?
> x = max_t(int, x, *val);
>
Sure.

> > +	return 0;
> > +}
> > +/*
> > + * Check OOM-Killer is already running under our hierarchy.
> > + * If someone is running, return false.
> > + */
> > +static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> > +{
> > +	int check = 0;
> > +
> > +	mem_cgroup_walk_tree(mem, &check, mem_cgroup_oom_lock_cb);
> > +
> > +	if (check == 1)
> > +		return true;
> > +	return false;
> >  }
> >
> > -static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
> > +static int mem_cgroup_oom_unlock_cb(struct mem_cgroup *mem, void *data)
> >  {
> > -	mem->last_oom_jiffies = jiffies;
> > +	atomic_dec(&mem->oom_lock);
> >  	return 0;
> >  }
> >
> > -static void record_last_oom(struct mem_cgroup *mem)
> > +static void mem_cgroup_oom_unlock(struct mem_cgroup *mem)
> >  {
> > -	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> > +	mem_cgroup_walk_tree(mem, NULL, mem_cgroup_oom_unlock_cb);
> > +}
> > +
> > +static DEFINE_MUTEX(memcg_oom_mutex);
> > +static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
> > +
> > +/*
> > + * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> > + */
> > +bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
> > +{
> > +	DEFINE_WAIT(wait);
> > +	bool locked;
> > +
> > +	prepare_to_wait(&memcg_oom_waitq, &wait, TASK_INTERRUPTIBLE);
> > +	/* At first, try to OOM lock hierarchy under mem. */
> > +	mutex_lock(&memcg_oom_mutex);
> > +	locked = mem_cgroup_oom_lock(mem);
> > +	mutex_unlock(&memcg_oom_mutex);
> > +
> > +	if (locked) {
> > +		finish_wait(&memcg_oom_waitq, &wait);
> > +		mem_cgroup_out_of_memory(mem, mask);
> > +	} else {
> > +		schedule();
> > +		finish_wait(&memcg_oom_waitq, &wait);
> > +	}
> > +	mutex_lock(&memcg_oom_mutex);
> > +	mem_cgroup_oom_unlock(mem);
> > +	/* TODO: more fine grained waitq ? */
> > +	wake_up_all(&memcg_oom_waitq);
>
> I was wondering if we should really wake up all? Shouldn't this be per
> memcg? The waitq that is, since the check is per memcg, the wakeup
> should also be per memcg.
>

The difficulty with a per-memcg waitq comes from the hierarchy.

Assume the following hierarchy: A and its children, 01 and 02.

	A/	<==== under OOM #2
	 01	<==== under OOM #1
	 02

And the OOM happens in the following sequence:
  1. 01 goes to OOM (#1)
  2. A  goes to OOM (#2)

Because an oom-kill in group 01 can fix both the OOM under A and the OOM
under 01, oom-kill under A and oom-kill under 01 have to be mutually
exclusive. When the OOM under 01 finishes and wakes up its waiters, a
per-memcg waitq would give us no way to wake up the waiters on A. That's
the reason I used a system-wide waitq.

I think there is no big problem. But I hope someone finds new magic for
doing the logically correct thing; that's why I added a TODO.

I'll add some comments.

Thanks,
-Kame
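
For readers following the locking scheme above, here is a minimal userspace
sketch (plain C11, not kernel code; the fixed two-level tree, the recursive
walk and all names are simplified stand-ins for mem_cgroup_walk_tree() and
the oom_lock callbacks in the patch). It replays the hierarchy example: 01
hits OOM first and takes the lock, so A's later attempt sees a count greater
than 1 and has to wait.

/*
 * Userspace sketch of the per-memcg OOM lock counting scheme.
 * Every group keeps an atomic oom_lock counter; taking the lock means
 * incrementing the counter on the group and all of its descendants and
 * checking that the maximum value seen is 1.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct memcg {
	const char *name;
	atomic_int oom_lock;
	struct memcg *children[2];
	int nr_children;
};

/* Walk @mem and its descendants; mirror mem_cgroup_oom_lock_cb(). */
static void walk_tree(struct memcg *mem, int *max)
{
	int x = atomic_fetch_add(&mem->oom_lock, 1) + 1;

	if (x > *max)
		*max = x;
	for (int i = 0; i < mem->nr_children; i++)
		walk_tree(mem->children[i], max);
}

static bool oom_lock(struct memcg *mem)
{
	int check = 0;

	walk_tree(mem, &check);
	/* check == 1: nobody else is OOM-killing in this sub-hierarchy. */
	return check == 1;
}

static void oom_unlock(struct memcg *mem)
{
	atomic_fetch_sub(&mem->oom_lock, 1);
	for (int i = 0; i < mem->nr_children; i++)
		oom_unlock(mem->children[i]);
}

int main(void)
{
	/* Hierarchy from the mail: A is the parent of 01 and 02. */
	struct memcg c01 = { .name = "01" };
	struct memcg c02 = { .name = "02" };
	struct memcg A   = { .name = "A",
			     .children = { &c01, &c02 }, .nr_children = 2 };

	printf("01 takes the oom lock: %d\n", oom_lock(&c01)); /* 1: first OOM  */
	printf("A  takes the oom lock: %d\n", oom_lock(&A));   /* 0: 01 is busy */

	oom_unlock(&A);
	oom_unlock(&c01);
	return 0;
}

Built with any C11 compiler (e.g. gcc -std=c11), it prints 1 for 01 and 0 for
A, which is exactly the mutual exclusion between a child's OOM and its
parent's OOM described above.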