On Wed, 24 Feb 2010 16:59:21 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> These are a dump of patches just to show the concept of what I want to do.
> Not tested yet; please take a look if you have free time. (You can ignore them ;)
>
> Anyway, this will HUNK against the latest mmotm, where Kirill's work is merged.
>
> This is not related to David's work. I don't hesitate to rebase mine
> onto mmotm if his is merged; that's easy.
> But I'm not sure his will go into -mm soon.
>
> The 1st patch is for better handling of oom-kill under memcg.

It's bigger than I expected, but it basically looks good to me.

> The 2nd patch adds oom-notifier and oom-kill-disable knobs for memcg.

This feature is very attractive. One comment on this patch for now.

> +/*
> + * Check whether there is an ongoing oom-kill in this hierarchy or not.
> + * If now under oom-kill, wait for some event to restart the job.
> + */
> +static bool memcg_handle_oom(struct mem_cgroup *mem, gfp_t mask)
> +{
> +	int oom_count = 0;
> +	DEFINE_WAIT(wait);
> +	/*
> +	 * Considering the hierarchy (below)
> +	 *	/A
> +	 *	  /01
> +	 *	  /02
> +	 * If 01 or 02 is under oom-kill, oom-kill in A should wait.
> +	 * If "A" is under oom-kill, oom-kill in 01 and 02 should wait.
> +	 * (A task in 01/02 can be killed.)
> +	 * But if 01 is under oom-kill, 02's oom-kill is independent of it.
> +	 */
> +	prepare_to_wait(&memcg_oom_waitq, &wait, TASK_INTERRUPTIBLE);
> +	mem_cgroup_walk_tree(mem, &oom_count, set_memcg_oom_cb);
> +	/* Am I the 1st oom killer in this sub hierarchy ? */
> +	if (oom_count == 1) {
> +		finish_wait(&memcg_oom_waitq, &wait);
> +		mem_cgroup_out_of_memory(mem, mask);
> +		mem_cgroup_walk_tree(mem, NULL, unset_memcg_oom_cb);

I think we need to call memcg_oom_wake() here. Some contexts might have gone to sleep already, and the other callers of memcg_oom_wake() call it only after checking memcg_under_oom(), so if we don't wake them up here, they will continue to sleep, IIUC. (A sketch of what I mean follows the quoted changelog below.)

Thanks,
Daisuke Nishimura.

> I'm sorry that I'll be absent tomorrow.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>
> This is an updated version of the oom handling improvement for memcg,
> but all the code has been totally renewed. It may not be well polished,
> but it is enough to show the idea.
>
> This patch does the following things.
> * Set "memcg is under OOM" when someone gets into OOM under a memcg.
>   Like the zone's OOM lock, the tree of memcgs is marked as under OOM.
>   By this, simultaneous OOM kills in a tree will not happen.
>
> * When another thread tries to reclaim memory or call oom-kill, it
>   checks whether its own target memcg is under oom. If someone has
>   already called the oom-killer, the thread is queued on a waitq.
>
> * On events which make room for new memory, threads on the waitq
>   are woken up:
>   ** a page (or a chunk of pages) is uncharged,
>   ** a task is moved,
>   ** the limit is enlarged.
>
> This patch also allows checking whether current's memcg has changed
> while charging.
>
> Considering what an admin/daemon can do when it notices OOM:
> * kill a process
> * move a process (to another cgroup which has free room)
> * remove a file (on tmpfs or the like)
> * enlarge the limit
> I think all the chances for waking up waiters are covered by these.
>
> After this patch, memcg's accounting will not fail in the usual path.
> If all tasks are OOM_DISABLE, the memcg may hang, but the admin has the
> several options described above. So, oom notifier + freeze should be
> implemented.
>
> TODO (maybe not difficult):
> * Add an oom notifier. (Can memory.threshold be reused?)
> * Add a switch for oom-freeze rather than oom-kill.
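To make my comment above concrete, here is an untested sketch of memcg_handle_oom() with the missing wake-up added on the oom-killer path. It reuses the names from this patch; the exact placement of the wake-up is just my reading of the problem:

static bool memcg_handle_oom(struct mem_cgroup *mem, gfp_t mask)
{
	int oom_count = 0;
	DEFINE_WAIT(wait);

	prepare_to_wait(&memcg_oom_waitq, &wait, TASK_INTERRUPTIBLE);
	mem_cgroup_walk_tree(mem, &oom_count, set_memcg_oom_cb);
	/* Am I the 1st oom killer in this sub hierarchy ? */
	if (oom_count == 1) {
		finish_wait(&memcg_oom_waitq, &wait);
		mem_cgroup_out_of_memory(mem, mask);
		mem_cgroup_walk_tree(mem, NULL, unset_memcg_oom_cb);
		/*
		 * Tasks that saw memcg_under_oom() and went to sleep
		 * before unset_memcg_oom_cb() ran would otherwise stay
		 * asleep until the next uncharge/move/limit event.
		 */
		memcg_oom_wake();
	} else {
		schedule();
		finish_wait(&memcg_oom_waitq, &wait);
	}
	if (test_thread_flag(TIF_MEMDIE))
		return false;
	return true;
}

With that, a task which went to sleep after seeing memcg_under_oom() return true is woken as soon as the kill finishes, rather than waiting for the next uncharge/move/limit event.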
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: Balbir Singh <balbir@xxxxxxxxxx>
> Cc: Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> ---
>  include/linux/memcontrol.h |    6 -
>  mm/memcontrol.c            |  208 +++++++++++++++++++++++++++++++++++----------
>  mm/oom_kill.c              |   11 --
>  3 files changed, 167 insertions(+), 58 deletions(-)
>
> Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Feb11/mm/memcontrol.c
> @@ -200,7 +200,6 @@ struct mem_cgroup {
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
>  	bool use_hierarchy;
> -	unsigned long last_oom_jiffies;
>  	atomic_t refcnt;
>
>  	unsigned int swappiness;
> @@ -223,6 +222,8 @@ struct mem_cgroup {
>  	 */
>  	unsigned long move_charge_at_immigrate;
>
> +	/* counting ongoing OOM requests under the sub hierarchy */
> +	atomic_t oom_count;
>  	/*
>  	 * percpu counter.
>  	 */
> @@ -1096,6 +1097,89 @@ done:
>  }
>
>  /*
> + * set/unset of the memcg_oom count is done under a mutex.
> + */
> +static DEFINE_MUTEX(memcg_oom_mutex);
> +static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
> +
> +static int set_memcg_oom_cb(struct mem_cgroup *mem, void *data)
> +{
> +	int *max_count = (int *)data;
> +	int count = atomic_inc_return(&mem->oom_count);
> +	if (*max_count < count)
> +		*max_count = count;
> +	return 0;
> +}
> +
> +static int unset_memcg_oom_cb(struct mem_cgroup *mem, void *data)
> +{
> +	atomic_set(&mem->oom_count, 0);
> +	return 0;
> +}
> +
> +static bool memcg_under_oom(struct mem_cgroup *mem)
> +{
> +	if (atomic_read(&mem->oom_count))
> +		return true;
> +	return false;
> +}
> +
> +static void memcg_oom_wait(struct mem_cgroup *mem)
> +{
> +	DEFINE_WAIT(wait);
> +
> +	prepare_to_wait(&memcg_oom_waitq, &wait, TASK_INTERRUPTIBLE);
> +	if (memcg_under_oom(mem))
> +		schedule();
> +	finish_wait(&memcg_oom_waitq, &wait);
> +}
> +
> +static void memcg_oom_wake(void)
> +{
> +	/* This may wake up unnecessary tasks, but that's not a big problem */
> +	wake_up_all(&memcg_oom_waitq);
> +}
> +
> +/*
> + * Check whether there is an ongoing oom-kill in this hierarchy or not.
> + * If now under oom-kill, wait for some event to restart the job.
> + */
> +static bool memcg_handle_oom(struct mem_cgroup *mem, gfp_t mask)
> +{
> +	int oom_count = 0;
> +	DEFINE_WAIT(wait);
> +	/*
> +	 * Considering the hierarchy (below)
> +	 *	/A
> +	 *	  /01
> +	 *	  /02
> +	 * If 01 or 02 is under oom-kill, oom-kill in A should wait.
> +	 * If "A" is under oom-kill, oom-kill in 01 and 02 should wait.
> +	 * (A task in 01/02 can be killed.)
> +	 * But if 01 is under oom-kill, 02's oom-kill is independent of it.
> +	 */
> +	prepare_to_wait(&memcg_oom_waitq, &wait, TASK_INTERRUPTIBLE);
> +	mem_cgroup_walk_tree(mem, &oom_count, set_memcg_oom_cb);
> +	/* Am I the 1st oom killer in this sub hierarchy ? */
> +	if (oom_count == 1) {
> +		finish_wait(&memcg_oom_waitq, &wait);
> +		mem_cgroup_out_of_memory(mem, mask);
> +		mem_cgroup_walk_tree(mem, NULL, unset_memcg_oom_cb);
> +	} else {
> +		/*
> +		 * Wakeup is called when
> +		 * 1. pages are uncharged (by a kill, or removal of a file),
> +		 * 2. the limit is enlarged,
> +		 * 3. a task is moved.
> +		 */
> +		schedule();
> +		finish_wait(&memcg_oom_waitq, &wait);
> +	}
> +	if (test_thread_flag(TIF_MEMDIE))
> +		return false;
> +	return true;
> +}
> +
> +/*
>   * This function returns the number of memcg under hierarchy tree. Returns
>   * 1 (self count) if no children.
>   */
> @@ -1234,34 +1318,6 @@ static int mem_cgroup_hierarchical_recla
>  	return total;
>  }
>
> -bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> -	bool ret = false;
> -	struct mem_cgroup *mem;
> -	struct mm_struct *mm;
> -
> -	rcu_read_lock();
> -	mm = task->mm;
> -	if (!mm)
> -		mm = &init_mm;
> -	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> -	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> -		ret = true;
> -	rcu_read_unlock();
> -	return ret;
> -}
> -
> -static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
> -{
> -	mem->last_oom_jiffies = jiffies;
> -	return 0;
> -}
> -
> -static void record_last_oom(struct mem_cgroup *mem)
> -{
> -	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> -}
> -
>  /*
>   * Currently used to update mapped file statistics, but the routine can be
>   * generalized to update other statistics as well.
> @@ -1419,6 +1475,7 @@ static int __cpuinit memcg_stock_cpu_cal
>  	return NOTIFY_OK;
>  }
>
> +
>  /*
>   * Unlike exported interface, "oom" parameter is added. if oom==true,
>   * oom-killer can be invoked.
> @@ -1427,17 +1484,21 @@ static int __mem_cgroup_try_charge(struc
>  			gfp_t gfp_mask, struct mem_cgroup **memcg,
>  			bool oom, struct page *page)
>  {
> -	struct mem_cgroup *mem, *mem_over_limit;
> -	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	struct mem_cgroup *mem, *mem_over_limit, *recorded;
> +	int nr_retries, csize;
>  	struct res_counter *fail_res;
> -	int csize = CHARGE_SIZE;
> +
> +start:
> +	nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	recorded = *memcg;
> +	csize = CHARGE_SIZE;
> +	mem = NULL;
>
>  	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
>  		/* Don't account this! */
>  		*memcg = NULL;
>  		return 0;
>  	}
> -
>  	/*
>  	 * We always charge the cgroup the mm_struct belongs to.
>  	 * The mm_struct's mem_cgroup changes on task migration if the
> @@ -1489,6 +1550,12 @@ static int __mem_cgroup_try_charge(struc
>  	}
>  	if (!(gfp_mask & __GFP_WAIT))
>  		goto nomem;
> +	/* already in OOM ? */
> +	if (memcg_under_oom(mem_over_limit)) {
> +		/* Don't add too much pressure to the host */
> +		memcg_oom_wait(mem_over_limit);
> +		goto retry;
> +	}
>
>  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
>  						gfp_mask, flags);
> @@ -1549,11 +1616,15 @@ static int __mem_cgroup_try_charge(struc
>  	}
>
>  	if (!nr_retries--) {
> -		if (oom) {
> -			mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> -			record_last_oom(mem_over_limit);
> -		}
> -		goto nomem;
> +
> +		if (!oom)
> +			goto nomem;
> +		/* returns false if current is killed */
> +		if (memcg_handle_oom(mem_over_limit, gfp_mask))
> +			goto retry;
> +		/* For smooth oom-kill of current, return 0 */
> +		css_put(&mem->css);
> +		return 0;
>  		}
>  	}
>  	if (csize > PAGE_SIZE)
> @@ -1572,6 +1643,15 @@ done:
>  nomem:
>  	css_put(&mem->css);
>  	return -ENOMEM;
> +
> +retry:
> +	/*
> +	 * current's mem_cgroup can be moved while we're waiting for
> +	 * memory reclaim or OOM-kill.
> +	 */
> +	*memcg = recorded;
> +	css_put(&mem->css);
> +	goto start;
>  }
>
>  /*
> @@ -1589,6 +1669,9 @@ static void __mem_cgroup_cancel_charge(s
>  		VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
>  		WARN_ON_ONCE(count > INT_MAX);
>  		__css_put(&mem->css, (int)count);
> +
> +		if (memcg_under_oom(mem))
> +			memcg_oom_wake();
>  	}
>  	/* we don't need css_put for root */
>  }
> @@ -2061,6 +2144,10 @@ direct_uncharge:
>  	res_counter_uncharge(&mem->res, PAGE_SIZE);
>  	if (uncharge_memsw)
>  		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> +	/* Slow path to check OOM waiters */
> +	if (!current->memcg_batch.do_batch || batch->memcg != mem)
> +		if (memcg_under_oom(mem))
> +			memcg_oom_wake();
>  	return;
>  }
>
> @@ -2200,6 +2287,9 @@ void mem_cgroup_uncharge_end(void)
>  		res_counter_uncharge(&batch->memcg->res, batch->bytes);
>  	if (batch->memsw_bytes)
>  		res_counter_uncharge(&batch->memcg->memsw, batch->memsw_bytes);
> +
> +	if (memcg_under_oom(batch->memcg))
> +		memcg_oom_wake();
>  	/* forget this pointer (for sanity check) */
>  	batch->memcg = NULL;
>  }
> @@ -2408,8 +2498,7 @@ void mem_cgroup_end_migration(struct mem
>
>  /*
>   * A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> - * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
> + * Calling hierarchical_reclaim is not enough because we have to handle oom-kill.
>   * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
>   * not from the memcg which this page would be charged to.
>   * try_charge_swapin does all of these works properly.
> @@ -2440,7 +2529,8 @@ static int mem_cgroup_resize_limit(struc
>  	u64 memswlimit;
>  	int ret = 0;
>  	int children = mem_cgroup_count_children(memcg);
> -	u64 curusage, oldusage;
> +	u64 curusage, oldusage, curlimit;
> +	int enlarge = 0;
>
>  	/*
>  	 * For keeping hierarchical_reclaim simple, how long we should retry
> @@ -2451,6 +2541,7 @@ static int mem_cgroup_resize_limit(struc
>
>  	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>
> +
>  	while (retry_count) {
>  		if (signal_pending(current)) {
>  			ret = -EINTR;
> @@ -2468,6 +2559,9 @@ static int mem_cgroup_resize_limit(struc
>  			mutex_unlock(&set_limit_mutex);
>  			break;
>  		}
> +		curlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> +		if (curlimit < val)
> +			enlarge = 1;
>  		ret = res_counter_set_limit(&memcg->res, val);
>  		if (!ret) {
>  			if (memswlimit == val)
> @@ -2477,8 +2571,20 @@ static int mem_cgroup_resize_limit(struc
>  		}
>  		mutex_unlock(&set_limit_mutex);
>
> -		if (!ret)
> +		if (!ret) {
> +			/*
> +			 * If we enlarge the limit of a memcg under OOM,
> +			 * wake up the waiters.
> +			 */
> +			if (enlarge && memcg_under_oom(memcg))
> +				memcg_oom_wake();
> +			break;
> +		}
> +		/* Under OOM? If so, don't add more pressure.
> +		 */
> +		if (memcg_under_oom(memcg)) {
> +			ret = -EBUSY;
>  			break;
> +		}
>
>  		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
>  						MEM_CGROUP_RECLAIM_SHRINK);
> @@ -2497,9 +2603,10 @@ static int mem_cgroup_resize_memsw_limit
>  						unsigned long long val)
>  {
>  	int retry_count;
> -	u64 memlimit, oldusage, curusage;
> +	u64 memlimit, oldusage, curusage, curlimit;
>  	int children = mem_cgroup_count_children(memcg);
>  	int ret = -EBUSY;
> +	int enlarge;
>
>  	/* see mem_cgroup_resize_res_limit */
>  	retry_count = children * MEM_CGROUP_RECLAIM_RETRIES;
> @@ -2521,6 +2628,9 @@ static int mem_cgroup_resize_memsw_limit
>  			mutex_unlock(&set_limit_mutex);
>  			break;
>  		}
> +		curlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> +		if (curlimit < val)
> +			enlarge = 1;
>  		ret = res_counter_set_limit(&memcg->memsw, val);
>  		if (!ret) {
>  			if (memlimit == val)
> @@ -2530,8 +2640,15 @@ static int mem_cgroup_resize_memsw_limit
>  		}
>  		mutex_unlock(&set_limit_mutex);
>
> -		if (!ret)
> +		if (!ret) {
> +			if (enlarge && memcg_under_oom(memcg))
> +				memcg_oom_wake();
>  			break;
> +		}
> +		if (memcg_under_oom(memcg)) {
> +			ret = -EBUSY;
> +			continue;
> +		}
>
>  		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
>  						MEM_CGROUP_RECLAIM_NOSWAP |
> @@ -3859,6 +3976,9 @@ one_by_one:
>  			ret = -EINTR;
>  			break;
>  		}
> +		/* Undo precharges if there is an ongoing OOM */
> +		if (memcg_under_oom(mem))
> +			return -ENOMEM;
>  		if (!batch_count--) {
>  			batch_count = PRECHARGE_COUNT_AT_ONCE;
>  			cond_resched();
> Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
> +++ mmotm-2.6.33-Feb11/mm/oom_kill.c
> @@ -487,6 +487,9 @@ retry:
>  		goto retry;
>  out:
>  	read_unlock(&tasklist_lock);
> +	/* give the selected process a chance to die */
> +	if (!test_thread_flag(TIF_MEMDIE))
> +		schedule_timeout_uninterruptible(1);
>  }
>  #endif
>
> @@ -601,13 +604,6 @@ void pagefault_out_of_memory(void)
>  		/* Got some memory back in the last second. */
>  		return;
>
> -	/*
> -	 * If this is from memcg, oom-killer is already invoked.
> -	 * and not worth to go system-wide-oom.
> -	 */
> -	if (mem_cgroup_oom_called(current))
> -		goto rest_and_return;
> -
>  	if (sysctl_panic_on_oom)
>  		panic("out of memory from page fault. panic_on_oom is selected.\n");
>
> @@ -619,7 +615,6 @@ void pagefault_out_of_memory(void)
>  	 * Give "p" a good chance of killing itself before we
>  	 * retry to allocate memory.
>  	 */
> -rest_and_return:
>  	if (!test_thread_flag(TIF_MEMDIE))
>  		schedule_timeout_uninterruptible(1);
>  }
> Index: mmotm-2.6.33-Feb11/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.33-Feb11/include/linux/memcontrol.h
> @@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(v
>  	return false;
>  }
>
> -extern bool mem_cgroup_oom_called(struct task_struct *task);
>  void mem_cgroup_update_file_mapped(struct page *page, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
> @@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(v
>  	return true;
>  }
>
> -static inline bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> -	return false;
> -}
> -
>  static inline int
>  mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>  {
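As an aside, for anyone trying to convince themselves of the hierarchy rule in the memcg_handle_oom() comment, the oom_count scheme can be modeled in plain userspace C. This is toy code, not from the patch; the tree layout, the names, and the recursive walk are made up to mimic mem_cgroup_walk_tree() + set_memcg_oom_cb():

#include <stdio.h>

struct memcg {
	const char *name;
	struct memcg *child;	/* first child */
	struct memcg *sibling;	/* next sibling */
	int oom_count;		/* atomic_t in the real patch */
};

/*
 * Bump the counter on every memcg in the sub-tree rooted at "mem"
 * and track the maximum value seen, as set_memcg_oom_cb() does.
 */
static void mark_oom(struct memcg *mem, int *max)
{
	struct memcg *ch;

	if (++mem->oom_count > *max)
		*max = mem->oom_count;
	for (ch = mem->child; ch; ch = ch->sibling)
		mark_oom(ch, max);
}

int main(void)
{
	/* hierarchy: /A with children /A/01 and /A/02 */
	struct memcg g02 = { "02", NULL, NULL, 0 };
	struct memcg g01 = { "01", NULL, &g02, 0 };
	struct memcg a   = { "A",  &g01, NULL, 0 };
	int max;

	max = 0;
	mark_oom(&g01, &max);	/* 01 hits OOM first */
	printf("01 sees max %d -> first killer, runs oom-kill\n", max);

	max = 0;
	mark_oom(&g02, &max);	/* 02 hits OOM while 01 is busy */
	printf("02 sees max %d -> independent of 01, runs too\n", max);

	max = 0;
	mark_oom(&a, &max);	/* A hits OOM while 01/02 are busy */
	printf("A  sees max %d -> not 1, must wait\n", max);
	return 0;
}

This prints 1, 1, 2: because each OOMing memcg only walks its own sub-tree, 01 and 02 never see each other's count, while A's walk crosses both of theirs, which is exactly the serialization rule described in the comment.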