On Thu, 14 Jul 2011 10:02:59 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:

> On Wed, 13 Jul 2011 13:05:49 +0200
> Michal Hocko <mhocko@xxxxxxx> wrote:
>
> > 867578cb "memcg: fix oom kill behavior" introduced an oom_lock counter
> > which is incremented by mem_cgroup_oom_lock when we are about to
> > handle a memcg OOM situation. mem_cgroup_handle_oom falls back to a
> > sleep if oom_lock > 1 to prevent multiple oom kills at the same time.
> > The counter is then decremented by mem_cgroup_oom_unlock, called from
> > the same function.
> >
> > This works correctly, but it can lead to serious starvation when we
> > have many processes triggering OOM.
> >
> > Consider a process (call it A) which gets the oom_lock (the first one
> > that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex). All
> > other processes are blocked on the mutex.
> > While A releases the mutex and calls mem_cgroup_out_of_memory, the
> > others will wake up (one after another), increase the counter, and
> > fall asleep (memcg_oom_waitq). Once A finishes
> > mem_cgroup_out_of_memory, it takes the mutex again, decreases
> > oom_lock, and wakes other tasks (if releasing the memory of the
> > killed task hasn't done it yet).
> > The main problem here is that everybody still races for the mutex,
> > and there is no guarantee that the counter drops back to 0 for those
> > that got back to mem_cgroup_handle_oom. In the end the whole convoy
> > increments and decrements the counter but never reaches the value 1
> > that would enable killing, so nothing useful happens.
> > The time is basically unbounded because it highly depends on
> > scheduling and the ordering on the mutex.
> >
>
> Hmm, OK, I see the problem.
>
> > This patch replaces the counter by simple {un}lock semantics. We use
> > only 0 and 1 to distinguish those two states.
> > As mem_cgroup_oom_{un}lock works on the hierarchy, we have to make
> > sure that we cannot race with somebody else, which is already
> > guaranteed because we call both functions with the mutex held. All
> > other consumers just read the value atomically for a single group,
> > which is sufficient because we set the value atomically.
> > The other thing is that only the process which locked the oom will
> > unlock it once the OOM is handled.
> >
> > Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
> > ---
> >  mm/memcontrol.c |   24 +++++++++++++++++-------
> >  1 files changed, 17 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e013b8e..f6c9ead 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1803,22 +1803,31 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> >  /*
> >   * Check OOM-Killer is already running under our hierarchy.
> >   * If someone is running, return false.
> > + * Has to be called with memcg_oom_mutex
> >   */
> >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> >  {
> > -	int x, lock_count = 0;
> > +	int x, lock_count = -1;
> >  	struct mem_cgroup *iter;
> >
> >  	for_each_mem_cgroup_tree(iter, mem) {
> > -		x = atomic_inc_return(&iter->oom_lock);
> > -		lock_count = max(x, lock_count);
> > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > +		if (lock_count == -1)
> > +			lock_count = x;
> > +
>
>
> Hmm... Assume the following hierarchy:
>
>           A
>         B   C
>       D   E
>
> The original code handles the situation
>
>  1. B-D-E is under OOM
>  2. A enters OOM after 1.
>
> In the original code, A will not invoke OOM (because the B-D-E OOM
> will kill a process). With the new code, A will invoke a new
> OOM... right?
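To see this concretely, here is a minimal user-space sketch of the
proposed locking. The tree encoding, add_unless_one() and the pre-order
walk are simplified stand-ins for the kernel's atomics and
for_each_mem_cgroup_tree(), not real kernel code, and it assumes the
truncated hunk above still ends with "return lock_count == 1" as the
original did:

==
#include <stdio.h>
#include <stddef.h>

struct memcg {
	const char *name;
	int oom_lock;
	struct memcg *child[2];
};

/* stand-in for atomic_add_unless(&v, 1, 1): add 1 unless already 1 */
static int add_unless_one(int *v)
{
	if (*v == 1)
		return 0;
	*v += 1;
	return 1;
}

/* pre-order walk standing in for for_each_mem_cgroup_tree() */
static void walk_lock(struct memcg *m, int *lock_count)
{
	int i, x;

	if (!m)
		return;
	x = add_unless_one(&m->oom_lock);
	if (*lock_count == -1)	/* only the first result is kept */
		*lock_count = x;
	for (i = 0; i < 2; i++)
		walk_lock(m->child[i], lock_count);
}

static int oom_lock(struct memcg *m)
{
	int lock_count = -1;

	walk_lock(m, &lock_count);
	return lock_count == 1;
}

int main(void)
{
	struct memcg D = { "D", 0, { NULL, NULL } };
	struct memcg E = { "E", 0, { NULL, NULL } };
	struct memcg B = { "B", 0, { &D, &E } };
	struct memcg C = { "C", 0, { NULL, NULL } };
	struct memcg A = { "A", 0, { &B, &C } };

	/* 1. B-D-E goes under OOM first: B takes the lock */
	printf("B locks: %d\n", oom_lock(&B));	/* prints 1 */
	/* 2. A enters OOM after 1.: A itself is still 0, so it "wins" */
	printf("A locks: %d\n", oom_lock(&A));	/* prints 1 -> second kill */
	return 0;
}
==

With the original counter-based code, the second call would see
atomic_inc_return() return 2 on B, lock_count would become 2 via max(),
and A's lock attempt would return false, so the second kill was
suppressed.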
> I wonder about this kind of code:
> ==
> 	bool success = true;
> 	...
> 	for_each_mem_cgroup_tree(iter, mem) {
> 		success &= !!atomic_add_unless(&iter->oom_lock, 1, 1);
> 		/* "break"ing the loop is not allowed because of css refcount... */
> 	}
> 	return success.
> ==
> Then, one hierarchy can invoke one OOM kill within it.
> But this will not work because we can't do a proper unlock.
>
> Hm, how about this? This has only one lock point and we'll not see
> the BUG. Not tested yet..

Here, tested patch + test program. This seems to work well (a
user-space sketch of the new lock/unlock walk follows after the
patch).
==
From 8c878b3413b4d796844dbcb18fa7cfccf44860d7 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
Date: Thu, 14 Jul 2011 11:36:50 +0900
Subject: [PATCH] memcg: fix livelock at oom.

867578cb "memcg: fix oom kill behavior" introduced an oom_lock counter
which is incremented by mem_cgroup_oom_lock when we are about to handle
a memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
oom_lock > 1 to prevent multiple oom kills at the same time. The
counter is then decremented by mem_cgroup_oom_unlock, called from the
same function.

This works correctly, but it can lead to serious starvation when we
have many processes triggering OOM.

Example: make a memcg hierarchy which has a 300MB memory+swap limit.

  %cgcreate -g memory:A
  %cgset -r memory.use_hierarchy=1 A
  %cgset -r memory.limit_in_bytes=300M A
  %cgset -r memory.memsw.limit_in_bytes=300M A
  %cgcreate -g memory:A/B
  %cgcreate -g memory:A/C
  %cgcreate -g memory:A/B/X
  %cgcreate -g memory:A/B/Y

Then, run the following program under A/B/X.

  %cgexec -g memory:A/B/X ./fork
==
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
	int i;
	int status;

	for (i = 0; i < 5000; i++) {
		if (fork() == 0) {
			char *c;

			c = malloc(1024*1024);
			memset(c, 0, 1024*1024);
			sleep(20);
			fprintf(stderr, "[%d]\n", i);
			exit(0);
		}
		printf("%d\n", i);
		waitpid(-1, &status, WNOHANG);
	}
	while (1) {
		int ret;

		ret = waitpid(-1, &status, WNOHANG);
		if (ret == -1)
			break;
		if (!ret)
			sleep(1);
	}
	return 0;
}
==

This forks processes, and each child does malloc(1M) + memset. After
forking about 300 children, the memcg goes into OOM. The expected
behavior is that the oom-killer kills a process and the program makes
progress. But the 300 children all race for the oom_lock and the oom
kill... and the program seems not to make progress.

The reason is that memcg invokes the OOM killer only when the oom_lock
counter is 0, but with many processes running it never goes down to 0:
it is incremented again before all the processes woken up by the
previous oom kill (each of which decrements oom_lock) have quit their
sleep.

This patch fixes the behavior by giving the whole OOM'd hierarchy a
single lock owner, with oom_lock holding only the values 1/0. For
example, in the following hierarchy,

        A
       /
      B
     / \
    C   D

when C goes into OOM because of B's limit, set B->oom_lock=1. After
that, when A goes into OOM because of A's own limit, clear B->oom_lock
to 0 and set A->oom_lock=1. At unlock time, the ancestor which has
oom_lock=1 is found and cleared.

After this, the above program forks 5000 processes with 4000+
oom-kills.

Reported-by: Michal Hocko <mhocko@xxxxxxx>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>

Changelog:
 - fixed under_oom counter reset.
---
 mm/memcontrol.c |   77 +++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 53 insertions(+), 24 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..5f9661b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -246,7 +246,8 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
-	atomic_t oom_lock;
+	int oom_lock;
+	atomic_t under_oom;
 	atomic_t refcnt;
 
 	unsigned int swappiness;
@@ -1801,36 +1802,63 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 }
 
 /*
- * Check OOM-Killer is already running under our hierarchy.
+ * Check whether OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
-static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
+static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
 {
-	int x, lock_count = 0;
 	struct mem_cgroup *iter;
+	bool ret;
 
-	for_each_mem_cgroup_tree(iter, mem) {
-		x = atomic_inc_return(&iter->oom_lock);
-		lock_count = max(x, lock_count);
+	/*
+	 * If an ancestor (including this memcg) is the owner of the OOM
+	 * lock, return false.
+	 */
+	for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
+		if (iter->oom_lock)
+			break;
+		if (!iter->use_hierarchy) {
+			iter = NULL;
+			break;
+		}
 	}
-	if (lock_count == 1)
-		return true;
-	return false;
+	if (iter)
+		return false;
+	/*
+	 * Make the owner of the OOM lock the highest ancestor of the
+	 * hierarchy under OOM. IOW, move the children's OOM owner
+	 * information to this memcg if a child is the owner. In that
+	 * case, an OOM killer is already running and we return false,
+	 * but we still make this memcg the owner of the oom-lock.
+	 */
+	ret = true;
+	for_each_mem_cgroup_tree(iter, memcg) {
+		if (iter->oom_lock) {
+			iter->oom_lock = 0;
+			ret = false;
+		}
+		atomic_set(&iter->under_oom, 1);
+	}
+	/* Make this memcg the owner of the OOM lock. */
+	memcg->oom_lock = 1;
+	return ret;
 }
 
-static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
 {
-	struct mem_cgroup *iter;
+	struct mem_cgroup *iter, *iter2;
 
-	/*
-	 * When a new child is created while the hierarchy is under oom,
-	 * mem_cgroup_oom_lock() may not be called. We have to use
-	 * atomic_add_unless() here.
-	 */
-	for_each_mem_cgroup_tree(iter, mem)
-		atomic_add_unless(&iter->oom_lock, -1, 0);
-	return 0;
+	for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
+		if (iter->oom_lock) {
+			iter->oom_lock = 0;
+			break;
+		}
+	}
+	BUG_ON(!iter);
+
+	for_each_mem_cgroup_tree(iter2, iter)
+		atomic_set(&iter2->under_oom, 0);
+	return;
 }
 
@@ -1875,7 +1903,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
 
 static void memcg_oom_recover(struct mem_cgroup *mem)
 {
-	if (mem && atomic_read(&mem->oom_lock))
+	if (mem && atomic_read(&mem->under_oom))
 		memcg_wakeup_oom(mem);
 }
 
@@ -1916,7 +1944,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 	mutex_lock(&memcg_oom_mutex);
-	mem_cgroup_oom_unlock(mem);
+	if (locked)
+		mem_cgroup_oom_unlock(mem);
 	memcg_wakeup_oom(mem);
 	mutex_unlock(&memcg_oom_mutex);
 
@@ -4584,7 +4613,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
 	list_add(&event->list, &memcg->oom_notify);
 
 	/* already in OOM ? */
-	if (atomic_read(&memcg->oom_lock))
+	if (atomic_read(&memcg->under_oom))
 		eventfd_signal(eventfd, 1);
 	mutex_unlock(&memcg_oom_mutex);
 
@@ -4619,7 +4648,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 
 	cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
 
-	if (atomic_read(&mem->oom_lock))
+	if (atomic_read(&mem->under_oom))
 		cb->fill(cb, "under_oom", 1);
 	else
 		cb->fill(cb, "under_oom", 0);
-- 
1.7.4.1
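And here is the promised minimal user-space sketch of the lock/unlock
walk the patch implements. The types, the parent pointers, and the
recursive walk are simplified stand-ins for the kernel's
parent_mem_cgroup() and for_each_mem_cgroup_tree(), not the real code,
and the memcg_oom_mutex serialization is assumed rather than modeled:

==
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

struct memcg {
	const char *name;
	int oom_lock;		/* 1 only on the single lock owner */
	int under_oom;		/* 1 on every memcg in an OOM'd subtree */
	int use_hierarchy;
	struct memcg *parent;
	struct memcg *child[2];
};

/* pre-order walk standing in for for_each_mem_cgroup_tree() */
static void mark_subtree(struct memcg *m, int *ret)
{
	int i;

	if (!m)
		return;
	if (m->oom_lock) {	/* a descendant owned the lock: take over */
		m->oom_lock = 0;
		*ret = 0;	/* an OOM killer is already running */
	}
	m->under_oom = 1;
	for (i = 0; i < 2; i++)
		mark_subtree(m->child[i], ret);
}

static int oom_lock(struct memcg *memcg)
{
	struct memcg *iter;
	int ret = 1;

	/* if an ancestor (or this memcg) already owns the lock, fail */
	for (iter = memcg; iter; iter = iter->parent) {
		if (iter->oom_lock)
			return 0;
		if (!iter->use_hierarchy)
			break;
	}
	/* become the single owner for the whole subtree */
	mark_subtree(memcg, &ret);
	memcg->oom_lock = 1;
	return ret;
}

static void clear_under_oom(struct memcg *m)
{
	int i;

	if (!m)
		return;
	m->under_oom = 0;
	for (i = 0; i < 2; i++)
		clear_under_oom(m->child[i]);
}

static void oom_unlock(struct memcg *memcg)
{
	struct memcg *iter;

	/* find the owner: nearest ancestor (or self) with oom_lock set */
	for (iter = memcg; iter; iter = iter->parent)
		if (iter->oom_lock)
			break;
	assert(iter);		/* BUG_ON(!iter) in the patch */
	iter->oom_lock = 0;
	clear_under_oom(iter);	/* reset the whole owned subtree */
}

int main(void)
{
	/* A -> B -> {C, D}, the hierarchy from the changelog example */
	struct memcg A = { "A", 0, 0, 1, NULL, { NULL, NULL } };
	struct memcg B = { "B", 0, 0, 1, &A,   { NULL, NULL } };
	struct memcg C = { "C", 0, 0, 1, &B,   { NULL, NULL } };
	struct memcg D = { "D", 0, 0, 1, &B,   { NULL, NULL } };

	A.child[0] = &B;
	B.child[0] = &C;
	B.child[1] = &D;

	/* C hits B's limit: B's subtree is locked, B is the owner */
	printf("lock B: %d (B owner=%d)\n", oom_lock(&B), B.oom_lock);
	/* A hits its own limit: ownership moves up to A, no new kill */
	printf("lock A: %d (A owner=%d, B owner=%d)\n",
	       oom_lock(&A), A.oom_lock, B.oom_lock);
	/* the task that locked B unlocks; walking up finds the owner A */
	oom_unlock(&B);
	printf("after unlock: A owner=%d, C under_oom=%d\n",
	       A.oom_lock, C.under_oom);
	return 0;
}
==

The point of moving ownership upward on lock is that a later unlock,
started from any memcg in the subtree, finds exactly one owner on the
way up; the BUG_ON(!iter) documents that invariant.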