On Mon, 27 Dec 2010 22:29:05 -0800 (PST)
David Rientjes <rientjes@xxxxxxxxxx> wrote:

> Completely disabling the oom killer for a memcg is problematic if
> userspace is unable to address the condition itself, usually because it
> is unresponsive.  This scenario creates a memcg deadlock: tasks are
> sitting in TASK_KILLABLE waiting for the limit to be increased, a task to
> exit or move, or the oom killer reenabled and userspace is unable to do
> so.
>
> An additional possible use case is to defer oom killing within a memcg
> for a set period of time, probably to prevent unnecessary kills due to
> temporary memory spikes, before allowing the kernel to handle the
> condition.
>
> This patch adds an oom killer delay so that a memcg may be configured to
> wait at least a pre-defined number of milliseconds before calling the oom
> killer.  If the oom condition persists for this number of milliseconds,
> the oom killer will be called the next time the memory controller
> attempts to charge a page (and memory.oom_control is set to 0).  This
> allows userspace to have a short period of time to respond to the
> condition before deferring to the kernel to kill a task.
>
> Admins may set the oom killer delay using the new interface:
>
> 	# echo 60000 > memory.oom_delay_millisecs
>
> This will defer oom killing to the kernel only after 60 seconds has
> elapsed.  When setting memory.oom_delay, all pending delays have their
> charge retried and, if necessary, the new delay is then effected.
>
> The delay is cleared the first time the memcg is oom to avoid unnecessary
> waiting when userspace is unresponsive for future oom conditions.  It may
> be set again using the above interface to enforce a delay on the next
> oom.
>
> Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>

Changelog please.
> ---
>  Documentation/cgroups/memory.txt |   26 +++++++++++++++++++++
>  mm/memcontrol.c                  |   46 ++++++++++++++++++++++++++++++++++---
>  2 files changed, 68 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 7781857..e426733 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -68,6 +68,7 @@ Brief summary of control files.
>  				 (See sysctl's vm.swappiness)
>   memory.move_charge_at_immigrate # set/show controls of moving charges
>   memory.oom_control		 # set/show oom controls.
> + memory.oom_delay_millisecs	 # set/show millisecs to wait before oom kill
>
>  1. History
>
> @@ -640,6 +641,31 @@ At reading, current status of OOM is shown.
>  	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
>  			 be stopped.)
>
> +It is also possible to configure an oom killer timeout to prevent the
> +possibility that the memcg will deadlock looking for memory if userspace
> +has disabled the oom killer with oom_control but cannot act to fix the
> +condition itself (usually because userspace has become unresponsive).
> +
> +To set an oom killer timeout for a memcg, write the number of milliseconds
> +to wait before killing a task to memory.oom_delay_millisecs:
> +
> +	# echo 60000 > memory.oom_delay_millisecs	# 60 seconds before kill
> +
> +This timeout is reset the first time the memcg is oom to prevent needlessly
> +waiting for the next oom when userspace is truly unresponsive.  It may be
> +set again using the above interface to defer killing a task the next time
> +the memcg is oom.
> +
> +Disabling the oom killer for a memcg with memory.oom_control takes
> +precedence over memory.oom_delay_millisecs, so it must be set to 0
> +(default) to allow the oom kill after the delay has expired.
> +
> +This value is inherited from the memcg's parent on creation.
> +
> +There is no delay if memory.oom_delay_millisecs is set to 0 (default).
> +This tunable's upper bound is 60 minutes.
Why is the upper bound 60 minutes?  Do we have to have a limit at all?
Hmm, I feel 60 minutes is too short.  I'd prefer a 31- or 32-bit limit.

> +
> +
>  11. TODO
>
>  1. Add support for accounting huge pages (as a separate controller)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e6aadd6..951a22c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -229,6 +229,8 @@ struct mem_cgroup {
>  	unsigned int	swappiness;
>  	/* OOM-Killer disable */
>  	int		oom_kill_disable;
> +	/* number of ticks to stall before calling oom killer */
> +	int		oom_delay;
>
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
> @@ -1415,10 +1417,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem)
>  /*
>   * try to call OOM killer. returns false if we should exit memory-reclaim loop.
>   */
> -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
> +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>  {
>  	struct oom_wait_info owait;
>  	bool locked, need_to_kill;
> +	long timeout = MAX_SCHEDULE_TIMEOUT;
>
>  	owait.mem = mem;
>  	owait.wait.flags = 0;
> @@ -1437,15 +1440,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>  	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
>  	if (!locked || mem->oom_kill_disable)
>  		need_to_kill = false;
> -	if (locked)
> +	if (locked) {
> +		if (mem->oom_delay) {
> +			need_to_kill = false;
> +			timeout = mem->oom_delay;
> +			mem->oom_delay = 0;
> +		}
>  		mem_cgroup_oom_notify(mem);
> +	}
>  	mutex_unlock(&memcg_oom_mutex);
>
>  	if (need_to_kill) {
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  		mem_cgroup_out_of_memory(mem, mask);
>  	} else {
> -		schedule();
> +		schedule_timeout(timeout);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  	}
>  	mutex_lock(&memcg_oom_mutex);
> @@ -1456,7 +1465,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>  	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
>  		return false;
>  	/* Give chance to dying process */
> -	schedule_timeout(1);
> +	if (timeout != MAX_SCHEDULE_TIMEOUT)

`!=`?  This seems to change existing behavior.

> +		schedule_timeout(1);
>  	return true;
>  }
>
> @@ -3863,6 +3873,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
>  	return 0;
>  }
>
> +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp,
> +					struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +	return jiffies_to_msecs(memcg->oom_delay);
> +}
> +
> +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp,
> +					struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +	/* Sanity check -- don't wait longer than an hour */
> +	if (val > (60 * 60 * 1000))
> +		return -EINVAL;
> +
> +	memcg->oom_delay = msecs_to_jiffies(val);
> +	memcg_oom_recover(memcg);
> +	return 0;
> +}
> +

Please allow this only at the root of a sub-hierarchy, not on any of its
children....(*)
(Please check how mem_cgroup_oom_lock/unlock() works under use_hierarchy=1.)

>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -3926,6 +3958,11 @@ static struct cftype mem_cgroup_files[] = {
>  		.unregister_event = mem_cgroup_oom_unregister_event,
>  		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
>  	},
> +	{
> +		.name = "oom_delay_millisecs",
> +		.read_u64 = mem_cgroup_oom_delay_millisecs_read,
> +		.write_u64 = mem_cgroup_oom_delay_millisecs_write,
> +	},
>  };
>
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -4164,6 +4201,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  		parent = mem_cgroup_from_cont(cont->parent);
>  		mem->use_hierarchy = parent->use_hierarchy;
>  		mem->oom_kill_disable = parent->oom_kill_disable;
> +		mem->oom_delay = parent->oom_delay;

Because of (*), oom_kill_disable can be copied here.  If you want to
inherit this, you should do (*) or update the value throughout the
hierarchy.
Thanks,
-Kame