Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because it is unresponsive. This scenario creates a memcg deadlock: tasks are sitting in TASK_KILLABLE waiting for the limit to be increased, a task to exit or move, or the oom killer reenabled and userspace is unable to do so. An additional possible use case is to defer oom killing within a memcg for a set period of time, probably to prevent unnecessary kills due to temporary memory spikes, before allowing the kernel to handle the condition. This patch adds an oom killer delay so that a memcg may be configured to wait at least a pre-defined number of milliseconds before calling the oom killer. If the oom condition persists for this number of milliseconds, the oom killer will be called the next time the memory controller attempts to charge a page (and memory.oom_control is set to 0). This allows userspace to have a short period of time to respond to the condition before deferring to the kernel to kill a task. Admins may set the oom killer delay using the new interface: # echo 60000 > memory.oom_delay_millisecs This will defer oom killing to the kernel only after 60 seconds has elapsed. When setting memory.oom_delay, all pending delays have their charge retried and, if necessary, the new delay is then effected. The delay is cleared the first time the memcg is oom to avoid unnecessary waiting when userspace is unresponsive for future oom conditions. It may be set again using the above interface to enforce a delay on the next oom. Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx> --- Documentation/cgroups/memory.txt | 26 +++++++++++++++++++++ mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++--- 2 files changed, 68 insertions(+), 4 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 7781857..e426733 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -68,6 +68,7 @@ Brief summary of control files. (See sysctl's vm.swappiness) memory.move_charge_at_immigrate # set/show controls of moving charges memory.oom_control # set/show oom controls. + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill 1. History @@ -640,6 +641,31 @@ At reading, current status of OOM is shown. under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may be stopped.) +It is also possible to configure an oom killer timeout to prevent the +possibility that the memcg will deadlock looking for memory if userspace +has disabled the oom killer with oom_control but cannot act to fix the +condition itself (usually because userspace has become unresponsive). + +To set an oom killer timeout for a memcg, write the number of milliseconds +to wait before killing a task to memory.oom_delay_millisecs: + + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill + +This timeout is reset the first time the memcg is oom to prevent needlessly +waiting for the next oom when userspace is truly unresponsive. It may be +set again using the above interface to defer killing a task the next time +the memcg is oom. + +Disabling the oom killer for a memcg with memory.oom_control takes +precedence over memory.oom_delay_millisecs, so it must be set to 0 +(default) to allow the oom kill after the delay has expired. + +This value is inherited from the memcg's parent on creation. + +There is no delay if memory.oom_delay_millisecs is set to 0 (default). +This tunable's upper bound is 60 minutes. + + 11. TODO 1. Add support for accounting huge pages (as a separate controller) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e6aadd6..951a22c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -229,6 +229,8 @@ struct mem_cgroup { unsigned int swappiness; /* OOM-Killer disable */ int oom_kill_disable; + /* number of ticks to stall before calling oom killer */ + int oom_delay; /* set when res.limit == memsw.limit */ bool memsw_is_minimum; @@ -1415,10 +1417,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem) /* * try to call OOM killer. returns false if we should exit memory-reclaim loop. */ -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) { struct oom_wait_info owait; bool locked, need_to_kill; + long timeout = MAX_SCHEDULE_TIMEOUT; owait.mem = mem; owait.wait.flags = 0; @@ -1437,15 +1440,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || mem->oom_kill_disable) need_to_kill = false; - if (locked) + if (locked) { + if (mem->oom_delay) { + need_to_kill = false; + timeout = mem->oom_delay; + mem->oom_delay = 0; + } mem_cgroup_oom_notify(mem); + } mutex_unlock(&memcg_oom_mutex); if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(mem, mask); } else { - schedule(); + schedule_timeout(timeout); finish_wait(&memcg_oom_waitq, &owait.wait); } mutex_lock(&memcg_oom_mutex); @@ -1456,7 +1465,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) return false; /* Give chance to dying process */ - schedule_timeout(1); + if (timeout != MAX_SCHEDULE_TIMEOUT) + schedule_timeout(1); return true; } @@ -3863,6 +3873,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, return 0; } +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + return jiffies_to_msecs(memcg->oom_delay); +} + +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp, + struct cftype *cft, u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + /* Sanity check -- don't wait longer than an hour */ + if (val > (60 * 60 * 1000)) + return -EINVAL; + + memcg->oom_delay = msecs_to_jiffies(val); + memcg_oom_recover(memcg); + return 0; +} + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -3926,6 +3958,11 @@ static struct cftype mem_cgroup_files[] = { .unregister_event = mem_cgroup_oom_unregister_event, .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), }, + { + .name = "oom_delay_millisecs", + .read_u64 = mem_cgroup_oom_delay_millisecs_read, + .write_u64 = mem_cgroup_oom_delay_millisecs_write, + }, }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP @@ -4164,6 +4201,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) parent = mem_cgroup_from_cont(cont->parent); mem->use_hierarchy = parent->use_hierarchy; mem->oom_kill_disable = parent->oom_kill_disable; + mem->oom_delay = parent->oom_delay; } if (parent && parent->use_hierarchy) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>