On Wed, Aug 07, 2019 at 09:59:27AM +0200, Michal Hocko wrote:
> On Tue 06-08-19 18:01:50, Johannes Weiner wrote:
> > On Tue, Aug 06, 2019 at 09:27:05AM -0700, Suren Baghdasaryan wrote:
> [...]
> > > > > I'm not sure 10s is the perfect value here, but I do think the
> > > > > kernel should try to get out of such a state, where interacting
> > > > > with the system is impossible, within a reasonable amount of time.
> > > > >
> > > > > It could be a little too short for non-interactive
> > > > > number-crunching systems...
> > > >
> > > > Would it be possible to have a module with tuning knobs as
> > > > parameters and hook into the PSI infrastructure? People can play
> > > > with the setting to their need, we wouldn't really have to think
> > > > about the user visible API for the tuning and this could be easily
> > > > adopted as an opt-in mechanism without a risk of regressions.
> >
> > It's relatively easy to trigger a livelock that disables the entire
> > system for good, as a regular user. It's a little weird to make the
> > bug fix for that an opt-in with an extensive configuration interface.
>
> Yes, I definitely do agree that this is a bug fix more than a
> feature. The thing is that we do not know what the proper default is
> for a wide variety of workloads, so some way of configurability is
> needed (level and period). If making this a module would require a lot
> of additional code then we need a kernel command line parameter at
> least.
>
> A module would have a nice advantage that you can change your
> configuration without rebooting. The same can be achieved by a sysctl
> on the other hand.

That's reasonable. How about my initial patch, but behind a config
option and with the level and period configurable?

---
>From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Mon, 5 Aug 2019 13:15:16 -0400
Subject: [PATCH] psi: trigger the OOM killer on severe thrashing

Over the last few years we have had many reports that the kernel can
enter an extended livelock situation under sufficient memory pressure.
The system becomes unresponsive and fully IO-bound for indefinite
periods of time, and often the user has no choice but to reboot. Even
though the system is clearly struggling with a shortage of memory, the
OOM killer is not engaging reliably.

The reason is that with bigger RAM, and in particular with faster SSDs,
page reclaim does not necessarily fail in the traditional sense
anymore. In the time it takes the CPU to run through the vast LRU
lists, there are almost always some cache pages that have finished
reading in and can be reclaimed, even before userspace has had a chance
to access them. As a result, reclaim is nominally succeeding, but
userspace is refault-bound and not making significant progress.

While this is clearly noticeable to human beings, the kernel cannot
actually determine this state from the traditional memory event
counters. We might see a certain rate of reclaim activity or refaults,
but how long, or whether at all, userspace is unproductive because of
it depends on IO speed, readahead efficiency, as well as the memory
access patterns and concurrency of the userspace applications. The same
rate of VM events could go unnoticed in one system/workload
combination, and result in an indefinite lockup in a different one.

However, eb414681d5a0 ("psi: pressure stall information for CPU,
memory, and IO") introduced a memory pressure metric that quantifies
the share of wallclock time in which userspace waits on reclaim,
refaults, and swapins.

By using absolute time, it encodes all the above-mentioned variables of
hardware capacity and workload behavior. When memory pressure is 40%,
it means that 40% of the time the workload is stalled on memory,
period. This is the actual measure for the lack of forward progress
that users can experience. It's also something they expect the kernel
to manage and remedy when forward progress becomes non-existent.

To that end, this patch implements a thrashing cutoff for the OOM
killer. If the kernel determines a sustained high level of memory
pressure, and thus a lack of forward progress in userspace, it will
trigger the OOM killer to reduce memory contention.

By default, the OOM killer will engage after 15 seconds of at least 80%
memory pressure. These values are tunable via the sysctls
vm.thrashing_oom_period and vm.thrashing_oom_level.

Ideally, this would be standard behavior for the kernel, but since it
involves a new metric and OOM killing, let's be safe and make it an
opt-in via CONFIG_THRASHING_OOM. Setting vm.thrashing_oom_level to 0
also disables the feature at runtime.

Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Reported-by: "Artem S. Tashkinov" <aros@xxxxxxx>
---
 Documentation/admin-guide/sysctl/vm.rst | 24 ++++++++
 include/linux/psi.h                     |  5 ++
 include/linux/psi_types.h               |  6 ++
 kernel/sched/psi.c                      | 74 +++++++++++++++++++++++++
 kernel/sysctl.c                         | 20 +++++++
 mm/Kconfig                              | 20 +++++++
 6 files changed, 149 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 64aeee1009ca..0332cb52bcfc 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -66,6 +66,8 @@ files can be found in mm/swap.c.
 - stat_interval
 - stat_refresh
 - numa_stat
+- thrashing_oom_level
+- thrashing_oom_period
 - swappiness
 - unprivileged_userfaultfd
 - user_reserve_kbytes
@@ -825,6 +827,28 @@ When page allocation performance is not a bottleneck and you want all
 
         echo 1 > /proc/sys/vm/numa_stat
 
+thrashing_oom_level
+===================
+
+This defines the memory pressure level for severe thrashing at which
+the OOM killer will be engaged.
+
+The default is 80. This means the system is considered to be thrashing
+severely when all active tasks are collectively stalled on memory
+(waiting for page reclaim, refaults, swapins, etc.) for 80% of the time.
+
+A setting of 0 will disable thrashing-based OOM killing.
+
+
+thrashing_oom_period
+====================
+
+This defines the number of seconds the system must sustain severe
+thrashing at thrashing_oom_level before the OOM killer is invoked.
+
+The default is 15.
+
+
 swappiness
 ==========
 
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7b3de7321219..661ce45900f9 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -37,6 +37,11 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
                           poll_table *wait);
 #endif
 
+#ifdef CONFIG_THRASHING_OOM
+extern unsigned int sysctl_thrashing_oom_level;
+extern unsigned int sysctl_thrashing_oom_period;
+#endif
+
 #else /* CONFIG_PSI */
 
 static inline void psi_init(void) {}
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 07aaf9b82241..7c57d7e5627e 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -162,6 +162,12 @@ struct psi_group {
         u64 polling_total[NR_PSI_STATES - 1];
         u64 polling_next_update;
         u64 polling_until;
+
+#ifdef CONFIG_THRASHING_OOM
+        /* Severe thrashing state tracking */
+        bool oom_pressure;
+        u64 oom_pressure_start;
+#endif
 };
 
 #else /* CONFIG_PSI */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index f28342dc65ec..4b1b620d6359 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -139,6 +139,7 @@
 #include <linux/ctype.h>
 #include <linux/file.h>
 #include <linux/poll.h>
+#include <linux/oom.h>
 #include <linux/psi.h>
 
 #include "sched.h"
@@ -177,6 +178,14 @@ struct psi_group psi_system = {
         .pcpu = &system_group_pcpu,
 };
 
+#ifdef CONFIG_THRASHING_OOM
+static void psi_oom_tick(struct psi_group *group, u64 now);
+#else
+static inline void psi_oom_tick(struct psi_group *group, u64 now)
+{
+}
+#endif
+
 static void psi_avgs_work(struct work_struct *work);
 
 static void group_init(struct psi_group *group)
@@ -403,6 +412,8 @@
                 calc_avgs(group->avg[s], missed_periods, sample, period);
         }
 
+        psi_oom_tick(group, now);
+
         return avg_next_update;
 }
 
@@ -1280,3 +1291,66 @@ static int __init psi_proc_init(void)
         return 0;
 }
 module_init(psi_proc_init);
+
+#ifdef CONFIG_THRASHING_OOM
+/*
+ * Trigger the OOM killer when detecting severe thrashing.
+ *
+ * Per default we define severe thrashing as 15 seconds of 80% memory
+ * pressure (i.e. all active tasks are collectively stalled on memory
+ * 80% of the time).
+ */
+unsigned int sysctl_thrashing_oom_level = 80;
+unsigned int sysctl_thrashing_oom_period = 15;
+
+static void psi_oom_tick(struct psi_group *group, u64 now)
+{
+        struct oom_control oc = {
+                .order = 0,
+        };
+        unsigned long pressure;
+        bool high;
+
+        /* Disabled at runtime */
+        if (!sysctl_thrashing_oom_level)
+                return;
+
+        /*
+         * Protect the system from livelocking due to thrashing. Leave
+         * per-cgroup policies to oomd, lmkd etc.
+         */
+        if (group != &psi_system)
+                return;
+
+        pressure = LOAD_INT(group->avg[PSI_MEM_FULL][0]);
+        high = pressure >= sysctl_thrashing_oom_level;
+
+        if (!group->oom_pressure && !high)
+                return;
+
+        if (!group->oom_pressure && high) {
+                group->oom_pressure = true;
+                group->oom_pressure_start = now;
+                return;
+        }
+
+        if (group->oom_pressure && !high) {
+                group->oom_pressure = false;
+                return;
+        }
+
+        if (now < group->oom_pressure_start +
+            (u64)sysctl_thrashing_oom_period * NSEC_PER_SEC)
+                return;
+
+        pr_warn("Severe thrashing detected! (%ds of %d%% memory pressure)\n",
+                sysctl_thrashing_oom_period, sysctl_thrashing_oom_level);
+
+        group->oom_pressure = false;
+
+        if (!mutex_trylock(&oom_lock))
+                return;
+        out_of_memory(&oc);
+        mutex_unlock(&oom_lock);
+}
+#endif /* CONFIG_THRASHING_OOM */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f12888971d66..3b9b3deb1836 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -68,6 +68,7 @@
 #include <linux/bpf.h>
 #include <linux/mount.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/psi.h>
 
 #include "../lib/kstrtox.h"
 
@@ -1746,6 +1747,25 @@ static struct ctl_table vm_table[] = {
                 .extra1         = SYSCTL_ZERO,
                 .extra2         = SYSCTL_ONE,
         },
+#endif
+#ifdef CONFIG_THRASHING_OOM
+        {
+                .procname       = "thrashing_oom_level",
+                .data           = &sysctl_thrashing_oom_level,
+                .maxlen         = sizeof(unsigned int),
+                .mode           = 0644,
+                .proc_handler   = proc_dointvec_minmax,
+                .extra1         = SYSCTL_ZERO,
+                .extra2         = &one_hundred,
+        },
+        {
+                .procname       = "thrashing_oom_period",
+                .data           = &sysctl_thrashing_oom_period,
+                .maxlen         = sizeof(unsigned int),
+                .mode           = 0644,
+                .proc_handler   = proc_dointvec_minmax,
+                .extra1         = SYSCTL_ZERO,
+        },
 #endif
         { }
 };
diff --git a/mm/Kconfig b/mm/Kconfig
index 56cec636a1fc..cef13b423beb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,24 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
         bool
 
+config THRASHING_OOM
+        bool "Trigger the OOM killer on severe thrashing"
+        select PSI
+        help
+          Under memory pressure, the kernel can enter severe thrashing
+          or swap storms during which the system is fully IO-bound and
+          does not respond to any user input. The OOM killer does not
+          always engage because page reclaim manages to make nominal
+          forward progress, but the system is effectively livelocked.
+
+          This feature uses pressure stall information (PSI) to detect
+          severe thrashing and trigger the OOM killer.
+
+          The OOM killer will be engaged when the system sustains a
+          memory pressure level of 80% for 15 seconds. This can be
+          adjusted using the vm.thrashing_oom_[level|period] sysctls.
+
+          Say Y if you have observed your system becoming unresponsive
+          for extended periods under memory pressure.
+
 endmenu
-- 
2.22.0
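
A note for anyone who wants to watch for this condition from userspace
while testing: the sketch below is a rough illustration only and not
part of the patch. It polls the /proc/pressure/memory interface added
by eb414681d5a0 and reports once the "full avg10" average has stayed at
or above a threshold for a sustained window, roughly mirroring the
psi_oom_tick() logic above. The file name, the constants (80%/15s, same
as the patch defaults), and the program itself are assumptions made for
the example, not kernel API beyond what PSI already exports.

/*
 * thrash-watch.c: rough userspace approximation of the detection logic
 * in psi_oom_tick(), for observing a system under test. Reads the
 * "full avg10" value from /proc/pressure/memory and reports when it
 * has stayed at or above LEVEL percent for PERIOD seconds. Names and
 * constants are illustrative; they mirror the patch defaults.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define LEVEL   80.0    /* percent, cf. vm.thrashing_oom_level */
#define PERIOD  15      /* seconds, cf. vm.thrashing_oom_period */

/* Return the "full avg10" percentage, or -1.0 on error. */
static double read_full_avg10(void)
{
        char line[256];
        double avg10 = -1.0;
        FILE *f = fopen("/proc/pressure/memory", "r");

        if (!f)
                return -1.0;
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "full avg10=%lf", &avg10) == 1)
                        break;
        }
        fclose(f);
        return avg10;
}

int main(void)
{
        time_t high_since = 0;

        for (;;) {
                double pressure = read_full_avg10();

                if (pressure >= LEVEL) {
                        if (!high_since)
                                high_since = time(NULL);
                        else if (time(NULL) - high_since >= PERIOD)
                                printf("severe thrashing: full avg10=%.2f%% for >= %d seconds\n",
                                       pressure, PERIOD);
                } else {
                        high_since = 0;
                }
                sleep(2);
        }
        return 0;
}

Build it with something like "gcc -O2 -o thrash-watch thrash-watch.c"
and start it before launching a memory hog. Since it only reads the
aggregated 10-second average, it cannot react any faster than that
window, which is the same granularity the kernel-side check uses.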