On Mon, Aug 05, 2019 at 02:13:16PM +0200, Vlastimil Babka wrote: > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote: > > Hello, > > > > There's this bug which has been bugging many people for many years > > already and which is reproducible in less than a few minutes under the > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to > > defaults. > > > > Steps to reproduce: > > > > 1) Boot with mem=4G > > 2) Disable swap to make everything faster (sudo swapoff -a) > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox > > 4) Start opening tabs in either of them and watch your free RAM decrease > > > > Once you hit a situation when opening a new tab requires more RAM than > > is currently available, the system will stall hard. You will barely be > > able to move the mouse pointer. Your disk LED will be flashing > > incessantly (I'm not entirely sure why). You will not be able to run new > > applications or close currently running ones. > > > This little crisis may continue for minutes or even longer. I think > > that's not how the system should behave in this situation. I believe > > something must be done about that to avoid this stall. > > Yeah that's a known problem, made worse SSD's in fact, as they are able > to keep refaulting the last remaining file pages fast enough, so there > is still apparent progress in reclaim and OOM doesn't kick in. > > At this point, the likely solution will be probably based on pressure > stall monitoring (PSI). I don't know how far we are from a built-in > monitor with reasonable defaults for a desktop workload, so CCing > relevant folks. Yes, psi was specifically developed to address this problem. Before it, the kernel had to make all decisions based on relative event rates but had no notion of time. Whereas to the user, time is clearly an issue, and in fact makes all the difference. So psi quantifies the time the workload spends executing vs. spinning its wheels. But choosing a universal cutoff for killing is not possible, since it depends on the workload and the user's expectation: GUI and other latency-sensitive applications care way before a compile job or video encoding would care. Because of that, there are things like oomd and lmkd as mentioned, to leave the exact policy decision to userspace. That being said, I think we should be able to provide a bare minimum inside the kernel to avoid complete livelocks where the user does not believe the machine would be able to recover without a reboot. The goal wouldn't be a glitch-free user experience - the kernel does not know enough about the applications to even attempt that. It should just not hang indefinitely. Maybe similar to the hung task detector. How about something like the below patch? With that, the kernel catches excessive thrashing that happens before reclaim fails: [root@ham ~]# stress -d 128 -m 5 stress: info: [344] dispatching hogs: 0 cpu, 0 io, 5 vm, 128 hdd Excessive and sustained system-wide memory pressure! kworker/1:2 invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0 CPU: 1 PID: 77 Comm: kworker/1:2 Not tainted 5.3.0-rc1-mm1-00121-ge34a5cf28771 #142 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014 Workqueue: events psi_avgs_work Call Trace: dump_stack+0x46/0x60 dump_header+0x5c/0x3d5 ? irq_work_queue+0x46/0x50 ? wake_up_klogd+0x2b/0x30 ? vprintk_emit+0xe5/0x190 oom_kill_process.cold.10+0xb/0x10 out_of_memory+0x1ea/0x260 update_averages.cold.8+0x14/0x25 ? collect_percpu_times+0x84/0x1f0 psi_avgs_work+0x80/0xc0 process_one_work+0x1bb/0x310 worker_thread+0x28/0x3c0 ? process_one_work+0x310/0x310 kthread+0x108/0x120 ? __kthread_create_on_node+0x170/0x170 ret_from_fork+0x35/0x40 Mem-Info: active_anon:109463 inactive_anon:109564 isolated_anon:298 active_file:4676 inactive_file:4073 isolated_file:455 unevictable:0 dirty:8475 writeback:8 unstable:0 slab_reclaimable:2585 slab_unreclaimable:4932 mapped:413 shmem:2 pagetables:1747 bounce:0 free:13472 free_pcp:17 free_cma:0 Possible snags and questions: 1. psi is an optional feature right now, but these livelocks commonly affect desktop users. What should be the default behavior? 2. Should we make the pressure cutoff and time period configurable? I fear we would open a can of worms similar to the existing OOM killer, where users are trying to use a kernel self-protection mechanism to implement workload QoS and priorities - things that should firmly be kept in userspace. 3. swapoff annotation. Due to the swapin annotation, swapoff currently raises memory pressure. It probably shouldn't. But this will be a bigger problem if we trigger the oom killer based on it. 4. Killing once every 10s assumes basically one big culprit. If the pressure is created by many different processes, fixing the situation could take quite a while. What oomd does to solve this is to monitor the PGSCAN counters after a kill, to tell whether pressure is persisting, or just from residual refaults after the culprit has been dealt with. We may need to do something similar here. Or find a solution to encode that distinction into psi itself, and it would also take care of the swapoff problem, since it's basically the same thing - residual refaults without any reclaim pressure to sustain them. Anyway, here is the draft patch: >From e34a5cf28771d69f13faa0e933adeae44b26b8aa Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@xxxxxxxxxxx> Date: Mon, 5 Aug 2019 13:15:16 -0400 Subject: [PATCH] psi oom --- include/linux/psi_types.h | 4 +++ kernel/sched/psi.c | 52 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 56 insertions(+) diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 07aaf9b82241..390446b07ac7 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -162,6 +162,10 @@ struct psi_group { u64 polling_total[NR_PSI_STATES - 1]; u64 polling_next_update; u64 polling_until; + + /* Out-of-memory situation tracking */ + bool oom_pressure; + u64 oom_pressure_start; }; #else /* CONFIG_PSI */ diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index f28342dc65ec..1027b6611ec2 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -139,6 +139,7 @@ #include <linux/ctype.h> #include <linux/file.h> #include <linux/poll.h> +#include <linux/oom.h> #include <linux/psi.h> #include "sched.h" @@ -177,6 +178,8 @@ struct psi_group psi_system = { .pcpu = &system_group_pcpu, }; +static void psi_oom_tick(struct psi_group *group, u64 now); + static void psi_avgs_work(struct work_struct *work); static void group_init(struct psi_group *group) @@ -403,6 +406,8 @@ static u64 update_averages(struct psi_group *group, u64 now) calc_avgs(group->avg[s], missed_periods, sample, period); } + psi_oom_tick(group, now); + return avg_next_update; } @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void) return 0; } module_init(psi_proc_init); + +#define OOM_PRESSURE_LEVEL 80 +#define OOM_PRESSURE_PERIOD (10 * NSEC_PER_SEC) + +static void psi_oom_tick(struct psi_group *group, u64 now) +{ + struct oom_control oc = { + .order = 0, + }; + unsigned long pressure; + bool high; + + /* + * Protect the system from livelocking due to thrashing. Leave + * per-cgroup policies to oomd, lmkd etc. + */ + if (group != &psi_system) + return; + + pressure = LOAD_INT(group->avg[PSI_MEM_FULL][0]); + high = pressure >= OOM_PRESSURE_LEVEL; + + if (!group->oom_pressure && !high) + return; + + if (!group->oom_pressure && high) { + group->oom_pressure = true; + group->oom_pressure_start = now; + return; + } + + if (group->oom_pressure && !high) { + group->oom_pressure = false; + return; + } + + if (now < group->oom_pressure_start + OOM_PRESSURE_PERIOD) + return; + + group->oom_pressure = false; + + if (!mutex_trylock(&oom_lock)) + return; + pr_warn("Excessive and sustained system-wide memory pressure!\n"); + out_of_memory(&oc); + mutex_unlock(&oom_lock); +} -- 2.22.0