On Wed, Aug 07, 2019 at 02:01:30PM -0700, Andrew Morton wrote:
> On Wed, 7 Aug 2019 16:51:38 -0400 Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> > However, eb414681d5a0 ("psi: pressure stall information for CPU,
> > memory, and IO") introduced a memory pressure metric that quantifies
> > the share of wallclock time in which userspace waits on reclaim,
> > refaults, swapins. By using absolute time, it encodes all the above
> > mentioned variables of hardware capacity and workload behavior. When
> > memory pressure is 40%, it means that 40% of the time the workload is
> > stalled on memory, period. This is the actual measure for the lack of
> > forward progress that users can experience. It's also something they
> > expect the kernel to manage and remedy if it becomes non-existent.
> >
> > To accomplish this, this patch implements a thrashing cutoff for the
> > OOM killer. If the kernel determines a sustained high level of memory
> > pressure, and thus a lack of forward progress in userspace, it will
> > trigger the OOM killer to reduce memory contention.
> >
> > Per default, the OOM killer will engage after 15 seconds of at least
> > 80% memory pressure. These values are tunable via sysctls
> > vm.thrashing_oom_period and vm.thrashing_oom_level.
>
> Could be implemented in userspace?
> </troll>

We do in fact do this with oomd.

But it requires a comprehensive cgroup setup, with complete memory and
IO isolation, to protect that daemon from the memory pressure and
excessive paging of the rest of the system (mlock doesn't really cut
it because you need to potentially allocate quite a few proc dentries
and inodes just to walk the process tree and determine a kill target).

In a fleet that works fine, since we need to maintain that cgroup
infra anyway. But for other users, that's a lot of stack for basic
"don't hang forever if I allocate too much memory" functionality.
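
For illustration only, here is a minimal userspace sketch of the cutoff policy
described in the quoted patch: flag thrashing once memory pressure has stayed
at or above a level (80%) for a full period (15 seconds). It is not the kernel
implementation and not how oomd works. The names parse_psi and
ThrashingDetector are hypothetical, and using the "full" line of
/proc/pressure/memory with its cumulative total (microseconds of stall time)
is an assumption about which PSI signal to track.

```python
from typing import Dict, Optional, Tuple


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text such as the contents of /proc/pressure/memory.

    Returns e.g. {"some": {"avg10": ..., "total": ...}, "full": {...}}.
    The `total` field is cumulative stall time in microseconds.
    """
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


class ThrashingDetector:
    """Hypothetical helper: True once pressure >= level_pct for period_s."""

    def __init__(self, period_s: float = 15.0, level_pct: float = 80.0):
        self.period_s = period_s      # mirrors vm.thrashing_oom_period
        self.level_pct = level_pct    # mirrors vm.thrashing_oom_level
        self._prev: Optional[Tuple[float, float]] = None  # (time_s, total_us)
        self._high_since: Optional[float] = None

    def update(self, now_s: float, total_us: float) -> bool:
        """Feed one sample; return True if the cutoff should fire."""
        prev, self._prev = self._prev, (now_s, total_us)
        if prev is None or now_s <= prev[0]:
            return False
        # Share of wallclock time spent stalled since the previous sample.
        stalled_pct = 100.0 * (total_us - prev[1]) / ((now_s - prev[0]) * 1e6)
        if stalled_pct < self.level_pct:
            self._high_since = None   # pressure dipped: restart the clock
            return False
        if self._high_since is None:
            self._high_since = prev[0]
        return now_s - self._high_since >= self.period_s
```

A polling loop would read /proc/pressure/memory about once a second, pass
parse_psi(text)["full"]["total"] together with time.monotonic() to update(),
and pick a kill target when it returns True. As the reply above points out,
such a daemon would itself need protecting from the pressure it is measuring,
which this sketch does nothing about.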