On 4/21/21 9:18 PM, Shakeel Butt wrote:
> On Wed, Apr 21, 2021 at 11:46 AM <Peter.Enderborg@xxxxxxxx> wrote:
>> On 4/21/21 8:28 PM, Shakeel Butt wrote:
>>> On Wed, Apr 21, 2021 at 10:06 AM peter enderborg
>>> <peter.enderborg@xxxxxxxx> wrote:
>>>> On 4/20/21 3:44 AM, Shakeel Butt wrote:
>>> [...]
>>>> I think this is the wrong way to go.
>>> Which one? Are you talking about the kernel one? We already talked out
>>> of that. To decide to OOM, we need to look at a very diverse set of
>>> metrics, and it seems like that would be very hard to do flexibly
>>> inside the kernel.
>> You don't need to decide to OOM, but when an OOM occurs you
>> can take a proper action.
> No, we want the flexibility to decide when to oom-kill. The kernel is
> very conservative in triggering the oom-kill. It won't do it for you.

We use this code to solve that:

/*
 * lowmemorykiller_oom
 *
 * Author: Peter Enderborg <peter.enderborg@xxxxxxxxxxxxxx>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 */

/* add fake print format with original module name */
#define pr_fmt(fmt) "lowmemorykiller: " fmt

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/oom.h>
#include <trace/events/lmk.h>

#include "lowmemorykiller.h"
#include "lowmemorykiller_tng.h"
#include "lowmemorykiller_stats.h"
#include "lowmemorykiller_tasks.h"

/**
 * lowmemorykiller_oom_notify - OOM notifier
 * @self: notifier block struct
 * @notused: not used
 * @param: returned - number of pages freed
 *
 * Return value:
 * NOTIFY_OK
 **/
static int lowmemorykiller_oom_notify(struct notifier_block *self,
				      unsigned long notused, void *param)
{
	struct lmk_rb_watch *lrw;
	unsigned long *nfreed = param;

	lowmem_print(2, "oom notify event\n");
	*nfreed = 0;
	lmk_inc_stats(LMK_OOM_COUNT);

	spin_lock_bh(&lmk_task_lock);

	lrw = __lmk_task_first();
	if (lrw) {
		struct task_struct *selected = lrw->tsk;
		struct lmk_death_pending_entry *ldpt;

		if (!task_trylock_lmk(selected)) {
			lmk_inc_stats(LMK_ERROR);
			lowmem_print(1, "Failed to lock task.\n");
			lmk_inc_stats(LMK_BUSY);
			goto unlock_out;
		}
		get_task_struct(selected);
		/* move to kill pending set */
		ldpt = kmem_cache_alloc(lmk_dp_cache, GFP_ATOMIC);
		/* if we fail to alloc we ignore the death pending list */
		if (ldpt) {
			ldpt->tsk = selected;
			__lmk_death_pending_add(ldpt);
		} else {
			WARN_ON(1);
			lmk_inc_stats(LMK_MEM_ERROR);
			trace_lmk_sigkill(selected->pid, selected->comm,
					  LMK_TRACE_MEMERROR, *nfreed, 0);
		}
		if (!__lmk_task_remove(selected, lrw->key))
			WARN_ON(1);

		spin_unlock_bh(&lmk_task_lock);

		*nfreed = get_task_rss(selected);
		send_sig(SIGKILL, selected, 0);
		LMK_TAG_TASK_DIE(selected);
		trace_lmk_sigkill(selected->pid, selected->comm,
				  LMK_TRACE_OOMKILL, *nfreed, 0);
		task_unlock(selected);
		put_task_struct(selected);
		lmk_inc_stats(LMK_OOM_KILL_COUNT);
		goto out;
	}
unlock_out:
	spin_unlock_bh(&lmk_task_lock);
out:
	return NOTIFY_OK;
}

static struct notifier_block lowmemorykiller_oom_nb = {
	.notifier_call = lowmemorykiller_oom_notify
};

int __init
lowmemorykiller_register_oom_notifier(void)
{
	register_oom_notifier(&lowmemorykiller_oom_nb);
	return 0;
}

So what's needed is a function that knows the priority. Here that is
__lmk_task_first(), which reads from an rb-tree. You can pick whatever
priority you like. In our case it is Android, so the tree is kept in
strict oom_adj order. I think you could do the same in the old
lowmemorykiller style with a full task scan too.

> [...]
>>> Actually no. It is missing the flexibility to monitor the metrics a
>>> user cares about, based on which they decide to trigger the oom-kill.
>>> Not sure how a watchdog would replace psi/vmpressure? Userspace
>>> petting the watchdog does not mean that the system is not suffering.
>> The userspace should very much do what it does now. But when it
>> does not do what it should, including petting the watchdog, then
>> the kernel kicks in and kills a pre-defined process, or as many
>> as needed, until the monitoring can kick in again and regain
>> control.
>>
> Roman already suggested something similar (i.e. an oom-killer split
> into core and extended parts, with the core watching the extended)
> but completely in userspace. I don't see why we would want to do
> that in the kernel instead.

A watchdog in the kernel will still work if userspace is completely
broken or starved when low on memory.

>
>>> In addition oom priorities change dynamically and changing it in your
>>> system seems very hard. Cgroup awareness is missing too.
>> Why is that hard? Moving an object in an rb-tree is as good as it gets.
>>
> It is a group of objects. Anyway, that is an implementation detail.
>
> The message I got from this exchange is that we can have a watchdog
> (userspace or kernel) to further improve the reliability of userspace
> oom-killers.