On Thu 02-11-17 23:35:44, Shawn Landden wrote:
> It is common for services to be stateless around their main event loop.
> If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it
> signals to the kernel that epoll_wait() and friends may not complete,
> and the kernel may send SIGKILL if resources get tight.
>
> See my systemd patch: https://github.com/shawnl/systemd/tree/prctl
>
> Android uses this memory model for all programs, and having it in the
> kernel will enable integration with the page cache (not in this
> series).
>
> 16 bytes per process is kinda spendy, but I want to keep
> lru behavior, which mem_score_adj does not allow. When a supervisor,
> like Android's user input is keeping track this can be done in user-space.
> It could be pulled out of task_struct if an cross-indexing additional
> red-black tree is added to support pid-based lookup.

This is still an abuse and the patch is wrong. We really do have an API
to use; I fail to see why you do not use it.

[...]
> @@ -1018,6 +1060,24 @@ bool out_of_memory(struct oom_control *oc)
>  		return true;
>  	}
>
> +	/*
> +	 * Check death row for current memcg or global.
> +	 */
> +	l = oom_target_get_queue(current);
> +	if (!list_empty(l)) {
> +		struct task_struct *ts = list_first_entry(l,
> +				struct task_struct, se.oom_target_queue);
> +
> +		pr_debug("Killing pid %u from EPOLL_KILLME death row.",
> +			 ts->pid);
> +
> +		/* We use SIGKILL instead of the oom killer
> +		 * so as to cleanly interrupt ep_poll()
> +		 */
> +		send_sig(SIGKILL, ts, 1);
> +		return true;
> +	}

Still not NUMA aware and completely backwards. If this is a memcg OOM
then it is the _memcg_ that has to be evaluated, not current. The OOM
might happen further up the hierarchy due to a hard limit.

But still, you should be very clear _why_ the existing oom tuning is not
appropriate, and then we can think of a way to handle it better. But
cramming the oom selection in this way is simply not acceptable.
--
Michal Hocko
SUSE Labs
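
For reference, the "existing oom tuning" mentioned above is presumably the per-process /proc/<pid>/oom_score_adj file (the quoted "mem_score_adj" appears to mean the same knob); that reading is an assumption, not something the mail states. Below is a minimal userspace sketch of how an idle, stateless service could volunteer itself as a preferred OOM victim through that file. The helper name set_oom_score_adj() and its path parameter are hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: write an OOM score adjustment to the given file.
 * In real use the path would be "/proc/self/oom_score_adj"; it is a
 * parameter here so the helper can be exercised against an ordinary file.
 * The kernel accepts values from -1000 (never kill) to 1000 (kill first).
 * Returns 0 on success or a negative errno value on failure.
 */
static int set_oom_score_adj(const char *path, int adj)
{
	FILE *f;
	int ret = 0;

	if (adj < -1000 || adj > 1000)
		return -EINVAL;

	f = fopen(path, "w");
	if (!f)
		return -errno;

	if (fprintf(f, "%d\n", adj) < 0)
		ret = -EIO;

	/* A failed write to a procfs file is often only reported at close. */
	if (fclose(f) != 0 && ret == 0)
		ret = -errno;

	return ret;
}
```

A service would call set_oom_score_adj("/proc/self/oom_score_adj", 1000) before blocking in epoll_wait() and restore its previous value once it has work again. This only biases the existing OOM victim selection; it does not by itself give the LRU ordering the quoted patch description asks for.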