On 2019/01/16 20:09, Michal Hocko wrote:
> On Wed 16-01-19 19:55:21, Tetsuo Handa wrote:
>> This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
>> processes sharing mm have same view of oom_score_adj") and commit
>> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
>> close a race and reduce the latency at __set_oom_adj(), and reduces the
>> warning at __oom_kill_process() in order to minimize the latency.
>>
>> Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
>> to unmap the address space") introduced the worst case mentioned in
>> 44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
>> only administrators can trigger the worst case.
>>
>> Since 44a70adec910d692 did not take latency into account, we can hold RCU
>> for minutes and trigger RCU stall warnings by calling printk() on many
>> thousands of thread groups. Even without calling printk(), the latency is
>> mentioned by Yong-Taek Lee [1]. And I noticed that 44a70adec910d692 is
>> racy, and trying to fix the race will require a global lock which is too
>> costly for rare events.
>>
>> If the worst case in 44a70adec910d692 happens, it is an administrator's
>> request. Therefore, tolerate the worst case and speed up __set_oom_adj().
>
> I really do not think we care about latency. I consider the overall API
> sanity much more important. Besides that the original report you are
> referring to was never explained/shown to represent real world usecase.
> oom_score_adj is not really an interface to be tweaked in hot paths.

I do care about the latency. Holding RCU for more than 2 minutes is insane.

----------
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>
#include <signal.h>

#define STACKSIZE 8192
static int child(void *unused)
{
	pause();
	return 0;
}
int main(int argc, char *argv[])
{
	int fd = open("/proc/self/oom_score_adj", O_WRONLY);
	int i;
	char *stack = mmap(NULL, STACKSIZE, PROT_WRITE | PROT_READ,
			   MAP_ANONYMOUS | MAP_PRIVATE, EOF, 0);
	for (i = 0; i < 8192 * 4; i++)
		if (clone(child, stack + STACKSIZE, CLONE_VM, NULL) == -1)
			break;
	write(fd, "0\n", 2);
	kill(0, SIGSEGV);
	return 0;
}
----------

>
> I can be convinced otherwise but that really requires some _real_
> usecase with an explanation why there is no other way. Until then
>
> Nacked-by: Michal Hocko <mhocko@xxxxxxxx>
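
To put a number on the latency discussed above, the write() in the reproducer could be timed. The following is only a sketch under stated assumptions: time_write_ns() is a hypothetical helper that is not part of the reproducer, and it measures wall-clock time around a single write() using CLOCK_MONOTONIC, which says nothing by itself about how long RCU is held.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the reproducer above): return how long
 * a single write() of buf to path takes, in nanoseconds, or -1 on error.
 */
static int64_t time_write_ns(const char *path, const char *buf, size_t len)
{
	struct timespec start, end;
	ssize_t ret;
	int fd = open(path, O_WRONLY);

	if (fd == -1)
		return -1;
	clock_gettime(CLOCK_MONOTONIC, &start);
	ret = write(fd, buf, len);
	clock_gettime(CLOCK_MONOTONIC, &end);
	close(fd);
	/* Treat a failed or short write as an error. */
	if (ret != (ssize_t)len)
		return -1;
	return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL +
	       (end.tv_nsec - start.tv_nsec);
}
```

In the reproducer, one might call time_write_ns("/proc/self/oom_score_adj", "0\n", 2) after the clone() loop instead of the bare write(), and print the result before kill().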