On Fri, 2019-09-06 at 19:32 +0900, Tetsuo Handa wrote:
> On 2019/09/06 6:21, Qian Cai wrote:
> > On Fri, 2019-09-06 at 05:59 +0900, Tetsuo Handa wrote:
> > > On 2019/09/06 1:10, Qian Cai wrote:
> > > > On Tue, 2019-09-03 at 17:13 +0200, Michal Hocko wrote:
> > > > > On Tue 03-09-19 11:02:46, Qian Cai wrote:
> > > > > > Well, I still see OOM sometimes kill wrong processes like ssh and systemd
> > > > > > processes while running LTP OOM tests with straightforward allocation patterns.
> > > > >
> > > > > Please report those. Most cases I have seen so far just turned out to
> > > > > work as expected and memory hogs just used oom_score_adj or similar.
> > > >
> > > > Here is one where oom01 should have been the one to be killed.
> > >
> > > I assume that there are previous OOM killer events before the
> > >
> > > > [92598.855697][ T2588] Swap cache stats: add 105240923, delete 105250445, find 42196/101577
> > >
> > > line. Please be sure to include them.
> >
> > 12:00:52 oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice,task=oom01,pid=25507,uid=0
> > 12:00:52 Out of memory: Killed process 25507 (oom01) total-vm:6324780kB, anon-rss:5647168kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:11395072kB oom_score_adj:0
> > 12:00:52 oom_reaper: reaped process 25507 (oom01), now anon-rss:5647452kB, file-rss:0kB, shmem-rss:0kB
> > 12:00:52 irqbalance invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
> > (...snipped...)
> > 12:00:53 [  25391]     0 25391     2184        0    65536       32             0 oom01
> > 12:00:53 [  25392]     0 25392     2184        0    65536       39             0 oom01
> > 12:00:53 oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/tuned.service,task=tuned,pid=2629,uid=0
> > 12:00:54 Out of memory: Killed process 2629 (tuned) total-vm:424936kB, anon-rss:328kB, file-rss:1268kB, shmem-rss:0kB, UID:0 pgtables:442368kB oom_score_adj:0
> > 12:00:54 oom_reaper: reaped process 2629 (tuned), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> OK. anon-rss did not decrease when the oom_reaper gave up.
> I think this is the same as the https://lkml.org/lkml/2017/7/28/317 case.
>
> The OOM killer does not wait until existing OOM victims release their memory by
> calling exit_mmap(); it selects the next OOM victim as soon as the OOM reaper
> sets MMF_OOM_SKIP. As a result, when the OOM reaper fails to reclaim memory due
> to e.g. mlocked pages, the OOM killer immediately selects the next OOM victim.
> But since 25391 and 25392 were consuming little memory (maybe they were
> already-reaped OOM victims), neither 25391 nor 25392 was selected as the next
> OOM victim.

Yes, mlocked is troublesome. I have other incidents where crond and
systemd-udevd were killed by mistake, and it even tried to kill kworker/0.

https://cailca.github.io/files/dmesg.txt
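For reference, here is a minimal sketch of the kind of allocation pattern that
defeats the reaper. This is my own outline, not the actual LTP oom01 source, and
the 64MB chunk size is arbitrary: anonymous memory that is mlock()ed cannot be
unmapped by the OOM reaper (as far as I understand, it skips VM_LOCKED VMAs), so
the victim's anon-rss stays high even after the "oom_reaper: reaped process"
message, MMF_OOM_SKIP is set anyway, and the OOM killer moves on to an unrelated
process such as tuned above.

/* Sketch of a reproducer: lock anonymous memory so the OOM reaper
 * cannot reclaim it after the process is chosen as an OOM victim.
 * Run as root so mlock() is not limited by RLIMIT_MEMLOCK.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define CHUNK (64UL << 20)	/* 64MB per allocation, arbitrary */

int main(void)
{
	for (;;) {
		char *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			break;
		}
		/* Lock the pages; the OOM reaper leaves VM_LOCKED VMAs alone. */
		if (mlock(p, CHUNK))
			perror("mlock");
		/* Touch every page so it is actually resident (anon-rss grows). */
		memset(p, 1, CHUNK);
	}
	pause();	/* sit here until the OOM killer (or the admin) kills us */
	return 0;
}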