On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> Recently we found an issue on our production environment: when a memcg
> oom is triggered, the oom killer doesn't choose the process with the
> largest resident memory but the first scanned process. Note that all
> processes in this memcg have the same oom_score_adj, so the oom killer
> should choose the process with the largest resident memory.
>
> Below is part of the oom info, which is enough to analyze this issue.
> [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> [...]
> [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
> [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
> [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
> [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
> [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
> [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
> [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
> [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
> [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
> [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
> [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
> [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
> [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
> [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
> [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
> [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
> [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
> [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
> [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
> [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
> [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
> [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> We can see that the first scanned process, 5740 (pause), was killed even
> though its rss is only one page. That is because, when we calculate the
> oom badness in oom_badness(), we always ignore negative points and
> convert all of them to 1. Since the oom_score_adj of all processes in
> this targeted memcg has the same value (-998), the points of these
> processes are all negative. As a result, the first scanned process gets
> killed.

Such a large bias can skew results quite considerably.
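To illustrate just how considerably, here is a quick userspace model of the
calculation described above (points = rss + swapents + pgtables/PAGE_SIZE,
plus oom_score_adj normalized to per-mille of the usable pages, with anything
non-positive clamped to 1). This is only a sketch of the logic for three of
the tasks in the table, not the kernel code itself:

/*
 * Userspace sketch of the badness calculation described above,
 * using the numbers from the oom report (4kB pages, 4194304 usable pages).
 */
#include <stdio.h>

struct task {
        const char *name;
        long rss;               /* pages */
        long pgtables_bytes;
        long swapents;          /* pages */
        long adj;               /* oom_score_adj */
};

int main(void)
{
        const long totalpages = 4194304;        /* 16777216kB / 4kB pages */
        const struct task tasks[] = {
                { "pause",    1,       32768,    0, -998 },
                { "filebeat", 9248,    983040,   0, -998 },
                { "data_sim", 3967322, 37101568, 0, -998 },
        };

        for (unsigned int i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++) {
                long points = tasks[i].rss + tasks[i].swapents +
                              tasks[i].pgtables_bytes / 4096;

                /* oom_score_adj is normalized to per-mille of usable pages */
                points += tasks[i].adj * (totalpages / 1000);

                /* pre-patch behaviour: eligible tasks never score below 1 */
                if (points <= 0)
                        points = 1;

                printf("%-10s badness = %ld\n", tasks[i].name, points);
        }
        return 0;
}

All three scores come out deeply negative before the clamp and therefore end
up at 1, so the victim selection degenerates to scan order, which is how the
one-page pause task gets picked.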
> The oom_score_adj (-998) in this memcg is set by kubelet, because it is a
> Guaranteed pod, which has a higher priority to prevent it from being
> killed by a system oom.

This is really interesting! I assume that the oom_score_adj is set to protect
from the global oom situation, right? I am struggling to understand what the
expected behavior is when the oom is internal to such a group, though. Is
killing a single task from such a group a sensible choice? I am not really
familiar with kubelet, but can it cope with data_sim going away from under it
while the rest would still run? Wouldn't it make more sense to simply tear
down the whole thing? But that is a separate matter.

> To fix this issue, we should make the calculation of the oom points more
> accurate. We can achieve that by converting chosen_point from 'unsigned
> long' to 'long'.

oom_score has very coarse units because it maps all the consumed memory onto
a 0 - 1000 scale, so effectively per-mille of the usable memory.
oom_score_adj acts on top of that as a bias. This is exported to userspace
and I do not think we can change that (see Documentation/filesystems/proc.rst),
unfortunately. So your patch cannot really be accepted as is, because it
would start reporting values outside of the allowed range, unless I am doing
some math incorrectly.

On the other hand, in this particular case I believe the existing calculation
is just wrong. Usable memory is 16777216kB (4194304 pages), and the top
consumer is 3976380 pages (rss plus page table pages), i.e. 94.8%, while the
lowest memory consumer is effectively 0%. Even if we discount 94.8% by 99.8%,
we should still be left with something like 7950 pages. So the normalization
oom_badness does cuts the result too aggressively. There was quite some churn
in the calculation in the past fixing weird rounding bugs, so I have to think
some more about how to fix this properly.

That being said, even though the configuration is weird, I do agree that the
oom_badness scaling is really unexpected and the memory consumption in this
particular example should be quite telling about whom to choose as an oom
victim.
--
Michal Hocko
SUSE Labs
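For reference, the 94.8% and ~7950 pages figures above can be reproduced with
a quick back-of-the-envelope calculation. The sketch below contrasts the
current additive per-mille bias with a purely proportional discount; the
proportional variant is only an illustration of the scaling argument, not a
proposed implementation:

/*
 * Compare the additive oom_score_adj normalization with a proportional
 * discount for the top consumer (data_sim) from the report above.
 */
#include <stdio.h>

int main(void)
{
        const long totalpages = 4194304;        /* 16777216kB / 4kB pages */
        const long consumed   = 3976380;        /* data_sim rss + page tables */
        const long adj        = -998;           /* oom_score_adj */

        /* current additive normalization: adj is per-mille of usable memory */
        long additive = consumed + adj * (totalpages / 1000);

        /* proportional discount of 99.8%: keep 0.2% of the consumption */
        long proportional = consumed * (1000 + adj) / 1000;

        printf("consumed      = %ld pages (%.1f%% of usable)\n",
               consumed, 100.0 * consumed / totalpages);
        printf("additive bias = %ld pages (clamped to 1 by oom_badness)\n",
               additive);
        printf("proportional  = %ld pages (the ~7950 pages mentioned above)\n",
               proportional);
        return 0;
}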