On Wed 04-07-18 00:14:39, Petros Angelatos wrote:
> Hello,
>
> I'm facing a strange problem when I constrain an IO intensive
> application that generates a lot of dirty pages inside a v1 cgroup
> with a memory controller. After a while the OOM killer kicks in and
> kills the processes instead of throttling the allocations while dirty
> pages are being flushed. Here is a test program that reproduces the
> issue:
>
> cd /sys/fs/cgroup/memory/
> mkdir dirty-test
> echo 10485760 > dirty-test/memory.limit_in_bytes
>
> echo $$ > dirty-test/cgroup.procs
>
> rm /mnt/file_*
> for i in $(seq 500); do
>         dd if=/dev/urandom count=2048 of="/mnt/file_$i"
> done
>
> When a process gets killed I get the following trace in dmesg:
>
> > foo.sh invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
> > foo.sh cpuset=/ mems_allowed=0
> > CPU: 0 PID: 18415 Comm: foo.sh Tainted: P           O      4.17.2-1-ARCH #1
> > Hardware name: LENOVO 20F9CTO1WW/20F9CTO1WW, BIOS N1CET52W (1.20 ) 11/30/2016
[...]
> > Task in /dirty-test killed as a result of limit of /dirty-test
> > memory: usage 10240kB, limit 10240kB, failcnt 13073
> > memory+swap: usage 10240kB, limit 9007199254740988kB, failcnt 0
> > kmem: usage 1308kB, limit 9007199254740988kB, failcnt 0
> > Memory cgroup stats for /dirty-test: cache:8848KB rss:180KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:8580KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:200KB inactive_file:4364KB active_file:4364KB unevictable:0KB
> > [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > [18160]     0 18160     3468      652          73728        0             0 foo.sh
> > [18415]     0 18415     3468      118          61440        0             0 foo.sh
> > Memory cgroup out of memory: Kill process 18160 (foo.sh) score 261 or sacrifice child
> > Killed process 18415 (foo.sh) total-vm:13872kB, anon-rss:472kB, file-rss:0kB, shmem-rss:0kB
> > oom_reaper: reaped process 18415 (foo.sh), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> The cgroup v2 documentation mentions that the OOM killer will only be
> invoked when the out of memory situation happens inside a page fault,
> and this problem is always happening during a page fault so that's
> not surprising, but I'm not sure why the process ends up in a fatal
> page fault.

I assume dd just tried to fault a code page in and that failed due to
the hard limit and unreclaimable memory.

The reason why the memcg v1 oom throttling heuristic hasn't kicked in
is that there are no pages under writeback. This would match the
symptoms of the bug fixed by 1c610d5f93c7 ("mm/vmscan: wake up
flushers for legacy cgroups too") in 4.16, but there might be more to
it. You should have that fix already, so there must be something more
going on.

You've said that you are using the blkio cgroup, right? What is the
configuration? I strongly suspect that none of the writeback has
started because of the throttling.

--
Michal Hocko
SUSE Labs
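
For completeness, a minimal sketch of how to check both points from
the shell. It assumes the v1 hierarchies are mounted under
/sys/fs/cgroup and reuses the dirty-test group name from the
reproducer above; adjust the paths to the actual setup.

# Watch whether any pages in the test cgroup ever enter writeback
# while the reproducer runs; a dirty counter that keeps growing while
# writeback stays at 0 matches the symptom in the OOM report above.
while sleep 1; do
        grep -E '^(dirty|writeback) ' /sys/fs/cgroup/memory/dirty-test/memory.stat
done

# List any blkio throttling rules configured anywhere in the v1
# hierarchy; non-empty output would support the suspicion that
# writeback never starts because the block layer is throttled.
find /sys/fs/cgroup/blkio -name 'blkio.throttle.*_device' \
        -exec grep -H . {} + 2>/dev/null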