On Wed, May 11, 2022 at 9:47 PM Gang Li <ligang.bdlg@xxxxxxxxxxxxx> wrote:
>
> TLDR:
> If a mempolicy is in effect (oc->constraint == CONSTRAINT_MEMORY_POLICY), out_of_memory() will
> select a victim on the specific node to kill, so that the kernel can avoid accidental killing
> on NUMA systems.
>
> Problem:
> Before this patch series, oom will only kill the process with the highest memory usage,
> by selecting the process with the highest oom_badness on the entire system.
>
> This works fine on UMA systems, but may cause accidental killing on NUMA systems.
>
> As shown below, if process c.out is bound to Node1 and keeps allocating pages from Node1,
> a.out will be killed first. But killing a.out doesn't free any memory on Node1, so c.out
> will be killed next.
>
> A lot of our AMD machines have 8 NUMA nodes. On these systems, there is a greater chance
> of triggering this problem.
>
> OOM before patches:
> ```
> Per-node process memory usage (in MBs)
> PID             Node 0        Node 1      Total
> ----------- ---------- ------------- ----------
> 3095 a.out     3073.34          0.11    3073.45 (Killed first. Maximum memory consumption)
> 3199 b.out      501.35       1500.00    2001.35
> 3805 c.out        1.52 (grow)2248.00    2249.52 (Killed then. Node1 is full)
> ----------- ---------- ------------- ----------
> Total          3576.21       3748.11    7324.31
> ```
>
> Solution:
> We store per-node rss in mm_rss_stat for each process.
>
> If a page allocation with a mempolicy in effect (oc->constraint == CONSTRAINT_MEMORY_POLICY)
> triggers oom, we calculate oom_badness with the rss counter for the corresponding node, then
> select the process with the highest oom_badness on that node to kill.
>
> OOM after patches:
> ```
> Per-node process memory usage (in MBs)
> PID             Node 0        Node 1      Total
> ----------- ---------- ------------- ----------
> 3095 a.out     3073.34          0.11    3073.45
> 3199 b.out      501.35       1500.00    2001.35
> 3805 c.out        1.52 (grow)2248.00    2249.52 (killed)
> ----------- ---------- ------------- ----------
> Total          3576.21       3748.11    7324.31
> ```

You included lots of people but missed Michal Hocko. CC'ing him; please include him in future postings.

>
> Gang Li (5):
>   mm: add a new parameter `node` to `get/add/inc/dec_mm_counter`
>   mm: add numa_count field for rss_stat
>   mm: add numa fields for tracepoint rss_stat
>   mm: enable per numa node rss_stat count
>   mm, oom: enable per numa node oom for CONSTRAINT_MEMORY_POLICY
>
>  arch/s390/mm/pgtable.c        |   4 +-
>  fs/exec.c                     |   2 +-
>  fs/proc/base.c                |   6 +-
>  fs/proc/task_mmu.c            |  14 ++--
>  include/linux/mm.h            |  59 ++++++++++++-----
>  include/linux/mm_types_task.h |  16 +++++
>  include/linux/oom.h           |   2 +-
>  include/trace/events/kmem.h   |  27 ++++++--
>  kernel/events/uprobes.c       |   6 +-
>  kernel/fork.c                 |  70 +++++++++++++++++++-
>  mm/huge_memory.c              |  13 ++--
>  mm/khugepaged.c               |   4 +-
>  mm/ksm.c                      |   2 +-
>  mm/madvise.c                  |   2 +-
>  mm/memory.c                   | 116 ++++++++++++++++++++++++----------
>  mm/migrate.c                  |   2 +
>  mm/migrate_device.c           |   2 +-
>  mm/oom_kill.c                 |  59 ++++++++++++-----
>  mm/rmap.c                     |  16 ++---
>  mm/swapfile.c                 |   4 +-
>  mm/userfaultfd.c              |   2 +-
>  21 files changed, 317 insertions(+), 111 deletions(-)
>
> --
> 2.20.1
>
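
For anyone following the thread, the per-node victim selection described in the quoted Solution section can be sketched in plain userspace C. This is not code from the series: `struct proc`, `badness()`, `select_victim()`, `ANY_NODE` and `MAX_NODES` are hypothetical names chosen for illustration, and the score is reduced to rss alone, whereas the kernel's oom_badness() weighs more than that. The sample data is the table from the cover letter, in units of 0.01 MB.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 2
#define ANY_NODE  (-1)  /* hypothetical marker: no mempolicy constraint */

/* Hypothetical stand-in for a task carrying per-node rss counters. */
struct proc {
	int pid;
	unsigned long rss[MAX_NODES];
};

/* Numbers from the cover letter's table, in units of 0.01 MB. */
static const struct proc sample[] = {
	{ 3095, { 307334,     11 } },  /* a.out */
	{ 3199, {  50135, 150000 } },  /* b.out */
	{ 3805, {    152, 224800 } },  /* c.out */
};

/*
 * Simplified score: rss on `node`, or total rss across all nodes when
 * node == ANY_NODE. An assumption for illustration only.
 */
static unsigned long badness(const struct proc *p, int node)
{
	unsigned long total = 0;
	int i;

	assert(node == ANY_NODE || (node >= 0 && node < MAX_NODES));

	if (node != ANY_NODE)
		return p->rss[node];

	for (i = 0; i < MAX_NODES; i++)
		total += p->rss[i];
	return total;
}

/* Return the process with the highest score, or NULL if none scores > 0. */
static const struct proc *select_victim(const struct proc *procs, size_t n,
					int node)
{
	const struct proc *victim = NULL;
	unsigned long worst = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long score = badness(&procs[i], node);

		if (score > worst) {
			worst = score;
			victim = &procs[i];
		}
	}
	return victim;
}
```

With this data, `select_victim(sample, 3, ANY_NODE)` picks pid 3095 (a.out, largest total), while `select_victim(sample, 3, 1)` picks pid 3805 (c.out, largest rss on Node 1), which matches the before and after tables.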