TLDR
----
If a mempolicy or cpuset is in effect, out_of_memory() will select a
victim on the specific node to kill, so that the kernel can avoid
accidental kills on NUMA systems.

Problem
-------
Before this patch series, the OOM killer only kills the process with the
highest memory usage, by selecting the process with the highest
oom_badness on the entire system.

This works fine on UMA systems, but may lead to accidental kills on NUMA
systems.

As shown below, if process c.out is bound to Node1 and keeps allocating
pages from Node1, a.out will be killed first. But killing a.out doesn't
free any memory on Node1, so c.out will be killed next.

Many AMD machines have 8 NUMA nodes. On these systems, the chance of
triggering this problem is greater.

OOM before patches:
```
Per-node process memory usage (in MBs)
PID             Node 0        Node 1          Total
-----------     ----------    -------------   --------
3095 a.out      3073.34       0.11            3073.45 (Killed first. Max mem usage)
3199 b.out      501.35        1500.00         2001.35
3805 c.out      1.52          (grow)2248.00   2249.52 (Killed then. Node1 is full)
-----------     ----------    -------------   --------
Total           3576.21       3748.11         7324.31
```

Solution
--------
We store per-node rss in mm_rss_stat for each process.

If a page allocation with a mempolicy or cpuset in effect triggers an
OOM, we calculate oom_badness with the rss counter for the corresponding
node, then select the process with the highest oom_badness on that node
to kill.

OOM after patches:
```
Per-node process memory usage (in MBs)
PID             Node 0        Node 1          Total
-----------     ----------    -------------   ----------
3095 a.out      3073.34       0.11            3073.45
3199 b.out      501.35        1500.00         2001.35
3805 c.out      1.52          (grow)2248.00   2249.52 (killed)
-----------     ----------    -------------   ----------
Total           3576.21       3748.11         7324.31
```

Overhead
--------
CPU:

According to the UnixBench results, the performance loss is less than
one percent in most cases.

Measured on a 40c512g machine.

40 parallel copies of tests:
+----------+----------+-----+----------+---------+---------+---------+
| numastat | FileCopy | ... |   Pipe   |  Fork   | syscall |  total  |
+----------+----------+-----+----------+---------+---------+---------+
| off      | 2920.24  | ... | 35926.58 | 6980.14 | 2617.18 | 8484.52 |
| on       | 2919.15  | ... | 36066.07 | 6835.01 | 2724.82 | 8461.24 |
| overhead | 0.04%    | ... | -0.39%   | 2.12%   | -3.95%  | 0.28%   |
+----------+----------+-----+----------+---------+---------+---------+

1 parallel copy of tests:
+----------+----------+-----+---------+--------+---------+---------+
| numastat | FileCopy | ... |  Pipe   |  Fork  | syscall |  total  |
+----------+----------+-----+---------+--------+---------+---------+
| off      | 1515.37  | ... | 1473.97 | 546.88 | 1152.37 | 1671.2  |
| on       | 1508.09  | ... | 1473.75 | 532.61 | 1148.83 | 1662.72 |
| overhead | 0.48%    | ... | 0.01%   | 2.68%  | 0.31%   | 0.51%   |
+----------+----------+-----+---------+--------+---------+---------+

MEM:

per task_struct:
	sizeof(int) * num_possible_nodes() + sizeof(int*)
	typically 4 * 2 + 8 bytes
per mm_struct:
	sizeof(atomic_long_t) * num_possible_nodes() + sizeof(atomic_long_t*)
	typically 8 * 2 + 8 bytes
zap_pte_range:
	sizeof(int) * num_possible_nodes() + sizeof(int*)
	typically 4 * 2 + 8 bytes

Changelog
----------
v2:
  - enable per numa node oom for `CONSTRAINT_CPUSET`.
  - add benchmark result in cover letter.

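For illustration only, the victim selection described under Solution can
be sketched as a small user-space C fragment. This is not the code from
this series: struct proc_rss, badness(), select_victim(), ANY_NODE and
MAX_NODES are hypothetical names, and badness is reduced to resident
memory in MB, whereas the real oom_badness() also accounts for swap,
page tables and oom_score_adj. The figures are the ones from the tables
in this letter.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 2     /* hypothetical: the example machine has 2 nodes */
#define ANY_NODE  (-1)  /* hypothetical sentinel: no NUMA constraint */

/* Hypothetical per-process record; not the kernel's mm_rss_stat layout. */
struct proc_rss {
	int pid;
	const char *comm;
	double node_mb[MAX_NODES];	/* resident memory per node, in MB */
};

/*
 * Simplified badness: resident memory only. With a node constraint,
 * only the rss on that node counts; otherwise the system-wide total.
 */
static double badness(const struct proc_rss *p, int node)
{
	double total = 0.0;

	if (node != ANY_NODE) {
		assert(node >= 0 && node < MAX_NODES);
		return p->node_mb[node];
	}
	for (int i = 0; i < MAX_NODES; i++)
		total += p->node_mb[i];
	return total;
}

/* Return the process with the highest badness, or NULL if n == 0. */
static const struct proc_rss *select_victim(const struct proc_rss *procs,
					    size_t n, int node)
{
	const struct proc_rss *victim = NULL;

	for (size_t i = 0; i < n; i++)
		if (!victim || badness(&procs[i], node) > badness(victim, node))
			victim = &procs[i];
	return victim;
}

/* Figures copied from the per-node usage tables. */
static const struct proc_rss example_procs[] = {
	{ 3095, "a.out", { 3073.34,    0.11 } },
	{ 3199, "b.out", {  501.35, 1500.00 } },
	{ 3805, "c.out", {    1.52, 2248.00 } },
};
```

With these figures, select_victim(example_procs, 3, ANY_NODE) picks
a.out (3095), which corresponds to the behavior before the patches,
while select_victim(example_procs, 3, 1) picks c.out (3805), the process
that actually fills Node1.
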
Gang Li (5):
  mm: add a new parameter `node` to `get/add/inc/dec_mm_counter`
  mm: add numa_count field for rss_stat
  mm: add numa fields for tracepoint rss_stat
  mm: enable per numa node rss_stat count
  mm, oom: enable per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}

 arch/s390/mm/pgtable.c        |   4 +-
 fs/exec.c                     |   2 +-
 fs/proc/base.c                |   6 +-
 fs/proc/task_mmu.c            |  14 ++--
 include/linux/mm.h            |  59 ++++++++++++-----
 include/linux/mm_types_task.h |  16 +++++
 include/linux/oom.h           |   2 +-
 include/trace/events/kmem.h   |  27 ++++++--
 kernel/events/uprobes.c       |   6 +-
 kernel/fork.c                 |  70 +++++++++++++++++++-
 mm/huge_memory.c              |  13 ++--
 mm/khugepaged.c               |   4 +-
 mm/ksm.c                      |   2 +-
 mm/madvise.c                  |   2 +-
 mm/memory.c                   | 119 ++++++++++++++++++++++++----------
 mm/migrate.c                  |   4 ++
 mm/migrate_device.c           |   2 +-
 mm/oom_kill.c                 |  69 +++++++++++++++-----
 mm/rmap.c                     |  19 +++---
 mm/swapfile.c                 |   6 +-
 mm/userfaultfd.c              |   2 +-
 21 files changed, 335 insertions(+), 113 deletions(-)

-- 
2.20.1