Michal Hocko wrote: > On Tue 05-12-17 23:48:21, David Rientjes wrote: > [...] > > I think this argues to do MMF_REAPING-style behavior at the beginning of > > exit_mmap() and avoid reaping all together once we have reached that > > point. There are no more users of the mm and we are in the process of > > tearing it down, I'm not sure that the oom reaper should be in the > > business with trying to interfere with that. Or are there actual bug > > reports where an oom victim gets wedged while in exit_mmap() prior to > > releasing its memory? > > Something like that seem to work indeed. But we should better understand > what is going on here before adding new oom reaper specific kludges. So > let's focus on getting more information from your crashes first. As of 968edbd93c0cbb40ab48aca972392d377713a0c3 on linux.git and using reproducer shown below, I got use after free bug which crashes the OOM reaper. ---------- #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <sched.h> #include <sys/mman.h> #define NUMTHREADS 128 #define MMAPSIZE 128 * 1048576 #define STACKSIZE 4096 static int pipe_fd[2] = { EOF, EOF }; static int memory_eater(void *unused) { int fd = open("/dev/zero", O_RDONLY); char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, EOF, 0); read(pipe_fd[0], buf, 1); read(fd, buf, MMAPSIZE); pause(); return 0; } int main(int argc, char *argv[]) { int i; char *stack; if (fork() || fork() || setsid() == EOF || pipe(pipe_fd)) _exit(0); stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, EOF, 0); for (i = 0; i < NUMTHREADS; i++) if (clone(memory_eater, stack + (i + 1) * STACKSIZE, /*CLONE_THREAD | CLONE_SIGHAND | */CLONE_VM | CLONE_FS | CLONE_FILES, NULL) == -1) break; sleep(1); close(pipe_fd[1]); pause(); return 0; } ---------- ---------- [ 100.740891] Out of memory: Kill process 1297 (a.out) score 668 or sacrifice child [ 100.746289] Killed process 1297 (a.out) total-vm:16781904kB, anon-rss:2124172kB, file-rss:0kB, shmem-rss:0kB [ 113.130943] ================================================================== [ 113.136627] BUG: KASAN: use-after-free in __oom_reap_task_mm+0x1ce/0x2a0 [ 113.141811] Read of size 8 at addr ffff880115144010 by task oom_reaper/17 [ 113.147505] [ 113.152112] CPU: 0 PID: 17 Comm: oom_reaper Not tainted 4.15.0-rc2+ #335 [ 113.157491] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 113.164088] Call Trace: [ 113.168691] dump_stack+0x7d/0xc0 [ 113.173176] print_address_description+0xc2/0x250 [ 113.178114] kasan_report+0x24a/0x360 [ 113.182736] ? __oom_reap_task_mm+0x1ce/0x2a0 [ 113.187562] __oom_reap_task_mm+0x1ce/0x2a0 [ 113.192339] ? rcu_read_unlock+0x60/0x60 [ 113.196957] ? find_held_lock+0xff/0x130 [ 113.201536] oom_reaper+0x108/0x240 [ 113.205939] ? wake_oom_reaper.part.16+0x60/0x60 [ 113.210575] ? pci_mmcfg_check_reserved+0xb0/0xb0 [ 113.215063] ? wait_woken+0x100/0x100 [ 113.219332] ? mark_held_locks+0x1b/0xb0 [ 113.223478] ? _raw_spin_unlock_irqrestore+0x2d/0x50 [ 113.227797] kthread+0x1c0/0x210 [ 113.231499] ? wake_oom_reaper.part.16+0x60/0x60 [ 113.235449] ? kthread_create_worker_on_cpu+0xc0/0xc0 [ 113.239624] ret_from_fork+0x24/0x30 [ 113.244269] [ 113.247570] Allocated by task 1296: [ 113.251019] kasan_kmalloc+0xa0/0xd0 [ 113.254414] kmem_cache_alloc+0xf4/0x1e0 [ 113.258214] copy_process.part.42+0x29a3/0x30c0 [ 113.261769] _do_fork+0x16e/0x700 [ 113.264792] do_syscall_64+0xe4/0x390 [ 113.267801] return_from_SYSCALL_64+0x0/0x75 [ 113.271002] [ 113.273394] Freed by task 1377: [ 113.276211] kasan_slab_free+0x71/0xc0 [ 113.279093] kmem_cache_free+0xaf/0x1e0 [ 113.281974] remove_vma+0x9d/0xb0 [ 113.284734] exit_mmap+0x179/0x250 [ 113.287651] mmput+0x7d/0x1b0 [ 113.290456] do_exit+0x408/0x1290 [ 113.293268] do_group_exit+0x84/0x140 [ 113.296109] get_signal+0x291/0x9b0 [ 113.298915] do_signal+0x8e/0xa70 [ 113.301637] exit_to_usermode_loop+0x71/0xb0 [ 113.304632] do_syscall_64+0x343/0x390 [ 113.307349] return_from_SYSCALL_64+0x0/0x75 [ 113.310205] [ 113.312388] The buggy address belongs to the object at ffff880115144008 [ 113.312388] which belongs to the cache vm_area_struct of size 200 [ 113.319286] The buggy address is located 8 bytes inside of [ 113.319286] 200-byte region [ffff880115144008, ffff8801151440d0) [ 113.325766] The buggy address belongs to the page: [ 113.328735] page:0000000057390752 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0 [ 113.332958] flags: 0x2fffff80008100(slab|head) [ 113.335835] raw: 002fffff80008100 0000000000000000 0000000000000000 00000001000e000e [ 113.339567] raw: ffffea0002ec44a0 ffffea000422d7a0 ffff8801170f13c0 0000000000000000 [ 113.343304] page dumped because: kasan: bad access detected [ 113.346506] [ 113.348706] Memory state around the buggy address: [ 113.351974] ffff880115143f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 113.355815] ffff880115143f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 113.359479] >ffff880115144000: fc fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 113.363128] ^ [ 113.365969] ffff880115144080: fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc [ 113.369672] ffff880115144100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 113.373402] ================================================================== [ 113.377127] Disabling lock debugging due to kernel taint [ 113.380915] ------------[ cut here ]------------ [ 113.383920] kernel BUG at mm/memory.c:1502! [ 113.386829] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN [ 113.390214] Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 coretemp ip6t_REJECT nf_reject_ipv6 xt_conntrack vmw_balloon pcspkr ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack sg iptable_mangle iptable_raw vmw_vmci shpchp i2c_piix4 ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod serio_raw pata_acpi vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mptspi ahci scsi_transport_spi drm libahci mptscsih ata_piix e1000 i2c_core mptbase libata [ 113.416884] CPU: 0 PID: 17 Comm: oom_reaper Tainted: G B 4.15.0-rc2+ #335 [ 113.421312] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 113.426490] RIP: 0010:unmap_page_range+0xd8b/0xdb0 [ 113.430267] RSP: 0018:ffff88011254fbc8 EFLAGS: 00010282 [ 113.434113] RAX: 0000000000000000 RBX: 1ffff100224a9fa8 RCX: 00007f8e4c99e000 [ 113.438504] RDX: ffff880115144cf8 RSI: 1ffff100224a9f94 RDI: ffff88011254fd60 [ 113.442915] RBP: ffff88011082e340 R08: 0000000000000000 R09: 0000000000000000 [ 113.447315] R10: ffffed0021d9cc00 R11: fffffbfff14eaeb4 R12: ffff880115144680 [ 113.451730] R13: ffff88010bc745c0 R14: ffff88011082e3f0 R15: ffff880115144688 [ 113.456150] FS: 0000000000000000(0000) GS:ffff880117600000(0000) knlGS:0000000000000000 [ 113.460905] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 113.465052] CR2: 00007fca30eb0000 CR3: 000000011f214003 CR4: 00000000000606f0 [ 113.469633] Call Trace: [ 113.473028] ? release_pages+0x46a/0x580 [ 113.476861] ? __put_compound_page+0x60/0x60 [ 113.480721] ? lru_add_drain_cpu+0xa2/0x1a0 [ 113.484569] ? lru_add_drain+0xc/0x10 [ 113.488257] ? free_pages_and_swap_cache+0x93/0x100 [ 113.492288] ? vm_normal_page_pmd+0x160/0x160 [ 113.496155] ? tlb_flush_mmu_free+0x73/0x80 [ 113.499954] ? arch_tlb_finish_mmu+0x68/0xa0 [ 113.503761] __oom_reap_task_mm+0x1c6/0x2a0 [ 113.507580] ? rcu_read_unlock+0x60/0x60 [ 113.511259] ? find_held_lock+0xff/0x130 [ 113.514925] oom_reaper+0x108/0x240 [ 113.518423] ? wake_oom_reaper.part.16+0x60/0x60 [ 113.522199] ? pci_mmcfg_check_reserved+0xb0/0xb0 [ 113.525933] ? wait_woken+0x100/0x100 [ 113.529393] ? mark_held_locks+0x1b/0xb0 [ 113.532923] ? _raw_spin_unlock_irqrestore+0x2d/0x50 [ 113.536738] kthread+0x1c0/0x210 [ 113.540043] ? wake_oom_reaper.part.16+0x60/0x60 [ 113.543673] ? kthread_create_worker_on_cpu+0xc0/0xc0 [ 113.547440] ret_from_fork+0x24/0x30 [ 113.550848] Code: 24 10 e8 29 75 01 00 e9 12 fd ff ff 48 8b bc 24 b0 00 00 00 e8 c7 74 01 00 e9 de f3 ff ff 48 89 cf e8 ba de ff ff e9 78 fc ff ff <0f> 0b 0f 0b 48 8b 7c 24 18 e8 47 75 01 00 e9 c2 fc ff ff e8 9d [ 113.561566] RIP: unmap_page_range+0xd8b/0xdb0 RSP: ffff88011254fbc8 [ 113.565852] ---[ end trace 80b64d1cae13d405 ]--- mmput+0x7d/0x1b0: __mmput at kernel/fork.c:925 (inlined by) mmput at kernel/fork.c:945 exit_mmap+0x179/0x250: exit_mmap at mm/mmap.c:3046 remove_vma+0x9d/0xb0: remove_vma at mm/mmap.c:178 kmem_cache_free+0xaf/0x1e0: slab_free at mm/slub.c:2973 (inlined by) kmem_cache_free at mm/slub.c:2990 ---------- What we overlooked is the fact that "it is not always the process which got ->signal->oom_mm set, it is any thread which called mmput() which invoked __mmput() path". Therefore, below patch fixes oops in my case. If some unrelated kernel thread was holding mm_users ref, it is possible that we miss down_write()/up_write() synchronization. ---------- diff --git a/mm/mmap.c b/mm/mmap.c index a4d5468..2dd813e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3020,7 +3020,7 @@ void exit_mmap(struct mm_struct *mm) unmap_vmas(&tlb, vma, 0, -1); set_bit(MMF_OOM_SKIP, &mm->flags); - if (unlikely(tsk_is_oom_victim(current))) { + if (1) { /* * Wait for oom_reap_task() to stop working on this * mm. Because MMF_OOM_SKIP is already set before ---------- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>