Hi,

I received some reports of mkfs.ext4 on an SD card triggering the OOM
killer on a swapless system with a low amount of free memory.  I have
made a reproducible setup of this by using null_blk with a large
completion delay.  The problem appears to be that we do not throttle
down to the level of vm_dirty_ratio in throttle_vm_writeout().

I configure null_blk like this:

  null_blk.irqmode=2 null_blk.completion_nsec=30000000 null_blk.queue_mode=1

and run the test under a qemu-kvm instance.  The kernel is 4.2-rc5
along with the patch I posted earlier today which fixes the initial
dirty limit after the cgroups changes
(https://lkml.org/lkml/2015/8/5/650).  The problem is not related to
the recent writeback cgroups changes though; earlier kernels show the
same behaviour.

/proc/meminfo looks like this just before the run of mkfs.ext4:

MemTotal:         197032 kB
MemFree:            4184 kB
MemAvailable:      14376 kB
Buffers:             332 kB
Cached:            64528 kB
SwapCached:            0 kB
Active:             2840 kB
Inactive:          62612 kB
Active(anon):        640 kB
Inactive(anon):    61448 kB
Active(file):       2200 kB
Inactive(file):     1164 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                28 kB
Writeback:             0 kB
AnonPages:           660 kB
Mapped:             2776 kB
Shmem:             61456 kB
Slab:              26580 kB
SReclaimable:      12884 kB
SUnreclaim:        13696 kB
KernelStack:         576 kB
PageTables:          220 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:       98516 kB
Committed_AS:      63424 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       68532 kB
VmallocChunk:   34359667708 kB
DirectMap4k:       16256 kB
DirectMap2M:      208896 kB
DirectMap1G:           0 kB

And mkfs.ext4 runs like this:

# mkfs.ext4 -F -O ^extent -E lazy_itable_init=0,lazy_journal_init=0 /dev/nullb0
mke2fs 1.42.13 (17-May-2015)
Creating filesystem with 65536000 4k blocks and 16384000 inodes
Filesystem UUID: 970d9572-67d4-4a1f-a4d0-8376c6235c6a
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872

Allocating group tables: done
Writing inode tables:
[    4.557641] mkfs.ext4 invoked oom-killer: gfp_mask=0x10200d0, order=0, oom_score_adj=0
[    4.558395] mkfs.ext4 cpuset=/ mems_allowed=0
[    4.558814] CPU: 0 PID: 667 Comm: mkfs.ext4 Not tainted 4.2.0-rc5+ #292
[    4.559391] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[    4.562285] Call Trace:
[    4.562509]  [<ffffffff81515c64>] dump_stack+0x4f/0x7b
[    4.562957]  [<ffffffff81513eb8>] dump_header.isra.9+0x76/0x38f
[    4.563474]  [<ffffffff8109d95d>] ? trace_hardirqs_on+0xd/0x10
[    4.564023]  [<ffffffff8151ca2a>] ? _raw_spin_unlock_irqrestore+0x4a/0x80
[    4.564614]  [<ffffffff81123cac>] oom_kill_process+0x38c/0x470
[    4.565190]  [<ffffffff81057cb5>] ? has_ns_capability_noaudit+0x5/0x160
[    4.573104]  [<ffffffff8107bc51>] ? get_parent_ip+0x11/0x50
[    4.573608]  [<ffffffff81124145>] ? out_of_memory+0x355/0x420
[    4.574143]  [<ffffffff811241a7>] out_of_memory+0x3b7/0x420
[    4.574634]  [<ffffffff81128dc0>] ? __alloc_pages_nodemask+0x810/0xb10
[    4.575203]  [<ffffffff81128fc6>] __alloc_pages_nodemask+0xa16/0xb10
[    4.575755]  [<ffffffff8111fd91>] pagecache_get_page+0x101/0x1c0
[    4.576287]  [<ffffffff811a2cf0>] ? I_BDEV+0x20/0x20
[    4.576722]  [<ffffffff8112092d>] grab_cache_page_write_begin+0x2d/0x50
[    4.577298]  [<ffffffff811a114d>] block_write_begin+0x2d/0x80
[    4.577809]  [<ffffffff811a34f5>] ? blkdev_write_begin+0x5/0x30
[    4.578325]  [<ffffffff811a3513>] blkdev_write_begin+0x23/0x30
[    4.578830]  [<ffffffff81120abf>] generic_perform_write+0xaf/0x1b0
[    4.579368]  [<ffffffff81121c60>] __generic_file_write_iter+0x190/0x1f0
[    4.579975]  [<ffffffff811a3b28>] blkdev_write_iter+0x78/0x100
[    4.580497]  [<ffffffff81167daa>] __vfs_write+0xaa/0xe0
[    4.580984]  [<ffffffff811681a7>] vfs_write+0x97/0x100
[    4.581434]  [<ffffffff811689a7>] SyS_pwrite64+0x77/0x90
[    4.581923]  [<ffffffff8151d3ae>] entry_SYSCALL_64_fastpath+0x12/0x76
[    4.585984] Mem-Info:
[    4.586558] active_anon:207 inactive_anon:15363 isolated_anon:0
[    4.586558]  active_file:295 inactive_file:347 isolated_file:0
[    4.586558]  unevictable:0 dirty:0 writeback:0 unstable:0
[    4.586558]  slab_reclaimable:3260 slab_unreclaimable:3452
[    4.586558]  mapped:288 shmem:15363 pagetables:56 bounce:0
[    4.586558]  free:1112 free_pcp:32 free_cma:0
[    4.589928] DMA free:1032kB min:144kB low:180kB high:216kB active_anon:52kB inactive_anon:5920kB active_file:48kB inactive_file:128kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:4kB writeback:0kB mapped:88kB shmem:5920kB slab_reclaimable:164kB slab_unreclaimable:648kB kernel_stack:16kB pagetables:28kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[    4.594056] lowmem_reserve[]: 0 173 173 173
[    4.596036] DMA32 free:3136kB min:1608kB low:2008kB high:2412kB active_anon:776kB inactive_anon:55532kB active_file:1132kB inactive_file:1380kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:208768kB managed:181124kB mlocked:0kB dirty:0kB writeback:0kB mapped:1176kB shmem:55532kB slab_reclaimable:12876kB slab_unreclaimable:13160kB kernel_stack:592kB pagetables:196kB unstable:0kB bounce:0kB free_pcp:140kB local_pcp:140kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[    4.600369] lowmem_reserve[]: 0 0 0 0
[    4.601393] DMA: 14*4kB (UM) 20*8kB (UM) 20*16kB (UM) 13*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1016kB
[    4.602710] DMA32: 44*4kB (UEM) 128*8kB (UEM) 44*16kB (UEM) 24*32kB (UEM) 3*64kB (M) 2*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3120kB
[    4.604158] 16088 total pagecache pages
[    4.604509] 56190 pages RAM
[    4.604768] 0 pages HighMem/MovableOnly
[    4.605148] 6932 pages reserved
[    4.605437] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[    4.606238] [   75]     0    75     1134       16       8       3        0             0 klogd
[    4.607012] [  140]     0   140     2667      140      10       3        0             0 dropbear
[    4.607809] [  141]     0   141     1660      176       8       3        0             0 sh
[    4.608737] [  661]     0   661     1134      167       8       3        0             0 exe
[    4.611211] [  667]     0   667     4271      321      14       3        0             0 mkfs.ext4
[    4.612198] Out of memory: Kill process 667 (mkfs.ext4) score 6 or sacrifice child
[    4.614630] Killed process 667 (mkfs.ext4) total-vm:17084kB, anon-rss:184kB, file-rss:1100kB
Killed

The last balance_dirty_pages event shows that the task is within its
ratelimit and so is not paused there:

mkfs.ext4-664 [000] 4.773565: balance_dirty_pages: bdi 253:0: limit=9233 setpoint=4685 dirty=576 bdi_setpoint=4532 bdi_dirty=576 dirty_ratelimit=102400 task_ratelimit=181400 dirtied=58 dirtied_pause=512 paused=0 pause=-32 period=0 think=32
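To see why no pause happens, this is the relevant part of
balance_dirty_pages() (abridged from mm/page-writeback.c and quoted
from memory; the annotations are mine, using the numbers from the
trace line above and assuming HZ=250 for this config -- the tracepoint
reports the times converted to milliseconds):

	/* period = 250 * 58 / 181400 = 0 jiffies ("period=0"): the task
	 * has dirtied far fewer pages than its ratelimit allows */
	period = HZ * pages_dirtied / task_ratelimit;
	pause = period;
	/* subtract the think time since the last pause, 8 jiffies = 32ms
	 * here ("think=32"), which makes pause negative ("pause=-32") */
	if (current->dirty_paused_when)
		pause -= now - current->dirty_paused_when;
	if (pause < min_pause) {
		/* ... update the pause bookkeeping and break out
		 * without sleeping ("paused=0") ... */
		break;
	}

So from balance_dirty_pages()'s point of view everything is fine, and
any throttling would have to come from somewhere else.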
Then the page allocator gets called and starts direct reclaim:

mkfs.ext4-664 [000] 4.773565: function: blkdev_write_begin <-- generic_perform_write
mkfs.ext4-664 [000] 4.773566: function: __alloc_pages_nodemask <-- pagecache_get_page
mkfs.ext4-664 [000] 4.773566: function: __alloc_pages_direct_compact <-- __alloc_pages_nodemask
mkfs.ext4-664 [000] 4.773566: function: try_to_free_pages <-- __alloc_pages_nodemask
mkfs.ext4-664 [000] 4.773567: mm_vmscan_direct_reclaim_begin: order=0 may_writepage=1 gfp_flags=GFP_USER

However, even though NR_WRITEBACK is much larger than thresh,
congestion_wait() is never called, because throttle_vm_writeout()
compares against the much higher domain->dirty_limit (limit) rather
than the vm_dirty_ratio-based threshold (thresh).  In the trace below,
writeback=573 pages is well above thresh=183 but nowhere near
limit=9233, so the check passes and no throttling happens:

mkfs.ext4-664 [000] 4.773567: function: throttle_vm_writeout <-- shrink_lruvec.isra.59
mkfs.ext4-664 [000] 4.773567: global_dirty_state: dirty=3 writeback=573 unstable=0 bg_thresh=91 thresh=183 limit=9233 dirtied=3465 written=2889
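For reference, the 4.2 version of throttle_vm_writeout() looks roughly
like this (mm/page-writeback.c, abridged; the first comment inside the
loop is my annotation, the rest is from the source):

void throttle_vm_writeout(gfp_t gfp_mask)
{
	unsigned long background_thresh;
	unsigned long dirty_thresh;

	for ( ; ; ) {
		global_dirty_limits(&background_thresh, &dirty_thresh);
		/*
		 * This replaces the vm_dirty_ratio-based thresh (183 pages
		 * in the trace above) with max(thresh, dom->dirty_limit),
		 * i.e. the domain's dirty_limit of 9233 pages, which only
		 * tracks down towards thresh gradually.
		 */
		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);

		/*
		 * Boost the allowable dirty threshold a bit for page
		 * allocators so they don't get DoS'ed by heavy writers
		 */
		dirty_thresh += dirty_thresh / 10;      /* wheeee... */

		if (global_page_state(NR_UNSTABLE_NFS) +
			global_page_state(NR_WRITEBACK) <= dirty_thresh)
				break;
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/*
		 * The caller might hold locks which can prevent IO completion
		 * of some pages under writeback.  Trigger some random IO
		 * anyway.
		 */
		if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO))
			break;
	}
}

With NR_WRITEBACK at 573 pages and dirty_thresh boosted to roughly
10156 pages, the break is always taken on the first iteration.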
None of the wait_iff_congested() calls have any effect either, since
the bdi is never detected as congested (queue/nr_requests is the
default of 128, and fewer than 30 requests are ever on the queue):

mkfs.ext4-664 [000] 4.773583: function: wait_iff_congested <-- shrink_inactive_list
mkfs.ext4-664 [000] 4.773583: writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
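The block layer only marks a queue congested once the number of
allocated requests approaches nr_requests; see
blk_queue_congestion_threshold() in block/blk-core.c (quoted from
memory, annotations mine):

static void blk_queue_congestion_threshold(struct request_queue *q)
{
	int nr;

	/* congested once more than ~7/8 of the requests are in use:
	 * 128 - 16 + 1 = 113 for the default nr_requests of 128 */
	nr = q->nr_requests - (q->nr_requests / 8) + 1;
	if (nr > q->nr_requests)
		nr = q->nr_requests;
	q->nr_congestion_on = nr;

	/* uncongested again below 128 - 16 - 8 - 1 = 103 requests */
	nr = q->nr_requests - (q->nr_requests / 8) - (q->nr_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	q->nr_congestion_off = nr;
}

So with fewer than 30 requests in flight we never get anywhere near
the 113-request congestion threshold, and wait_iff_congested() returns
immediately.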
try_to_free_pages() also keeps waking up the flusher threads:

mkfs.ext4-664 [000] 4.776026: kmalloc: (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d748498 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
mkfs.ext4-664 [000] 4.776026: writeback_queue: bdi 254:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
mkfs.ext4-664 [000] 4.776032: kmalloc: (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d749188 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
mkfs.ext4-664 [000] 4.776033: writeback_queue: bdi 253:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
mkfs.ext4-664 [000] 4.776040: kmalloc: (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d749620 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
mkfs.ext4-664 [000] 4.776040: writeback_queue: bdi 254:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
mkfs.ext4-664 [000] 4.776047: kmalloc: (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74a498 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
mkfs.ext4-664 [000] 4.776047: writeback_queue: bdi 253:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
mkfs.ext4-664 [000] 4.776054: kmalloc: (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74a620 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
mkfs.ext4-664 [000] 4.776054: writeback_queue: bdi 254:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
mkfs.ext4-664 [000] 4.776061: kmalloc: (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74b310 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
mkfs.ext4-664 [000] 4.776061: writeback_queue: bdi 253:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
mkfs.ext4-664 [000] 4.776068: kmalloc: (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74b620 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO

All to no avail, since we never wait for any of these writes to
complete, and we end up OOMing after a few rounds of direct reclaim:

mkfs.ext4-664 [000] 4.776174: mm_vmscan_direct_reclaim_end: nr_reclaimed=0
mkfs.ext4-664 [000] 4.776192: function: out_of_memory <-- __alloc_pages_nodemask

(A full ftrace log is available at
https://drive.google.com/file/d/0B4tMLbMvJ-l6MndXbG5wQmZrTm8/view)

The problem seems to be that throttle_vm_writeout() does not respect
the configured vm_dirty_ratio.  With the following change, mkfs.ext4
is able to complete successfully every time in this scenario without
triggering the OOM killer:

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5cccc12..47f1d09 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1916,7 +1920,6 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 
 	for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
-		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
 
 		/*
 		 * Boost the allowable dirty threshold a bit for page

But this is a revert of the following commit, and presumably
reintroduces the problem for the use case described there:

commit 47a133339c332f9f8e155c70f5da401aded69948
Author: Fengguang Wu <fengguang.wu@xxxxxxxxx>
Date:   Wed Mar 21 16:34:09 2012 -0700

    mm: use global_dirty_limit in throttle_vm_writeout()

    When starting a memory hog task, a desktop box w/o swap is found to
    go unresponsive for a long time.  It's solely caused by lots of
    congestion waits in throttle_vm_writeout():

    ...

    The root cause is, the dirty threshold is knocked down a lot by the
    memory hog task.  Fixed by using global_dirty_limit which decreases
    gradually on such events and can guarantee we stay above (the also
    decreasing) nr_dirty in the progress of following down to the new
    dirty threshold.

Comments?

Thanks,
/Rabin