Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for too long") is a great step toward reducing the possibility of silent hangups caused by memory allocation stalls.
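For reference, the essence of that commit is a timeout check inside the retry loop of __alloc_pages_slowpath(). Quoting roughly from memory of the 4.9-rc1 source (the exact context may differ slightly):

----------
	unsigned long alloc_start = jiffies;
	unsigned int stall_timeout = 10 * HZ;
	(...)
	/* Make sure we know about allocations which stall for too long */
	if (time_after(jiffies, alloc_start + stall_timeout)) {
		warn_alloc(gfp_mask,
			"page allocation stalls for %ums, order:%u",
			jiffies_to_msecs(jiffies - alloc_start), order);
		stall_timeout += 10 * HZ;
	}
----------

Note that this check fires only while the stalling task keeps iterating inside the page allocator; a task sleeping somewhere else never reaches it, which is the limitation I come back to below.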
For example, below is a report where a write() request got stuck because it could not invoke the OOM killer, due to being a GFP_NOFS allocation request.

---------- From http://I-love.SAKURA.ne.jp/tmp/serial-20161017-xfs-loop.txt.xz ----------
[ 351.824548] oom_reaper: reaped process 4727 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 362.309509] warn_alloc: 96 callbacks suppressed
(...snipped...)
[ 707.833650] a.out: page allocation stalls for 370009ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[ 707.833653] CPU: 3 PID: 4746 Comm: a.out Tainted: G W 4.9.0-rc1+ #80
[ 707.833653] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 707.833656] ffffc90006d27950 ffffffff812e9777 ffffffff8197c438 0000000000000001
[ 707.833657] ffffc90006d279d8 ffffffff8112a114 0342004a7fffd720 ffffffff8197c438
[ 707.833658] ffffc90006d27978 ffffffff00000010 ffffc90006d279e8 ffffc90006d27998
[ 707.833658] Call Trace:
[ 707.833662] [<ffffffff812e9777>] dump_stack+0x4f/0x68
[ 707.833665] [<ffffffff8112a114>] warn_alloc+0x144/0x160
[ 707.833666] [<ffffffff8112aad6>] __alloc_pages_nodemask+0x936/0xe80
[ 707.833670] [<ffffffff81177f07>] alloc_pages_current+0x87/0x110
[ 707.833672] [<ffffffff8111f33c>] __page_cache_alloc+0xdc/0x120
[ 707.833673] [<ffffffff8111fe58>] pagecache_get_page+0x88/0x2b0
[ 707.833675] [<ffffffff81120f5b>] grab_cache_page_write_begin+0x1b/0x40
[ 707.833677] [<ffffffff812036ab>] iomap_write_begin+0x4b/0x100
[ 707.833678] [<ffffffff81203932>] iomap_write_actor+0xb2/0x190
[ 707.833680] [<ffffffff81285dcb>] ? xfs_trans_commit+0xb/0x10
[ 707.833681] [<ffffffff81203880>] ? iomap_write_end+0x70/0x70
[ 707.833682] [<ffffffff81203f5e>] iomap_apply+0xae/0x130
[ 707.833683] [<ffffffff81204043>] iomap_file_buffered_write+0x63/0xa0
[ 707.833684] [<ffffffff81203880>] ? iomap_write_end+0x70/0x70
[ 707.833686] [<ffffffff8126bd0f>] xfs_file_buffered_aio_write+0xcf/0x1f0
[ 707.833689] [<ffffffff816152a8>] ? _raw_spin_lock_irqsave+0x18/0x40
[ 707.833690] [<ffffffff81615053>] ? _raw_spin_unlock_irqrestore+0x13/0x30
[ 707.833692] [<ffffffff8126beb5>] xfs_file_write_iter+0x85/0x120
[ 707.833694] [<ffffffff811a802d>] __vfs_write+0xdd/0x140
[ 707.833695] [<ffffffff811a8c7d>] vfs_write+0xad/0x1a0
[ 707.833697] [<ffffffff810021f0>] ? syscall_trace_enter+0x1b0/0x240
[ 707.833698] [<ffffffff811aa090>] SyS_write+0x50/0xc0
[ 707.833700] [<ffffffff811d6b78>] ? do_fsync+0x38/0x60
[ 707.833701] [<ffffffff8100241c>] do_syscall_64+0x5c/0x170
[ 707.833702] [<ffffffff81615786>] entry_SYSCALL64_slow_path+0x25/0x25
[ 707.833703] Mem-Info:
[ 707.833706] active_anon:451061 inactive_anon:2097 isolated_anon:0
[ 707.833706]  active_file:13 inactive_file:115 isolated_file:27
[ 707.833706]  unevictable:0 dirty:80 writeback:1 unstable:0
[ 707.833706]  slab_reclaimable:3291 slab_unreclaimable:21028
[ 707.833706]  mapped:416 shmem:2162 pagetables:3734 bounce:0
[ 707.833706]  free:13182 free_pcp:125 free_cma:0
[ 707.833708] Node 0 active_anon:1804244kB inactive_anon:8388kB active_file:52kB inactive_file:460kB unevictable:0kB isolated(anon):0kB isolated(file):108kB mapped:1664kB dirty:320kB writeback:4kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1472512kB writeback_tmp:0kB unstable:0kB pages_scanned:1255 all_unreclaimable? yes
[ 707.833710] Node 0 DMA free:8192kB min:352kB low:440kB high:528kB active_anon:7656kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 707.833712] lowmem_reserve[]: 0 1963 1963 1963
[ 707.833714] Node 0 DMA32 free:44536kB min:44700kB low:55872kB high:67044kB active_anon:1796588kB inactive_anon:8388kB active_file:52kB inactive_file:460kB unevictable:0kB writepending:324kB present:2080640kB managed:2010816kB mlocked:0kB slab_reclaimable:13164kB slab_unreclaimable:84080kB kernel_stack:7312kB pagetables:14912kB bounce:0kB free_pcp:500kB local_pcp:168kB free_cma:0kB
[ 707.833715] lowmem_reserve[]: 0 0 0 0
[ 707.833720] Node 0 DMA: 4*4kB (M) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (M) 1*1024kB (U) 1*2048kB (M) 1*4096kB (M) = 8192kB
[ 707.833725] Node 0 DMA32: 4*4kB (UME) 41*8kB (MEH) 692*16kB (UME) 653*32kB (UME) 135*64kB (UMEH) 16*128kB (UMH) 2*256kB (H) 0*512kB 1*1024kB (H) 0*2048kB 0*4096kB = 44536kB
[ 707.833726] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 707.833727] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 707.833727] 2317 total pagecache pages
[ 707.833728] 0 pages in swap cache
[ 707.833729] Swap cache stats: add 0, delete 0, find 0/0
[ 707.833729] Free swap = 0kB
[ 707.833729] Total swap = 0kB
[ 707.833730] 524157 pages RAM
[ 707.833730] 0 pages HighMem/MovableOnly
[ 707.833730] 17477 pages reserved
[ 707.833730] 0 pages hwpoisoned
---------- From http://I-love.SAKURA.ne.jp/tmp/serial-20161017-xfs-loop.txt.xz ----------
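The reason the GFP_NOFS request above cannot invoke the OOM killer is the bail-out near the top of out_of_memory() in mm/oom_kill.c; again roughly from memory of the 4.9-rc1 source (the in-tree comment differs):

----------
	/*
	 * The OOM killer does not compensate for IO-less reclaim; a
	 * request without __GFP_FS (and without __GFP_NOFAIL) pretends
	 * that progress was made instead of killing anything.
	 */
	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS | __GFP_NOFAIL)))
		return true;
----------

Thus the GFP_NOFS writer shown above can neither reclaim enough memory by itself nor invoke the OOM killer; all it can do is loop, and the stall warning is the only hint we get.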
But that commit does not cover all possibilities caused by memory allocation stalls. For example, without the patch below,

----------
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 744f926..bbd0769 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1554,7 +1554,7 @@ int isolate_lru_page(struct page *page)
  * the LRU list will go small and be scanned faster than necessary, leading to
  * unnecessary swapping, thrashing and OOM.
  */
-static int too_many_isolated(struct pglist_data *pgdat, int file,
+static long too_many_isolated(struct pglist_data *pgdat, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
@@ -1581,7 +1581,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
 		inactive >>= 3;
 
-	return isolated > inactive;
+	return isolated - inactive;
 }
 
 static noinline_for_stack void
@@ -1697,11 +1697,25 @@ static bool inactive_reclaimable_pages(struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	unsigned long wait_start = jiffies;
+	unsigned int wait_timeout = 10 * HZ;
+	long last_diff = 0;
+	long diff;
 
 	if (!inactive_reclaimable_pages(lruvec, sc, lru))
 		return 0;
 
-	while (unlikely(too_many_isolated(pgdat, file, sc))) {
+	while (unlikely((diff = too_many_isolated(pgdat, file, sc)) > 0)) {
+		if (diff < last_diff) {
+			wait_start = jiffies;
+			wait_timeout = 10 * HZ;
+		} else if (time_after(jiffies, wait_start + wait_timeout)) {
+			warn_alloc(sc->gfp_mask,
+				   "shrink_inactive_list() stalls for %ums",
+				   jiffies_to_msecs(jiffies - wait_start));
+			wait_timeout += 10 * HZ;
+		}
+		last_diff = diff;
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
----------

we cannot report an OOM livelock (shown below) where all __GFP_DIRECT_RECLAIM allocation requests got stuck at too_many_isolated() in shrink_inactive_list(), waiting for kswapd, which itself got stuck waiting for a lock.

---------- From http://I-love.SAKURA.ne.jp/tmp/serial-20161017-shrink-loop.txt.xz ----------
[ 853.591933] oom_reaper: reaped process 7091 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[ 888.994101] a.out: shrink_inactive_list() stalls for 10032ms, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[ 888.996601] CPU: 2 PID: 7107 Comm: a.out Tainted: G W 4.9.0-rc1+ #80
[ 888.998543] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 889.001075] ffffc90007ecf788 ffffffff812e9777 ffffffff8197ce18 0000000000000000
[ 889.003140] ffffc90007ecf810 ffffffff8112a114 024201ca00000064 ffffffff8197ce18
[ 889.005216] ffffc90007ecf7b0 ffffc90000000010 ffffc90007ecf820 ffffc90007ecf7d0
[ 889.007289] Call Trace:
[ 889.008295] [<ffffffff812e9777>] dump_stack+0x4f/0x68
[ 889.009842] [<ffffffff8112a114>] warn_alloc+0x144/0x160
[ 889.011389] [<ffffffff810a4b40>] ? wake_up_bit+0x30/0x30
[ 889.012956] [<ffffffff81137af3>] shrink_inactive_list+0x593/0x5a0
[ 889.014659] [<ffffffff81138389>] shrink_node_memcg+0x509/0x7b0
[ 889.016330] [<ffffffff811ab200>] ? super_cache_count+0x30/0xd0
[ 889.018008] [<ffffffff8113870c>] shrink_node+0xdc/0x320
[ 889.019564] [<ffffffff81138c56>] do_try_to_free_pages+0xd6/0x330
[ 889.021276] [<ffffffff81138f6b>] try_to_free_pages+0xbb/0xf0
[ 889.022937] [<ffffffff8112a8b6>] __alloc_pages_nodemask+0x716/0xe80
[ 889.024684] [<ffffffff812c2197>] ? blk_finish_plug+0x27/0x40
[ 889.026322] [<ffffffff812efa04>] ? __radix_tree_lookup+0x84/0xf0
[ 889.028019] [<ffffffff81177f07>] alloc_pages_current+0x87/0x110
[ 889.029706] [<ffffffff8111f33c>] __page_cache_alloc+0xdc/0x120
[ 889.031392] [<ffffffff81123233>] filemap_fault+0x333/0x570
[ 889.033026] [<ffffffff8126b519>] xfs_filemap_fault+0x39/0x60
[ 889.034668] [<ffffffff8114f774>] __do_fault+0x74/0x180
[ 889.036218] [<ffffffff811559f2>] handle_mm_fault+0xe82/0x1660
[ 889.037878] [<ffffffff8104da40>] __do_page_fault+0x180/0x550
[ 889.039493] [<ffffffff8104de31>] do_page_fault+0x21/0x70
[ 889.040963] [<ffffffff81002525>] ? do_syscall_64+0x165/0x170
[ 889.042526] [<ffffffff81616db2>] page_fault+0x22/0x30
[ 889.044118] a.out: shrink_inactive_list() stalls for 10082ms
[ 889.044789] Mem-Info:
[ 889.044793] active_anon:390112 inactive_anon:3030 isolated_anon:0
[ 889.044793]  active_file:63 inactive_file:66 isolated_file:32
[ 889.044793]  unevictable:0 dirty:5 writeback:3 unstable:0
[ 889.044793]  slab_reclaimable:3306 slab_unreclaimable:17523
[ 889.044793]  mapped:1012 shmem:4210 pagetables:2823 bounce:0
[ 889.044793]  free:13235 free_pcp:31 free_cma:0
[ 889.044796] Node 0 active_anon:1560448kB inactive_anon:12120kB active_file:252kB inactive_file:264kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:4048kB dirty:20kB writeback:12kB shmem:16840kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1095680kB writeback_tmp:0kB unstable:0kB pages_scanned:995 all_unreclaimable? yes
[ 889.044799] Node 0 DMA free:7208kB min:404kB low:504kB high:604kB active_anon:8456kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:224kB kernel_stack:16kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 889.044799] lowmem_reserve[]: 0 1707 1707 1707
[ 889.044803] Node 0 DMA32 free:45732kB min:44648kB low:55808kB high:66968kB active_anon:1551992kB inactive_anon:12120kB active_file:252kB inactive_file:264kB unevictable:0kB writepending:32kB present:2080640kB managed:1748672kB mlocked:0kB slab_reclaimable:13224kB slab_unreclaimable:69868kB kernel_stack:5584kB pagetables:11292kB bounce:0kB free_pcp:124kB local_pcp:0kB free_cma:0kB
[ 889.044803] lowmem_reserve[]: 0 0 0 0
[ 889.044805] Node 0 DMA: 0*4kB 1*8kB (M) 4*16kB (UM) 5*32kB (UM) 9*64kB (UM) 4*128kB (UM) 3*256kB (U) 4*512kB (UM) 1*1024kB (M) 1*2048kB (E) 0*4096kB = 7208kB
[ 889.044811] Node 0 DMA32: 873*4kB (UME) 1010*8kB (UMEH) 717*16kB (UMEH) 445*32kB (UMEH) 100*64kB (UMH) 8*128kB (UMH) 0*256kB 0*512kB 1*1024kB (H) 0*2048kB 0*4096kB = 45732kB
[ 889.044817] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 889.044818] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 889.044818] 4379 total pagecache pages
[ 889.044820] 0 pages in swap cache
[ 889.044820] Swap cache stats: add 0, delete 0, find 0/0
[ 889.044821] Free swap = 0kB
[ 889.044821] Total swap = 0kB
[ 889.044821] 524157 pages RAM
[ 889.044821] 0 pages HighMem/MovableOnly
[ 889.044822] 83013 pages reserved
[ 889.044822] 0 pages hwpoisoned
(...snipped...)
[ 939.150914] INFO: task kswapd0:60 blocked for more than 60 seconds.
[ 939.152922] Tainted: G W 4.9.0-rc1+ #80
[ 939.154891] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 939.157296] kswapd0 D ffffffff816111f7 0 60 2 0x00000000
[ 939.159659] ffff88007a8c9438 ffff880077ff34c0 ffff88007ac93c40 ffff88007a8c8f40
[ 939.162131] ffff88007f816d80 ffffc9000053b780 ffffffff816111f7 000000009b1d7cf9
[ 939.164582] ffff88007a8c8f40 ffff8800776fda18 ffffc9000053b7b0 ffff8800776fda30
[ 939.167053] Call Trace:
[ 939.168450] [<ffffffff816111f7>] ? __schedule+0x177/0x550
[ 939.170417] [<ffffffff8161160b>] schedule+0x3b/0x90
[ 939.172285] [<ffffffff81614064>] rwsem_down_read_failed+0xf4/0x160
[ 939.174411] [<ffffffff812bf7ec>] ? get_request+0x43c/0x770
[ 939.176429] [<ffffffff812f6818>] call_rwsem_down_read_failed+0x18/0x30
[ 939.178615] [<ffffffff816133c2>] down_read+0x12/0x30
[ 939.180544] [<ffffffff81277dae>] xfs_ilock+0x3e/0xa0
[ 939.182427] [<ffffffff81261a70>] xfs_map_blocks+0x80/0x180
[ 939.184415] [<ffffffff81262bd8>] xfs_do_writepage+0x1c8/0x710
[ 939.186458] [<ffffffff81261ec9>] ? xfs_setfilesize_trans_alloc.isra.31+0x39/0x90
[ 939.189711] [<ffffffff81263156>] xfs_vm_writepage+0x36/0x70
[ 939.192094] [<ffffffff81134a47>] pageout.isra.42+0x1a7/0x2b0
[ 939.194191] [<ffffffff81136b47>] shrink_page_list+0x7c7/0xb70
[ 939.196254] [<ffffffff81137798>] shrink_inactive_list+0x238/0x5a0
[ 939.198541] [<ffffffff81138389>] shrink_node_memcg+0x509/0x7b0
[ 939.200611] [<ffffffff8113870c>] shrink_node+0xdc/0x320
[ 939.202522] [<ffffffff8113949a>] kswapd+0x2ca/0x620
[ 939.204271] [<ffffffff811391d0>] ? mem_cgroup_shrink_node+0xb0/0xb0
[ 939.206243] [<ffffffff81081234>] kthread+0xd4/0xf0
[ 939.207904] [<ffffffff81081160>] ? kthread_park+0x60/0x60
[ 939.209848] [<ffffffff81615922>] ret_from_fork+0x22/0x30
---------- From http://I-love.SAKURA.ne.jp/tmp/serial-20161017-shrink-loop.txt.xz ----------

This means that, unless we scatter warn_alloc() calls to every location that might depend on somebody else making forward progress, we may fail to get a clue about such livelocks. But the code will look messy if warn_alloc() calls are scattered around, and concurrent warn_alloc() calls become more likely to race with one another. Even if we guard each warn_alloc() call with mutex_lock(&oom_lock)/mutex_unlock(&oom_lock) (sketched below), messages from the hung task watchdog and messages from warn_alloc() will still race. There is an alternative approach, which I proposed at http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-kernel@xxxxxxxxxxxxxxxxxxx, that serializes both the hung task watchdog and warn_alloc().

So, how can we make warn_alloc() reliable?
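For concreteness, the oom_lock guard mentioned above would be something like the hypothetical sketch below (alloc_start as in the stall check quoted earlier). It keeps warn_alloc() callers from interleaving with each other and with OOM killer output, but nothing prevents the hung task watchdog from printk()ing in the middle of it:

----------
	/* Hypothetical guard; not from any posted patch. */
	mutex_lock(&oom_lock);
	warn_alloc(gfp_mask,
		"page allocation stalls for %ums, order:%u",
		jiffies_to_msecs(jiffies - alloc_start), order);
	mutex_unlock(&oom_lock);
----------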