Jan Stancek wrote:
> On 01/22/2016 04:14 PM, Jan Stancek wrote:
> > On 01/19/2016 11:29 AM, Tetsuo Handa wrote:
> >> although I couldn't find evidence that mlock() and madvise() are
> >> related with this hangup,
> >
> > I simplified the reproducer by having only a single thread allocating
> > memory when the OOM killer triggers:
> > http://jan.stancek.eu/tmp/oom_hangs/console.log.3-v4.4-8606-with-memalloc.txt
> >
> > In this instance it was mmap + mlock, as you can see from the oom call
> > trace. It made it to do_exit(), but couldn't complete it:
>
> I have extracted the test from LTP into a standalone reproducer (attached),
> if you want to give it a try. It usually hangs my system within ~30
> minutes. If it takes too long, you can try disabling swap. From my past
> experience this usually helped to reproduce it faster on small KVM guests.
>
> # gcc oom_mlock.c -pthread -O2
> # echo 1 > /proc/sys/vm/overcommit_memory
> (optionally) # swapoff -a
> # ./a.out
>
> Also, it's interesting to note that when I disabled the mlock() calls,
> the test ran fine overnight. I'll look into confirming this observation
> on more systems.

Thank you for the reproducer. I tried it with
----------
--- oom_mlock.c
+++ oom_mlock.c
@@ -33,7 +33,7 @@
        if (s == MAP_FAILED)
                return errno;

-       if (do_mlock) {
+       if (0 && do_mlock) {
                while (mlock(s, length) == -1 && loop > 0) {
                        if (EAGAIN != errno)
                                return errno;
----------
applied (i.e. with the mlock() calls disabled) on a VM with 4 CPUs /
5120MB RAM, and successfully reproduced a livelock. Therefore, I think
mlock() is irrelevant.

What I observed is that while the disk_events_workfn workqueue item was
looping, "Node 0 Normal free:" remained smaller than "min:", but the
per-order sum on the "Node 0 Normal:" buddy line was larger than
"Node 0 Normal free:". Is this difference caused by the pending
vmstat_update, vmstat_shepherd and vmpressure_work_fn work items? Can we
somehow check how long these workqueue items remained pending?

The complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160126.txt.xz .
----------
[ 312.018243] Out of memory: Kill process 9785 (a.out) score 937 or sacrifice child
[ 312.020469] Killed process 9785 (a.out) total-vm:6314312kB, anon-rss:4637568kB, file-rss:0kB
[ 323.867935] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
[ 323.870075] MemAlloc: kworker/1:8(9698) seq=10 gfp=0x2400000 order=0 delay=9828
[ 323.872259] kworker/1:8 R running task 0 9698 2 0x00000080
[ 323.874452] Workqueue: events_freezable_power_ disk_events_workfn
(...snipped...)
[ 324.194104] Mem-Info:
[ 324.195095] active_anon:1165831 inactive_anon:3486 isolated_anon:0
[ 324.195095]  active_file:1 inactive_file:0 isolated_file:0
[ 324.195095]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 324.195095]  slab_reclaimable:1906 slab_unreclaimable:5202
[ 324.195095]  mapped:1175 shmem:4204 pagetables:2555 bounce:0
[ 324.195095]  free:8087 free_pcp:0 free_cma:0
[ 324.204574] Node 0 DMA free:15888kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 324.213795] lowmem_reserve[]: 0 2708 4673 4673
[ 324.215372] Node 0 DMA32 free:12836kB min:5008kB low:6260kB high:7512kB active_anon:2712464kB inactive_anon:7984kB active_file:4kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129216kB managed:2776588kB mlocked:0kB dirty:0kB writeback:0kB mapped:2880kB shmem:9704kB slab_reclaimable:4556kB slab_unreclaimable:9116kB kernel_stack:1184kB pagetables:4644kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 324.226159] lowmem_reserve[]: 0 0 1965 1965
[ 324.227795] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:184 all_unreclaimable? yes
[ 324.238987] lowmem_reserve[]: 0 0 0 0
[ 324.240614] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 2*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15888kB
[ 324.244417] Node 0 DMA32: 38*4kB (UM) 24*8kB (UME) 146*16kB (UME) 139*32kB (UME) 52*64kB (UME) 11*128kB (UME) 2*256kB (UM) 1*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 12888kB
[ 324.249520] Node 0 Normal: 24*4kB (UM) 0*8kB 2*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3744kB
[ 324.253481] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 324.255910] 4219 total pagecache pages
[ 324.257497] 0 pages in swap cache
[ 324.258996] Swap cache stats: add 0, delete 0, find 0/0
[ 324.260835] Free swap = 0kB
[ 324.262243] Total swap = 0kB
[ 324.263648] 1310589 pages RAM
[ 324.265051] 0 pages HighMem/MovableOnly
[ 324.266607] 109257 pages reserved
[ 324.268079] 0 pages hwpoisoned
[ 324.269503] Showing busy workqueues and worker pools:
[ 324.271285] workqueue events: flags=0x0
[ 324.272865]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=3/256
[ 324.274933]     pending: vmstat_shepherd, vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx]
[ 324.277426] workqueue events_freezable_power_: flags=0x84
[ 324.279303]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 324.281398]     in-flight: 9698:disk_events_workfn
[ 324.283242] workqueue vmstat: flags=0xc
[ 324.284849]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 324.286947]     pending: vmstat_update
[ 324.288647] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=10 idle: 9696 14 9665 9701 9675 9664 407 46 9691
(...snipped...)
[ 334.771342] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:184 all_unreclaimable? yes
(...snipped...)
[ 334.792901] Node 0 Normal: 24*4kB (UM) 0*8kB 2*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3744kB
(...snipped...)
[ 345.351647] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:184 all_unreclaimable? yes
(...snipped...)
[ 345.373938] Node 0 Normal: 24*4kB (UM) 0*8kB 2*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3744kB
(...snipped...)
[ 355.983345] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:184 all_unreclaimable? yes
(...snipped...)
[ 356.008948] Node 0 Normal: 24*4kB (UM) 0*8kB 2*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3744kB
(...snipped...)
[ 366.569312] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
(...snipped...)
[ 366.591131] Node 0 Normal: 24*4kB (UM) 0*8kB 2*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3744kB
(...snipped...)
[ 377.255040] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
(...snipped...)
[ 377.276782] Node 0 Normal: 24*4kB (UM) 0*8kB 2*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3744kB
(...snipped...)
[ 387.948890] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
(...snipped...)
[ 387.970629] Node 0 Normal: 24*4kB (UM) 0*8kB 2*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3744kB
(...snipped...)
[ 398.582824] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
(...snipped...)
[ 398.604575] Node 0 Normal: 24*4kB (UM) 0*8kB 2*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3744kB
(...snipped...)
[ 409.307406] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
(...snipped...)
[ 409.329012] Node 0 Normal: 24*4kB (UM) 0*8kB 2*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3744kB
(...snipped...)
[ 419.866932] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
(...snipped...)
[ 419.888642] Node 0 Normal: 24*4kB (UM) 0*8kB 2*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3744kB
(...snipped...)
[ 430.444391] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
(...snipped...)
[ 430.469660] Node 0 Normal: 24*4kB (UM) 0*8kB 3*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3760kB
(...snipped...)
[ 441.055258] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
(...snipped...)
[ 441.078312] Node 0 Normal: 24*4kB (UM) 0*8kB 3*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3760kB
(...snipped...)
[ 451.624394] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
(...snipped...)
[ 451.646614] Node 0 Normal: 24*4kB (UM) 0*8kB 3*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3760kB
(...snipped...)
[ 461.701021] MemAlloc-Info: 6 stalling task, 0 dying task, 0 victim task.
(...snipped...)
[ 461.798182] MemAlloc: kworker/1:8(9698) seq=10 gfp=0x2400000 order=0 delay=147661
[ 461.800145] kworker/1:8 R running task 0 9698 2 0x00000080
[ 461.802061] Workqueue: events_freezable_power_ disk_events_workfn
(...snipped...)
[ 462.141865] Mem-Info:
[ 462.142877] active_anon:1165831 inactive_anon:3486 isolated_anon:0
[ 462.142877]  active_file:1 inactive_file:0 isolated_file:0
[ 462.142877]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 462.142877]  slab_reclaimable:1906 slab_unreclaimable:5202
[ 462.142877]  mapped:1175 shmem:4204 pagetables:2555 bounce:0
[ 462.142877]  free:8087 free_pcp:0 free_cma:0
[ 462.152641] Node 0 DMA free:15888kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 462.161919] lowmem_reserve[]: 0 2708 4673 4673
[ 462.163515] Node 0 DMA32 free:12836kB min:5008kB low:6260kB high:7512kB active_anon:2712464kB inactive_anon:7984kB active_file:4kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129216kB managed:2776588kB mlocked:0kB dirty:0kB writeback:0kB mapped:2880kB shmem:9704kB slab_reclaimable:4556kB slab_unreclaimable:9116kB kernel_stack:1184kB pagetables:4644kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 462.174286] lowmem_reserve[]: 0 0 1965 1965
[ 462.175919] Node 0 Normal free:3624kB min:3632kB low:4540kB high:5448kB active_anon:1950860kB inactive_anon:5960kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2097152kB managed:2012836kB mlocked:0kB dirty:0kB writeback:0kB mapped:1820kB shmem:7112kB slab_reclaimable:3068kB slab_unreclaimable:11676kB kernel_stack:1952kB pagetables:5576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 462.187058] lowmem_reserve[]: 0 0 0 0
[ 462.188704] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 2*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15888kB
[ 462.192473] Node 0 DMA32: 38*4kB (UM) 24*8kB (UME) 146*16kB (UME) 139*32kB (UME) 52*64kB (UME) 11*128kB (UME) 2*256kB (UM) 1*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 12888kB
[ 462.198054] Node 0 Normal: 24*4kB (UM) 0*8kB 3*16kB (U) 29*32kB (U) 16*64kB (UM) 1*128kB (U) 0*256kB 1*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 3760kB
[ 462.202046] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 462.204481] 4219 total pagecache pages
[ 462.206068] 0 pages in swap cache
[ 462.207569] Swap cache stats: add 0, delete 0, find 0/0
[ 462.209416] Free swap = 0kB
[ 462.210833] Total swap = 0kB
[ 462.212240] 1310589 pages RAM
[ 462.213656] 0 pages HighMem/MovableOnly
[ 462.215229] 109257 pages reserved
[ 462.216696] 0 pages hwpoisoned
[ 462.218116] Showing busy workqueues and worker pools:
[ 462.219928] workqueue events: flags=0x0
[ 462.221501]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=3/256
[ 462.223582]     pending: vmstat_shepherd, vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx]
[ 462.226085] workqueue events_power_efficient: flags=0x80
[ 462.227943]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 462.230056]     pending: check_lifetime
[ 462.231713] workqueue events_freezable_power_: flags=0x84
[ 462.233577]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 462.235693]     in-flight: 9698:disk_events_workfn
[ 462.237540] workqueue vmstat: flags=0xc
[ 462.239172]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 462.241287]     pending: vmstat_update
[ 462.242970] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=10 idle: 9696 14 9665 9701 9675 9664 407 46 9691
----------

The above result was obtained using the patch below on Linux 4.4.
----------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fa39434..daf2a1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1375,6 +1375,28 @@ struct tlbflush_unmap_batch {
 	bool writable;
 };
 
+struct memalloc_info {
+	/* For locking and progress monitoring. */
+	unsigned int sequence;
+	/*
+	 * 0: not doing __GFP_RECLAIM allocation.
+	 * 1: doing non-recursive __GFP_RECLAIM allocation.
+	 * 2: doing recursive __GFP_RECLAIM allocation.
+	 */
+	u8 valid;
+	/*
+	 * bit 0: Will be reported as OOM victim.
+	 * bit 1: Will be reported as dying task.
+	 * bit 2: Will be reported as stalling task.
+	 */
+	u8 type;
+	/* Started time in jiffies as of valid == 1. */
+	unsigned long start;
+	/* Requested order and gfp flags as of valid == 1. */
+	unsigned int order;
+	gfp_t gfp;
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1813,6 +1835,9 @@ struct task_struct {
 	unsigned long task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	struct memalloc_info memalloc;
+#endif
 	/* CPU-specific state of this task */
 	struct thread_struct thread;
 	/*
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731..fb3004a 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -9,6 +9,9 @@ extern int sysctl_hung_task_warnings;
 extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
 					 void __user *buffer,
 					 size_t *lenp, loff_t *ppos);
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+extern unsigned long sysctl_memalloc_task_timeout_secs;
+#endif
 #else
 /* Avoid need for ifdefs elsewhere in the code */
 enum { sysctl_hung_task_timeout_secs = 0 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 1155eac..894221df 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1416,6 +1416,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->sequential_io_avg	= 0;
 #endif
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	p->memalloc.sequence = 0;
+#endif
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index e0f90c2..2abebb9 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -16,6 +16,7 @@
 #include <linux/export.h>
 #include <linux/sysctl.h>
 #include <linux/utsname.h>
+#include <linux/console.h>
 #include <trace/events/sched.h>
 
 /*
@@ -72,6 +73,207 @@ static struct notifier_block panic_block = {
 	.notifier_call = hung_task_panic,
 };
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+/*
+ * Zero means infinite timeout - no checking done:
+ */
+unsigned long __read_mostly sysctl_memalloc_task_timeout_secs =
+	CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT;
+static struct memalloc_info memalloc; /* Filled by is_stalling_task(). */
+
+static long memalloc_timeout_jiffies(unsigned long last_checked, long timeout)
+{
+	struct task_struct *g, *p;
+	long t;
+	unsigned long delta;
+
+	/* timeout of 0 will disable the watchdog */
+	if (!timeout)
+		return MAX_SCHEDULE_TIMEOUT;
+	/* At least wait for timeout duration. */
+	t = last_checked - jiffies + timeout * HZ;
+	if (t > 0)
+		return t;
+	/* Calculate how long to wait more. */
+	t = timeout * HZ;
+	delta = t - jiffies;
+
+	/*
+	 * We might see outdated values in "struct memalloc_info" here.
+	 * We will recheck later using is_stalling_task().
+	 */
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (likely(!p->memalloc.valid))
+			continue;
+		t = min_t(long, t, p->memalloc.start + delta);
+		if (unlikely(t <= 0))
+			goto stalling;
+	}
+ stalling:
+	rcu_read_unlock();
+	preempt_enable();
+	return t;
+}
+
+/**
+ * is_stalling_task - Check and copy a task's memalloc variable.
+ *
+ * @task:   A task to check.
+ * @expire: Timeout in jiffies.
+ *
+ * Returns true if a task is stalling, false otherwise.
+ */
+static bool is_stalling_task(const struct task_struct *task,
+			     const unsigned long expire)
+{
+	const struct memalloc_info *m = &task->memalloc;
+
+	/*
+	 * If start_memalloc_timer() is updating "struct memalloc_info" now,
+	 * we can ignore it because timeout jiffies cannot be expired as soon
+	 * as updating it completes.
+	 */
+	if (!m->valid || (m->sequence & 1))
+		return false;
+	smp_rmb(); /* Block start_memalloc_timer(). */
+	memalloc.start = m->start;
+	memalloc.order = m->order;
+	memalloc.gfp = m->gfp;
+	smp_rmb(); /* Unblock start_memalloc_timer(). */
+	memalloc.sequence = m->sequence;
+	/*
+	 * If start_memalloc_timer() started updating it while we read it,
+	 * we can ignore it for the same reason.
+	 */
+	if (!m->valid || (memalloc.sequence & 1))
+		return false;
+	/* This is a valid "struct memalloc_info". Check for timeout. */
+	return time_after_eq(expire, memalloc.start);
+}
+
+/* Check for memory allocation stalls. */
+static void check_memalloc_stalling_tasks(unsigned long timeout)
+{
+	char buf[128];
+	struct task_struct *g, *p;
+	unsigned long now;
+	unsigned long expire;
+	unsigned int sigkill_pending;
+	unsigned int memdie_pending;
+	unsigned int stalling_tasks;
+
+	cond_resched();
+	now = jiffies;
+	/*
+	 * Report tasks that stalled for more than half of timeout duration
+	 * because such tasks might be correlated with tasks that already
+	 * stalled for full timeout duration.
+	 */
+	expire = now - timeout * (HZ / 2);
+	/* Count stalling tasks, dying and victim tasks. */
+	sigkill_pending = 0;
+	memdie_pending = 0;
+	stalling_tasks = 0;
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		u8 type = 0;
+
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			type |= 1;
+			memdie_pending++;
+		}
+		if (fatal_signal_pending(p)) {
+			type |= 2;
+			sigkill_pending++;
+		}
+		if (is_stalling_task(p, expire)) {
+			type |= 4;
+			stalling_tasks++;
+		}
+		p->memalloc.type = type;
+	}
+	rcu_read_unlock();
+	preempt_enable();
+	if (!stalling_tasks)
+		return;
+	/* Report stalling tasks, dying and victim tasks. */
+	pr_warn("MemAlloc-Info: %u stalling task, %u dying task, %u victim task.\n",
+		stalling_tasks, sigkill_pending, memdie_pending);
+	cond_resched();
+	preempt_disable();
+	rcu_read_lock();
+ restart_report:
+	for_each_process_thread(g, p) {
+		bool can_cont;
+		u8 type;
+
+		if (likely(!p->memalloc.type))
+			continue;
+		p->memalloc.type = 0;
+		/* Recheck in case state changed meanwhile. */
+		type = 0;
+		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+			type |= 1;
+		if (fatal_signal_pending(p))
+			type |= 2;
+		if (is_stalling_task(p, expire)) {
+			type |= 4;
+			snprintf(buf, sizeof(buf),
+				 " seq=%u gfp=0x%x order=%u delay=%lu",
+				 memalloc.sequence >> 1, memalloc.gfp,
+				 memalloc.order, now - memalloc.start);
+		} else {
+			buf[0] = '\0';
+		}
+		if (unlikely(!type))
+			continue;
+		/*
+		 * Victim tasks get pending SIGKILL removed before arriving at
+		 * do_exit(). Therefore, print " exiting" instead for " dying".
+		 */
+		pr_warn("MemAlloc: %s(%u)%s%s%s%s%s\n", p->comm, p->pid, buf,
+			(p->state & TASK_UNINTERRUPTIBLE) ?
+			" uninterruptible" : "",
+			(p->flags & PF_EXITING) ? " exiting" : "",
+			(type & 2) ? " dying" : "",
+			(type & 1) ? " victim" : "");
+		sched_show_task(p);
+		debug_show_held_locks(p);
+		/*
+		 * Since there could be thousands of tasks to report, we always
+		 * sleep and try to flush printk() buffer after each report, in
+		 * order to avoid RCU stalls and reduce possibility of messages
+		 * being dropped by continuous printk() flood.
+		 *
+		 * Since not yet reported tasks have p->memalloc.type > 0, we
+		 * can simply restart this loop in case "g" or "p" went away.
+		 */
+		get_task_struct(g);
+		get_task_struct(p);
+		rcu_read_unlock();
+		preempt_enable();
+		schedule_timeout_interruptible(1);
+		preempt_disable();
+		rcu_read_lock();
+		can_cont = pid_alive(g) && pid_alive(p);
+		put_task_struct(p);
+		put_task_struct(g);
+		if (!can_cont)
+			goto restart_report;
+	}
+	rcu_read_unlock();
+	preempt_enable();
+	cond_resched();
+	/* Show memory information. (SysRq-m) */
+	show_mem(0);
+	/* Show workqueue state. */
+	show_workqueue_state();
+}
+#endif /* CONFIG_DETECT_MEMALLOC_STALL_TASK */
+
 static void check_hung_task(struct task_struct *t, unsigned long timeout)
 {
 	unsigned long switch_count = t->nvcsw + t->nivcsw;
@@ -185,10 +387,12 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 	rcu_read_unlock();
 }
 
-static unsigned long timeout_jiffies(unsigned long timeout)
+static unsigned long hung_timeout_jiffies(long last_checked, long timeout)
 {
 	/* timeout of 0 will disable the watchdog */
-	return timeout ? timeout * HZ : MAX_SCHEDULE_TIMEOUT;
+	if (!timeout)
+		return MAX_SCHEDULE_TIMEOUT;
+	return last_checked - jiffies + timeout * HZ;
 }
 
 /*
@@ -224,18 +428,36 @@ EXPORT_SYMBOL_GPL(reset_hung_task_detector);
  */
 static int watchdog(void *dummy)
 {
+	unsigned long hung_last_checked = jiffies;
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	unsigned long stall_last_checked = hung_last_checked;
+#endif
+
 	set_user_nice(current, 0);
 
 	for ( ; ; ) {
 		unsigned long timeout = sysctl_hung_task_timeout_secs;
-
-		while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
-			timeout = sysctl_hung_task_timeout_secs;
-
-		if (atomic_xchg(&reset_hung_task, 0))
+		long t = hung_timeout_jiffies(hung_last_checked, timeout);
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+		unsigned long timeout2 = sysctl_memalloc_task_timeout_secs;
+		long t2 = memalloc_timeout_jiffies(stall_last_checked,
						   timeout2);
+
+		if (t2 <= 0) {
+			check_memalloc_stalling_tasks(timeout2);
+			stall_last_checked = jiffies;
 			continue;
-
-		check_hung_uninterruptible_tasks(timeout);
+		}
+#else
+		long t2 = t;
+#endif
+		if (t <= 0) {
+			if (!atomic_xchg(&reset_hung_task, 0))
+				check_hung_uninterruptible_tasks(timeout);
+			hung_last_checked = jiffies;
+			continue;
+		}
+		schedule_timeout_interruptible(min(t, t2));
 	}
 
 	return 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index dc6858d..96460aa 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1061,6 +1061,16 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &neg_one,
 	},
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+	{
+		.procname	= "memalloc_task_timeout_secs",
+		.data		= &sysctl_memalloc_task_timeout_secs,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_dohung_task_timeout_secs,
+		.extra2		= &hung_task_timeout_max,
+	},
+#endif
 #endif
 #ifdef CONFIG_COMPAT
 	{
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 8c15b29..26d2c91 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -812,6 +812,30 @@ config BOOTPARAM_HUNG_TASK_PANIC_VALUE
 	default 0 if !BOOTPARAM_HUNG_TASK_PANIC
 	default 1 if BOOTPARAM_HUNG_TASK_PANIC
 
+config DETECT_MEMALLOC_STALL_TASK
+	bool "Detect tasks stalling inside memory allocator"
+	default n
+	depends on DETECT_HUNG_TASK
+	help
+	  This option emits warning messages and traces when memory
+	  allocation requests are stalling, in order to catch unexplained
+	  hangups/reboots caused by memory allocation stalls.
+
+config DEFAULT_MEMALLOC_TASK_TIMEOUT
+	int "Default timeout for stalling task detection (in seconds)"
+	depends on DETECT_MEMALLOC_STALL_TASK
+	default 10
+	help
+	  This option controls the default timeout (in seconds) used
+	  to determine when a task has become non-responsive and should
+	  be considered stalling inside memory allocator.
+
+	  It can be adjusted at runtime via the kernel.memalloc_task_timeout_secs
+	  sysctl or by writing a value to
+	  /proc/sys/kernel/memalloc_task_timeout_secs.
+
+	  A timeout of 0 disables the check. The default is 10 seconds.
+
 endmenu # "Debug lockups and hangs"
 
 config PANIC_ON_OOPS
diff --git a/mm/mlock.c b/mm/mlock.c
index 339d9e0..d6006b1 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -172,7 +172,7 @@ static void __munlock_isolation_failed(struct page *page)
  */
 unsigned int munlock_vma_page(struct page *page)
 {
-	unsigned int nr_pages;
+	int nr_pages;
 	struct zone *zone = page_zone(page);
 
 	/* For try_to_munlock() and to serialize with page migration */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d666df..4e4e4b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3165,6 +3165,37 @@ got_pg:
 	return page;
 }
 
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+static void start_memalloc_timer(const gfp_t gfp_mask, const int order)
+{
+	struct memalloc_info *m = &current->memalloc;
+
+	/* We don't check for stalls for !__GFP_RECLAIM allocations. */
+	if (!(gfp_mask & __GFP_RECLAIM))
+		return;
+	/* We don't check for stalls for nested __GFP_RECLAIM allocations */
+	if (!m->valid) {
+		m->sequence++;
+		smp_wmb(); /* Block is_stalling_task(). */
+		m->start = jiffies;
+		m->order = order;
+		m->gfp = gfp_mask;
+		smp_wmb(); /* Unblock is_stalling_task(). */
+		m->sequence++;
+	}
+	m->valid++;
+}
+
+static void stop_memalloc_timer(const gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_RECLAIM)
+		current->memalloc.valid--;
+}
+#else
+#define start_memalloc_timer(gfp_mask, order) do { } while (0)
+#define stop_memalloc_timer(gfp_mask) do { } while (0)
+#endif
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -3232,7 +3263,9 @@ retry_cpuset:
 		alloc_mask = memalloc_noio_flags(gfp_mask);
 		ac.spread_dirty_pages = false;
 
+		start_memalloc_timer(alloc_mask, order);
 		page = __alloc_pages_slowpath(alloc_mask, order, &ac);
+		stop_memalloc_timer(alloc_mask);
 	}
 
 	if (kmemcheck_enabled && page)
----------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>