[BUG] deadlock involving mmap_sem

Hi,

I found a deadlock problem involving mmap_sem on the RHEL6.1 kernel.
I haven't verified it yet, but IIUC the problem still exists in the latest kernel.

- summary
  I ran a page-migration test via the cpuset interface: I tried to move a
  process whose memory usage was too big for the destination cpuset.
  Usually the process gets oom-killed and the "/bin/echo" completes
  successfully, but sometimes the process and khugepaged get stuck after
  the oom kill.

- detail
  (On an 8-CPU, 4-node (1GB/node) system with no swap)
  1. create cpuset directories
[root@RHEL6 ~]# mkdir /cgroup/cpuset/src
[root@RHEL6 ~]# mkdir /cgroup/cpuset/dst
[root@RHEL6 ~]#
[root@RHEL6 ~]# echo 1-2,5-6 >/cgroup/cpuset/src/cpuset.cpus
[root@RHEL6 ~]# echo 3,7 >/cgroup/cpuset/dst/cpuset.cpus
[root@RHEL6 ~]#
[root@RHEL6 ~]# echo 1-2 >/cgroup/cpuset/src/cpuset.mems
[root@RHEL6 ~]# echo 3 >/cgroup/cpuset/dst/cpuset.mems
[root@RHEL6 ~]#
[root@RHEL6 ~]# echo 1 >/cgroup/cpuset/dst/cpuset.memory_migrate

  2. register the shell to the src (2 nodes) directory
[root@RHEL6 ~]# echo $$ >/cgroup/cpuset/src/tasks

  3. run a program which uses 1GB of anonymous memory
[root@RHEL6 ~]# ~nishimura/c/malloc 1024
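
  (The source of the test program is not included in this report; just for
  reference, here is a minimal sketch of what such a program presumably
  looks like. The argument is taken as megabytes; everything beyond that is
  an assumption, not the actual code.)

/*
 * Sketch of a test program like the "malloc" command above.  The actual
 * program is not shown in this report; this is only an assumption of what
 * it does: allocate argv[1] MB of anonymous memory, touch every page so
 * the memory is really faulted in, then wait until killed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        size_t mb = (argc > 1) ? strtoul(argv[1], NULL, 10) : 1024;
        char *p = malloc(mb << 20);

        if (!p) {
                perror("malloc");
                return 1;
        }
        memset(p, 0xaa, mb << 20);      /* fault in every page */
        pause();                        /* keep the memory mapped until killed */
        return 0;
}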

  4. (In another terminal) move the process to the dst (1 node) directory
[root@RHEL6 ~]# /bin/echo `pidof malloc` >/cgroup/cpuset/dst/tasks

  5. the oom-killer killed "malloc", but both "malloc" and "khugepaged" are stuck.
[  815.353853] echo invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0
[  815.356779] echo cpuset=/ mems_allowed=3
[  815.358313] Pid: 2000, comm: echo Tainted: G           ---------------- T 2.6.32-131.0.15.el6.x86_64 #1
[  815.362078] Call Trace:
[  815.363139]  [<ffffffff810c0101>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[  815.365912]  [<ffffffff811102bb>] ? oom_kill_process+0xcb/0x2e0
[  815.368303]  [<ffffffff81110880>] ? select_bad_process+0xd0/0x110
[  815.370681]  [<ffffffff81110918>] ? __out_of_memory+0x58/0xc0
[  815.372996]  [<ffffffff81110b19>] ? out_of_memory+0x199/0x210
[  815.375274]  [<ffffffff811202dd>] ? __alloc_pages_nodemask+0x80d/0x8b0
[  815.377982]  [<ffffffff81150b99>] ? new_node_page+0x29/0x30
[  815.380253]  [<ffffffff8115fb4f>] ? migrate_pages+0xaf/0x4c0
[  815.382557]  [<ffffffff81150b70>] ? new_node_page+0x0/0x30
[  815.384714]  [<ffffffff811524ed>] ? migrate_to_node+0xbd/0xd0
[  815.386977]  [<ffffffff8115269e>] ? do_migrate_pages+0x19e/0x210
[  815.389559]  [<ffffffff810c1198>] ? cpuset_migrate_mm+0x78/0xa0
[  815.392080]  [<ffffffff810c25e7>] ? cpuset_attach+0x197/0x1d0
[  815.394532]  [<ffffffff810be87e>] ? cgroup_attach_task+0x21e/0x660
[  815.397230]  [<ffffffff810bf00c>] ? cgroup_tasks_write+0x5c/0xf0
[  815.399584]  [<ffffffff810bc01a>] ? cgroup_file_write+0x2ba/0x320
[  815.402101]  [<ffffffff8113e17a>] ? do_mmap_pgoff+0x33a/0x380
[  815.404580]  [<ffffffff81172718>] ? vfs_write+0xb8/0x1a0
[  815.406975]  [<ffffffff810d1b62>] ? audit_syscall_entry+0x272/0x2a0
[  815.409956]  [<ffffffff81173151>] ? sys_write+0x51/0x90
[  815.412542]  [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b
[  815.415442] Mem-Info:
[  815.416653] Node 0 DMA per-cpu:
[  815.418303] CPU    0: hi:    0, btch:   1 usd:   0
[  815.420658] CPU    1: hi:    0, btch:   1 usd:   0
[  815.422824] CPU    2: hi:    0, btch:   1 usd:   0
[  815.425148] CPU    3: hi:    0, btch:   1 usd:   0
[  815.427334] CPU    4: hi:    0, btch:   1 usd:   0
[  815.429560] CPU    5: hi:    0, btch:   1 usd:   0
[  815.431638] CPU    6: hi:    0, btch:   1 usd:   0
[  815.433772] CPU    7: hi:    0, btch:   1 usd:   0
[  815.435893] Node 0 DMA32 per-cpu:
[  815.437550] CPU    0: hi:  186, btch:  31 usd:  75
[  815.439680] CPU    1: hi:  186, btch:  31 usd:   0
[  815.441929] CPU    2: hi:  186, btch:  31 usd:   0
[  815.443979] CPU    3: hi:  186, btch:  31 usd:   0
[  815.446172] CPU    4: hi:  186, btch:  31 usd: 167
[  815.448317] CPU    5: hi:  186, btch:  31 usd:   2
[  815.450592] CPU    6: hi:  186, btch:  31 usd:  17
[  815.452708] CPU    7: hi:  186, btch:  31 usd:   2
[  815.454845] Node 1 DMA32 per-cpu:
[  815.456623] CPU    0: hi:  186, btch:  31 usd:   0
[  815.458773] CPU    1: hi:  186, btch:  31 usd:  83
[  815.460964] CPU    2: hi:  186, btch:  31 usd:   0
[  815.463183] CPU    3: hi:  186, btch:  31 usd:  41
[  815.465387] CPU    4: hi:  186, btch:  31 usd:   0
[  815.467330] CPU    5: hi:  186, btch:  31 usd:  51
[  815.469392] CPU    6: hi:  186, btch:  31 usd:  24
[  815.471383] CPU    7: hi:  186, btch:  31 usd: 170
[  815.473307] Node 2 DMA32 per-cpu:
[  815.474903] CPU    0: hi:  186, btch:  31 usd:   0
[  815.476899] CPU    1: hi:  186, btch:  31 usd:   0
[  815.478838] CPU    2: hi:  186, btch:  31 usd: 178
[  815.480822] CPU    3: hi:  186, btch:  31 usd:   0
[  815.482828] CPU    4: hi:  186, btch:  31 usd:   0
[  815.484750] CPU    5: hi:  186, btch:  31 usd: 137
[  815.486746] CPU    6: hi:  186, btch:  31 usd: 185
[  815.488759] CPU    7: hi:  186, btch:  31 usd: 169
[  815.490770] Node 3 DMA32 per-cpu:
[  815.492238] CPU    0: hi:  186, btch:  31 usd:   0
[  815.494280] CPU    1: hi:  186, btch:  31 usd:   0
[  815.496273] CPU    2: hi:  186, btch:  31 usd: 128
[  815.498229] CPU    3: hi:  186, btch:  31 usd: 177
[  815.500208] CPU    4: hi:  186, btch:  31 usd: 125
[  815.502160] CPU    5: hi:  186, btch:  31 usd:   0
[  815.504046] CPU    6: hi:  186, btch:  31 usd:  10
[  815.506003] CPU    7: hi:  186, btch:  31 usd:  30
[  815.507917] Node 3 Normal per-cpu:
[  815.509546] CPU    0: hi:  186, btch:  31 usd:   0
[  815.511588] CPU    1: hi:  186, btch:  31 usd:   1
[  815.513599] CPU    2: hi:  186, btch:  31 usd: 113
[  815.515592] CPU    3: hi:  186, btch:  31 usd: 162
[  815.517509] CPU    4: hi:  186, btch:  31 usd: 131
[  815.519363] CPU    5: hi:  186, btch:  31 usd:  66
[  815.521372] CPU    6: hi:  186, btch:  31 usd: 111
[  815.523314] CPU    7: hi:  186, btch:  31 usd:  81
[  815.525329] active_anon:229332 inactive_anon:11795 isolated_anon:22914
[  815.525332]  active_file:2555 inactive_file:5002 isolated_file:0
[  815.525333]  unevictable:0 dirty:0 writeback:0 unstable:0
[  815.525335]  free:707140 slab_reclaimable:1756 slab_unreclaimable:23190
[  815.525336]  mapped:1240 shmem:49 pagetables:943 bounce:0
[  815.537613] Node 0 DMA free:15720kB min:500kB low:624kB high:748kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15328kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  815.552215] lowmem_reserve[]: 0 994 994 994
[  815.554229] Node 0 DMA32 free:887236kB min:33328kB low:41660kB high:49992kB active_anon:2108kB inactive_anon:8kB active_file:3836kB inactive_file:8860kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1018080kB mlocked:0kB dirty:0kB writeback:0kB mapped:2088kB shmem:52kB slab_reclaimable:1788kB slab_unreclaimable:26864kB kernel_stack:1312kB pagetables:476kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  815.570204] lowmem_reserve[]: 0 0 0 0
[  815.572056] Node 1 DMA32 free:992108kB min:33856kB low:42320kB high:50784kB active_anon:1424kB inactive_anon:0kB active_file:2832kB inactive_file:5312kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:0kB writeback:0kB mapped:1036kB shmem:56kB slab_reclaimable:1756kB slab_unreclaimable:22516kB kernel_stack:80kB pagetables:160kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  815.587848] lowmem_reserve[]: 0 0 0 0
[  815.589745] Node 2 DMA32 free:897996kB min:33856kB low:42320kB high:50784kB active_anon:1408kB inactive_anon:0kB active_file:3444kB inactive_file:5832kB unevictable:0kB isolated(anon):91656kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:0kB writeback:0kB mapped:1788kB shmem:40kB slab_reclaimable:1728kB slab_unreclaimable:21344kB kernel_stack:80kB pagetables:2572kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  815.605725] lowmem_reserve[]: 0 0 0 0
[  815.607589] Node 3 DMA32 free:18664kB min:16692kB low:20864kB high:25036kB active_anon:486584kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:509940kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  815.622718] lowmem_reserve[]: 0 0 505 505
[  815.624728] Node 3 Normal free:16836kB min:16928kB low:21160kB high:25392kB active_anon:425804kB inactive_anon:47172kB active_file:108kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:517120kB mlocked:0kB dirty:0kB writeback:0kB mapped:48kB shmem:48kB slab_reclaimable:1752kB slab_unreclaimable:22036kB kernel_stack:72kB pagetables:564kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:11168 all_unreclaimable? yes
[  815.640879] lowmem_reserve[]: 0 0 0 0
[  815.642694] Node 0 DMA: 2*4kB 0*8kB 2*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15720kB
[  815.647939] Node 0 DMA32: 687*4kB 529*8kB 220*16kB 84*32kB 39*64kB 19*128kB 11*256kB 6*512kB 3*1024kB 2*2048kB 209*4096kB = 887236kB
[  815.653880] Node 1 DMA32: 779*4kB 598*8kB 271*16kB 119*32kB 33*64kB 11*128kB 9*256kB 9*512kB 7*1024kB 4*2048kB 232*4096kB = 992108kB
[  815.659720] Node 2 DMA32: 324*4kB 291*8kB 278*16kB 130*32kB 44*64kB 25*128kB 19*256kB 13*512kB 6*1024kB 5*2048kB 208*4096kB = 898120kB
[  815.665564] Node 3 DMA32: 0*4kB 1*8kB 0*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 4*4096kB = 18664kB
[  815.670913] Node 3 Normal: 267*4kB 285*8kB 141*16kB 77*32kB 27*64kB 9*128kB 7*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 16836kB
[  815.676648] 7605 total pagecache pages
[  815.678217] 0 pages in swap cache
[  815.679620] Swap cache stats: add 0, delete 0, find 0/0
[  815.681834] Free swap  = 0kB
[  815.683028] Total swap = 0kB
[  815.718410] 1048575 pages RAM
[  815.719786] 34946 pages reserved
[  815.721251] 30969 pages shared
[  815.722574] 255845 pages non-shared
[  815.724017] Out of memory: kill process 1992 (malloc) score 16445 or a child
[  815.726946] Killed process 1992 (malloc) vsz:1052484kB, anon-rss:1048640kB, file-rss:288kB

[  961.025171] INFO: task khugepaged:1967 blocked for more than 120 seconds.
[  961.028371] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  961.031434] khugepaged    D 0000000000000001     0  1967      2 0x00000080
[  961.034394]  ffff88003db7bc90 0000000000000046 0000000000000000 ffff8800c0010e40
[  961.037555]  0000000000015f80 0000000000000002 0000000000000003 000000010007d852
[  961.040758]  ffff88003d6bf078 ffff88003db7bfd8 000000000000f598 ffff88003d6bf078
[  961.043758] Call Trace:
[  961.044676]  [<ffffffff814dd755>] rwsem_down_failed_common+0x95/0x1d0
[  961.046956]  [<ffffffff814dd8b3>] rwsem_down_write_failed+0x23/0x30
[  961.049783]  [<ffffffff8126e573>] call_rwsem_down_write_failed+0x13/0x20
[  961.052957]  [<ffffffff814dcdb2>] ? down_write+0x32/0x40
[  961.055573]  [<ffffffff8116ad67>] khugepaged+0x847/0x1200
[  961.058148]  [<ffffffff814db337>] ? thread_return+0x4e/0x777
[  961.060995]  [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40
[  961.063992]  [<ffffffff8116a520>] ? khugepaged+0x0/0x1200
[  961.066763]  [<ffffffff8108ddf6>] kthread+0x96/0xa0
[  961.069008]  [<ffffffff8100c1ca>] child_rip+0xa/0x20
[  961.071600]  [<ffffffff8108dd60>] ? kthread+0x0/0xa0
[  961.073958]  [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
[  961.076522] INFO: task malloc:1992 blocked for more than 120 seconds.
[  961.079588] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  961.083502] malloc        D 0000000000000003     0  1992   1780 0x00100080
[  961.086867]  ffff8800be25fba0 0000000000000086 0000000000000002 ffff8800bef716f0
[  961.090925]  000280da00000000 0000000000000286 0000000000000030 ffff880080010790
[  961.094932]  ffff8800bd6565f8 ffff8800be25ffd8 000000000000f598 ffff8800bd6565f8
[  961.098926] Call Trace:
[  961.100127]  [<ffffffff8111fbe1>] ? __alloc_pages_nodemask+0x111/0x8b0
[  961.103422]  [<ffffffff814dd755>] rwsem_down_failed_common+0x95/0x1d0
[  961.106451]  [<ffffffff8110d1de>] ? find_get_page+0x1e/0xa0
[  961.109089]  [<ffffffff8110e5f3>] ? filemap_fault+0xd3/0x500
[  961.111797]  [<ffffffff814dd8e6>] rwsem_down_read_failed+0x26/0x30
[  961.114869]  [<ffffffff8126e544>] call_rwsem_down_read_failed+0x14/0x30
[  961.117958]  [<ffffffff814dcde4>] ? down_read+0x24/0x30
[  961.120636]  [<ffffffff810b53d8>] acct_collect+0x48/0x1b0
[  961.123166]  [<ffffffff8106c422>] do_exit+0x7e2/0x860
[  961.125793]  [<ffffffff8107d94d>] ? __sigqueue_free+0x3d/0x50
[  961.128583]  [<ffffffff8108126f>] ? __dequeue_signal+0xdf/0x1f0
[  961.131500]  [<ffffffff8106c4f8>] do_group_exit+0x58/0xd0
[  961.133925]  [<ffffffff81081946>] get_signal_to_deliver+0x1f6/0x460
[  961.136957]  [<ffffffff8100a365>] do_signal+0x75/0x800
[  961.139483]  [<ffffffff81055dcf>] ? finish_task_switch+0x4f/0xf0
[  961.142407]  [<ffffffff814db337>] ? thread_return+0x4e/0x777
[  961.145037]  [<ffffffff8100ab80>] do_notify_resume+0x90/0xc0
[  961.147830]  [<ffffffff8100b440>] int_signal+0x12/0x17


- investigation
  From the call traces above, "malloc" and "khugepaged" are stuck at:
(malloc)
0xffffffff810b53d8 is in acct_collect (kernel/acct.c:613).
608             unsigned long vsize = 0;
609
610             if (group_dead && current->mm) {
611                     struct vm_area_struct *vma;
612                     down_read(&current->mm->mmap_sem);
613                     vma = current->mm->mmap;
614                     while (vma) {
615                             vsize += vma->vm_end - vma->vm_start;
616                             vma = vma->vm_next;
617                     }

(khugepaged)
0xffffffff8116ad62 is in khugepaged (mm/huge_memory.c:1696).
1691            /*
1692             * Prevent all access to pagetables with the exception of
1693             * gup_fast later hanlded by the ptep_clear_flush and the VM
1694             * handled by the anon_vma lock + PG_lock.
1695             */
1696            down_write(&mm->mmap_sem);
1697            if (unlikely(khugepaged_test_exit(mm)))
1698                    goto out;
1699
1700            vma = find_vma(mm, address);

  And "echo" seems to be trying to allocate memory, calling out_of_memory() repeatedly.

crash> bt 2000
PID: 2000   TASK: ffff88007e0dd580  CPU: 4   COMMAND: "echo"
 #0 [ffff88007d7e9728] schedule at ffffffff814db2e9
 #1 [ffff88007d7e97f0] schedule_timeout at ffffffff814dbfe2
 #2 [ffff88007d7e98a0] schedule_timeout_uninterruptible at ffffffff814dc14e
 #3 [ffff88007d7e98b0] out_of_memory at ffffffff81110b3c
 #4 [ffff88007d7e9950] __alloc_pages_nodemask at ffffffff811202dd
 #5 [ffff88007d7e9a70] new_node_page at ffffffff81150b99
 #6 [ffff88007d7e9a80] migrate_pages at ffffffff8115fb4f
 #7 [ffff88007d7e9b30] migrate_to_node at ffffffff811524ed
 #8 [ffff88007d7e9ba0] do_migrate_pages at ffffffff8115269e
 #9 [ffff88007d7e9c40] cpuset_migrate_mm at ffffffff810c1198
#10 [ffff88007d7e9c60] cpuset_attach at ffffffff810c25e7
#11 [ffff88007d7e9d20] cgroup_attach_task at ffffffff810be87e
#12 [ffff88007d7e9e10] cgroup_tasks_write at ffffffff810bf00c
#13 [ffff88007d7e9e40] cgroup_file_write at ffffffff810bc01a
#14 [ffff88007d7e9ef0] vfs_write at ffffffff81172718
#15 [ffff88007d7e9f30] sys_write at ffffffff81173151
#16 [ffff88007d7e9f80] system_call_fastpath at ffffffff8100b172
    RIP: 00000030cc4d95e0  RSP: 00007fffdbc44e58  RFLAGS: 00010206
    RAX: 0000000000000001  RBX: ffffffff8100b172  RCX: 00000030cc78c330
    RDX: 0000000000000005  RSI: 00007fcdc9e89000  RDI: 0000000000000001
    RBP: 00007fcdc9e89000   R8: 00007fcdc9e73700   R9: 0000000000000000
    R10: 00000000ffffffff  R11: 0000000000000246  R12: 0000000000000005
    R13: 00000030cc78b780  R14: 0000000000000005  R15: 00000030cc78b780
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b


  So, I think what's happening is:

  1. "echo" tries to allocate memory while holding (down_read) the mmap_sem
     of "malloc" in do_migrate_pages(), and invokes the oom-killer.
  2. "malloc" is oom-killed and tries to exit.
  3. Before "malloc" acquires (down_read) its mmap_sem on the exit path,
     "khugepaged" tries to acquire (down_write) the mmap_sem of "malloc"
     and blocks.
  4. Because "khugepaged" is already waiting for the mmap_sem, "malloc"
     cannot acquire (down_read) the mmap_sem either, so it cannot exit.
     IOW, it cannot free its memory. (A minimal userspace sketch of this
     rwsem queueing behavior follows after this list.)
  5. "echo" calls out_of_memory() repeatedly, but select_bad_process() returns
     ERR_PTR(-1UL) because "malloc" is already being killed, so nothing else
     gets killed and the allocation never succeeds.
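
  To illustrate step 4: the rwsem queues a new reader behind a writer that
  is already waiting, even though the semaphore is only read-held at that
  point. Below is a minimal userspace sketch of that interaction. It is not
  kernel code: it uses glibc's writer-preference rwlock attribute as a
  stand-in for the rwsem's queueing behavior, and the thread names are just
  labels for the three tasks above.

/*
 * Minimal userspace model of the rwsem interaction described above.
 * NOT kernel code: glibc's writer-preference rwlock attribute stands in
 * for the kernel rwsem's queueing behavior.
 *
 * Build with: gcc -pthread rwsem-sketch.c
 */
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_rwlock_t mmap_sem;

/* "echo": holds mmap_sem for read and never gets out of the OOM loop */
static void *echo_thread(void *arg)
{
        pthread_rwlock_rdlock(&mmap_sem);
        printf("echo:       down_read(mmap_sem), looping in out_of_memory()\n");
        sleep(10);                      /* stand-in for the endless OOM retry loop */
        pthread_rwlock_unlock(&mmap_sem);
        return NULL;
}

/* "khugepaged": queues a write request behind the reader */
static void *khugepaged_thread(void *arg)
{
        sleep(1);
        printf("khugepaged: down_write(mmap_sem), blocks behind echo\n");
        pthread_rwlock_wrlock(&mmap_sem);
        pthread_rwlock_unlock(&mmap_sem);
        return NULL;
}

/* "malloc": its exit path (acct_collect) needs a read lock */
static void *malloc_thread(void *arg)
{
        struct timespec ts;

        sleep(2);                       /* make sure the writer is already queued */
        printf("malloc:     down_read(mmap_sem) in acct_collect() ...\n");
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += 5;
        if (pthread_rwlock_timedrdlock(&mmap_sem, &ts) == ETIMEDOUT)
                printf("malloc:     queued behind the waiting writer, cannot exit\n");
        else
                pthread_rwlock_unlock(&mmap_sem);
        return NULL;
}

int main(void)
{
        pthread_rwlockattr_t attr;
        pthread_t t[3];
        int i;

        pthread_rwlockattr_init(&attr);
        /* prefer writers, so new readers queue behind a waiting writer */
        pthread_rwlockattr_setkind_np(&attr,
                        PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
        pthread_rwlock_init(&mmap_sem, &attr);

        pthread_create(&t[0], NULL, echo_thread, NULL);
        pthread_create(&t[1], NULL, khugepaged_thread, NULL);
        pthread_create(&t[2], NULL, malloc_thread, NULL);
        for (i = 0; i < 3; i++)
                pthread_join(t[i], NULL);
        return 0;
}

  In the real deadlock "echo" never drops its read lock (it keeps looping in
  out_of_memory() waiting for "malloc" to free memory), so the cycle never
  breaks; the sleeps and the timeout above exist only so the demo terminates.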


Any ideas how to fix this problem?

Thanks,
Daisuke Nishimura.
