Hi Ext4 experts, I need your help to debug one ext4 issue. We consistently see Ext4 stuck at wait_transaction_locked after another process is killed by cgroup because of oom. We have two processes keeping writing data to the same disk, and one was killed because of oom; the other process will stall at all I/O operations pretty soon. After did some search online, and found a couple of similar reports. Not sure whether this is a known issue. The following is the system information and part of dmesg, and I also have the vmcore generated by sysrq+c. [tmp]$ uname -a Linux sedhaswell04-node-1 3.10.0-327.22.2.el7.cohesity.x86_64 #1 SMP Tue Jul 5 12:41:09 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux [tmp]$ cat /etc/centos-release CentOS Linux release 7.2.1511 (Core) [96766.113014] Killed process 81593 (java) total-vm:19626048kB, anon-rss:1424920kB, file-rss:5144kB [97047.933383] INFO: task jbd2/sdb1-8:3229 blocked for more than 120 seconds. [97047.941093] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [97047.949879] jbd2/sdb1-8 D ffff88103a587000 0 3229 2 0x00000080 [97047.949883] ffff88104ad1bc90 0000000000000046 ffff88085129d080 ffff88104ad1bfd8 [97047.949886] ffff88104ad1bfd8 ffff88104ad1bfd8 ffff88085129d080 ffff88104ad1bda0 [97047.949888] ffff881040d2e0c0 ffff88085129d080 ffff88104ad1bd88 ffff88103a587000 [97047.949891] Call Trace: [97047.949899] [<ffffffff8163b7f9>] schedule+0x29/0x70 [97047.949924] [<ffffffffa0201118>] jbd2_journal_commit_transaction+0x248/0x19a0 [jbd2] [97047.949932] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 [97047.949935] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [97047.949941] [<ffffffff8108d6fe>] ? try_to_del_timer_sync+0x5e/0x90 [97047.949947] [<ffffffffa0206d79>] kjournald2+0xc9/0x260 [jbd2] [97047.949949] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [97047.949953] [<ffffffffa0206cb0>] ? commit_timeout+0x10/0x10 [jbd2] [97047.949957] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [97047.949959] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [97047.949963] [<ffffffff816467d8>] ret_from_fork+0x58/0x90 [97047.949965] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [97047.950005] INFO: task kworker/u386:2:69088 blocked for more than 120 seconds. [97047.958096] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [97047.966869] kworker/u386:2 D ffff88104ab27120 0 69088 2 0x00000080 [97047.966878] Workqueue: writeback bdi_writeback_workfn (flush-8:16) [97047.966880] ffff880b8b6638e8 0000000000000046 ffff8810482be780 ffff880b8b663fd8 [97047.966897] ffff880b8b663fd8 ffff880b8b663fd8 ffff8810482be780 ffff881040d2e000 [97047.966900] ffff881040d2e078 00000000000fb086 ffff88103a587000 ffff88104ab27120 [97047.966902] Call Trace: [97047.966905] [<ffffffff8163b7f9>] schedule+0x29/0x70 [97047.966911] [<ffffffffa01fe085>] wait_transaction_locked+0x85/0xd0 [jbd2] [97047.966913] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [97047.966917] [<ffffffffa01fe400>] start_this_handle+0x2b0/0x5d0 [jbd2] [97047.966925] [<ffffffffa02c8f03>] ? _scsih_qcmd+0x2d3/0x420 [mpt3sas] [97047.966928] [<ffffffff811c153a>] ? kmem_cache_alloc+0x1ba/0x1d0 [97047.966932] [<ffffffffa01fe933>] jbd2__journal_start+0xf3/0x1e0 [jbd2] [97047.966946] [<ffffffffa0377fc4>] ? ext4_writepages+0x434/0xd60 [ext4] [97047.966956] [<ffffffffa03a8ea9>] __ext4_journal_start_sb+0x69/0xe0 [ext4] [97047.966962] [<ffffffffa0377fc4>] ext4_writepages+0x434/0xd60 [ext4] [97047.966967] [<ffffffff81174c38>] ? generic_writepages+0x58/0x80 [97047.966970] [<ffffffff81175cde>] do_writepages+0x1e/0x40 [97047.966972] [<ffffffff81208a20>] __writeback_single_inode+0x40/0x220 [97047.966974] [<ffffffff8120948e>] writeback_sb_inodes+0x25e/0x420 [97047.966976] [<ffffffff812096ef>] __writeback_inodes_wb+0x9f/0xd0 [97047.966978] [<ffffffff81209f33>] wb_writeback+0x263/0x2f0 [97047.966980] [<ffffffff810a0d96>] ? set_worker_desc+0x86/0xb0 [97047.966982] [<ffffffff8120c005>] bdi_writeback_workfn+0x115/0x460 [97047.966985] [<ffffffff8109d5fb>] process_one_work+0x17b/0x470 [97047.966986] [<ffffffff8109e3cb>] worker_thread+0x11b/0x400 [97047.966988] [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400 [97047.966990] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [97047.966992] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [97047.966994] [<ffffffff816467d8>] ret_from_fork+0x58/0x90 [97047.966996] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [97047.967003] INFO: task bridge_exec:79205 blocked for more than 120 seconds. [97047.974801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >From crash dump, all the process stuck at UN state has similar backtrace: : 69088 TASK: ffff8810482be780 CPU: 13 COMMAND: "kworker/u386:2" #0 [ffff880b8b663888] __schedule at ffffffff8163b15d #1 [ffff880b8b6638f0] schedule at ffffffff8163b7f9 #2 [ffff880b8b663900] wait_transaction_locked at ffffffffa01fe085 [jbd2] #3 [ffff880b8b663958] start_this_handle at ffffffffa01fe400 [jbd2] #4 [ffff880b8b6639f8] jbd2__journal_start at ffffffffa01fe933 [jbd2] #5 [ffff880b8b663a40] __ext4_journal_start_sb at ffffffffa03a8ea9 [ext4] #6 [ffff880b8b663a80] ext4_writepages at ffffffffa0377fc4 [ext4] #7 [ffff880b8b663bb8] do_writepages at ffffffff81175cde #8 [ffff880b8b663bc8] __writeback_single_inode at ffffffff81208a20 #9 [ffff880b8b663c08] writeback_sb_inodes at ffffffff8120948e #10 [ffff880b8b663cb0] __writeback_inodes_wb at ffffffff812096ef #11 [ffff880b8b663cf8] wb_writeback at ffffffff81209f33 #12 [ffff880b8b663d70] bdi_writeback_workfn at ffffffff8120c005 #13 [ffff880b8b663e20] process_one_work at ffffffff8109d5fb #14 [ffff880b8b663e68] worker_thread at ffffffff8109e3cb #15 [ffff880b8b663ec8] kthread at ffffffff810a5aef #16 [ffff880b8b663f50] ret_from_fork at ffffffff816467d8 PID: 79204 TASK: ffff88084b050b80 CPU: 21 COMMAND: "bridge_exec" #0 [ffff880d7795fc70] __schedule at ffffffff8163b15d #1 [ffff880d7795fcd8] schedule at ffffffff8163b7f9 #2 [ffff880d7795fce8] do_exit at ffffffff81081504 #3 [ffff880d7795fd78] do_group_exit at ffffffff81081dff #4 [ffff880d7795fda8] get_signal_to_deliver at ffffffff81092c10 #5 [ffff880d7795fe40] do_signal at ffffffff81014417 #6 [ffff880d7795ff30] do_notify_resume at ffffffff81014adf #7 [ffff880d7795ff50] int_signal at ffffffff81646b3d RIP: 00007fb2e6b2a9b9 RSP: 00007fff14cbc3d8 RFLAGS: 00000246 RAX: fffffffffffffe00 RBX: 00000000048e2020 RCX: ffffffffffffffff RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000000489bcd8 RBP: 00007fff14cbc420 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000048e2258 R13: 000000000471d898 R14: 00000000048e2188 R15: 000000000489bcc0 ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b Any advice or suggestion will be appreciated! Best Regards, Roy -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html