On 12/21/2016 6:39 PM, Robert LeBlanc wrote: > I hit a new backtrace today, hopefully it adds something. > > # cat /proc/19659/stack > [<ffffffff815304d1>] iscsit_stop_session+0x1b1/0x1c0 > [<ffffffff81521c62>] iscsi_check_for_session_reinstatement+0x1e2/0x270 > [<ffffffff81524660>] iscsi_target_check_for_existing_instances+0x30/0x40 > [<ffffffff815247a8>] iscsi_target_do_login+0x138/0x630 > [<ffffffff815259be>] iscsi_target_start_negotiation+0x4e/0xa0 > [<ffffffff8152355e>] __iscsi_target_login_thread+0x83e/0xf20 > [<ffffffff81523c64>] iscsi_target_login_thread+0x24/0x30 > [<ffffffff810a3059>] kthread+0xd9/0xf0 > [<ffffffff817732d5>] ret_from_fork+0x25/0x30 > [<ffffffffffffffff>] 0xffffffffffffffff > > # cat /proc/21342/stack > [<ffffffffa0292b10>] __ib_drain_sq+0x190/0x1c0 [ib_core] > [<ffffffffa0292b65>] ib_drain_sq+0x25/0x30 [ib_core] > [<ffffffffa0292d72>] ib_drain_qp+0x12/0x30 [ib_core] > [<ffffffffa062c5ff>] isert_wait_conn+0x5f/0x2d0 [ib_isert] > [<ffffffff815309b7>] iscsit_close_connection+0x157/0x860 > [<ffffffff8151f10b>] iscsit_take_action_for_connection_exit+0x7b/0xf0 > [<ffffffff81530265>] iscsi_target_rx_thread+0x95/0xa0 > [<ffffffff810a3059>] kthread+0xd9/0xf0 > [<ffffffff817732d5>] ret_from_fork+0x25/0x30 > [<ffffffffffffffff>] 0xffffffffffffffff > > # ps aux | grep iscsi | grep D > root 19659 0.0 0.0 0 0 ? D 16:12 0:00 [iscsi_np] > root 21342 0.0 0.0 0 0 ? D 16:29 0:00 [iscsi_trx] > ---------------- > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 That looks suspiciously like the __ib_drain_sq is stuck forever waiting on a completion that never comes. > > On Thu, Dec 15, 2016 at 1:38 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote: >> Nicholas, >> >> I've found that the kernels I used were not able to be inspected using >> crash and I could not build the debug info for them. So I built a new >> 4.9 kernel and verified that I could inspect the crash. It is located >> at [1]. >> >> [1] http://mirrors.betterservers.com/trace/crash2.tar.xz >> ---------------- >> Robert LeBlanc >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >> >> >> On Mon, Dec 12, 2016 at 4:57 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote: >>> Nicholas, >>> >>> After lots of set backs and having to give up trying to get kernel >>> dumps on our "production" systems, I've been able to work out the >>> issues we had with kdump and replicate the issue on my dev boxes. I >>> have dumps from 4.4.30 and 4.9-rc8 (makedumpfile would not dump, so it >>> is a straight copy of /proc/vmcore from the crash kernel). In each >>> crash directory, I put a details.txt file that has the process IDs >>> that were having problems and a brief description of the set-up at the >>> time. This was mostly replicated by starting fio and pulling the >>> Infiniband cable until fio gave up. This hardware also has Mellanox >>> ConnectX4-LX cards and I also replicated the issue over RoCE using 4.9 >>> since it has the drivers in-box. Please let me know if you need more >>> info, I can test much faster now. The cores/kernels/modules are >>> located at [1]. >>> >>> [1] http://mirrors.betterservers.com/trace/crash.tar.xz >>> >>> Thanks, >>> Robert >>> ---------------- >>> Robert LeBlanc >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >>> >>> >>> On Fri, Nov 4, 2016 at 3:57 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote: >>>> We hit this yesterday, this time it was on the tx thread (the other >>>> ones before seem to be on the rx thread). We weren't able to get a >>>> kernel dump on this. We'll try to get one next time. >>>> >>>> # ps axuw | grep "D.*iscs[i]" >>>> root 12383 0.0 0.0 0 0 ? D Nov03 0:04 [iscsi_np] >>>> root 23016 0.0 0.0 0 0 ? D Nov03 0:00 [iscsi_ttx] >>>> root 23018 0.0 0.0 0 0 ? D Nov03 0:00 [iscsi_ttx] >>>> # cat /proc/12383/stack >>>> [<ffffffff814f24af>] iscsit_stop_session+0x19f/0x1d0 >>>> [<ffffffff814e3c66>] iscsi_check_for_session_reinstatement+0x1e6/0x270 >>>> [<ffffffff814e6620>] iscsi_target_check_for_existing_instances+0x30/0x40 >>>> [<ffffffff814e6770>] iscsi_target_do_login+0x140/0x640 >>>> [<ffffffff814e7b0c>] iscsi_target_start_negotiation+0x1c/0xb0 >>>> [<ffffffff814e585b>] iscsi_target_login_thread+0xa9b/0xfc0 >>>> [<ffffffff8109d7c8>] kthread+0xd8/0xf0 >>>> [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70 >>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>> # cat /proc/23016/stack >>>> [<ffffffff814ce0d9>] target_wait_for_sess_cmds+0x49/0x1a0 >>>> [<ffffffffa058b92b>] isert_wait_conn+0x1ab/0x2f0 [ib_isert] >>>> [<ffffffff814f2642>] iscsit_close_connection+0x162/0x870 >>>> [<ffffffff814e110f>] iscsit_take_action_for_connection_exit+0x7f/0x100 >>>> [<ffffffff814f122a>] iscsi_target_tx_thread+0x1aa/0x1d0 >>>> [<ffffffff8109d7c8>] kthread+0xd8/0xf0 >>>> [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70 >>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>> # cat /proc/23018/stack >>>> [<ffffffff814ce0d9>] target_wait_for_sess_cmds+0x49/0x1a0 >>>> [<ffffffffa058b92b>] isert_wait_conn+0x1ab/0x2f0 [ib_isert] >>>> [<ffffffff814f2642>] iscsit_close_connection+0x162/0x870 >>>> [<ffffffff814e110f>] iscsit_take_action_for_connection_exit+0x7f/0x100 >>>> [<ffffffff814f122a>] iscsi_target_tx_thread+0x1aa/0x1d0 >>>> [<ffffffff8109d7c8>] kthread+0xd8/0xf0 >>>> [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70 >>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>> >>>> From dmesg: >>>> [ 394.476332] INFO: rcu_sched self-detected stall on CPU >>>> [ 394.476334] 20-...: (23976 ticks this GP) >>>> idle=edd/140000000000001/0 softirq=292/292 fqs=18788 >>>> [ 394.476336] (t=24003 jiffies g=3146 c=3145 q=0) >>>> [ 394.476337] Task dump for CPU 20: >>>> [ 394.476338] kworker/u68:2 R running task 0 12906 2 0x00000008 >>>> [ 394.476345] Workqueue: isert_comp_wq isert_cq_work [ib_isert] >>>> [ 394.476346] ffff883f2fe38000 00000000f805705e ffff883f7fd03da8 >>>> ffffffff810ac8ff >>>> [ 394.476347] 0000000000000014 ffffffff81adb680 ffff883f7fd03dc0 >>>> ffffffff810af239 >>>> [ 394.476348] 0000000000000015 ffff883f7fd03df0 ffffffff810e1cd0 >>>> ffff883f7fd17b80 >>>> [ 394.476348] Call Trace: >>>> [ 394.476354] <IRQ> [<ffffffff810ac8ff>] sched_show_task+0xaf/0x110 >>>> [ 394.476355] [<ffffffff810af239>] dump_cpu_task+0x39/0x40 >>>> [ 394.476357] [<ffffffff810e1cd0>] rcu_dump_cpu_stacks+0x80/0xb0 >>>> [ 394.476359] [<ffffffff810e6100>] rcu_check_callbacks+0x540/0x820 >>>> [ 394.476360] [<ffffffff810afe11>] ? account_system_time+0x81/0x110 >>>> [ 394.476363] [<ffffffff810faa60>] ? tick_sched_do_timer+0x50/0x50 >>>> [ 394.476364] [<ffffffff810eb599>] update_process_times+0x39/0x60 >>>> [ 394.476365] [<ffffffff810fa815>] tick_sched_handle.isra.17+0x25/0x60 >>>> [ 394.476366] [<ffffffff810faa9d>] tick_sched_timer+0x3d/0x70 >>>> [ 394.476368] [<ffffffff810ec182>] __hrtimer_run_queues+0x102/0x290 >>>> [ 394.476369] [<ffffffff810ec668>] hrtimer_interrupt+0xa8/0x1a0 >>>> [ 394.476372] [<ffffffff81052c65>] local_apic_timer_interrupt+0x35/0x60 >>>> [ 394.476374] [<ffffffff8172423d>] smp_apic_timer_interrupt+0x3d/0x50 >>>> [ 394.476376] [<ffffffff817224f7>] apic_timer_interrupt+0x87/0x90 >>>> [ 394.476379] <EOI> [<ffffffff810d71be>] ? console_unlock+0x41e/0x4e0 >>>> [ 394.476380] [<ffffffff810d757c>] vprintk_emit+0x2fc/0x500 >>>> [ 394.476382] [<ffffffff810d78ff>] vprintk_default+0x1f/0x30 >>>> [ 394.476384] [<ffffffff81174dde>] printk+0x5d/0x74 >>>> [ 394.476388] [<ffffffff814bce21>] transport_lookup_cmd_lun+0x1d1/0x200 >>>> [ 394.476390] [<ffffffff814ee8c0>] iscsit_setup_scsi_cmd+0x230/0x540 >>>> [ 394.476392] [<ffffffffa058dbf3>] isert_rx_do_work+0x3f3/0x7f0 [ib_isert] >>>> [ 394.476394] [<ffffffffa058e174>] isert_cq_work+0x184/0x770 [ib_isert] >>>> [ 394.476396] [<ffffffff8109740f>] process_one_work+0x14f/0x400 >>>> [ 394.476397] [<ffffffff81097c84>] worker_thread+0x114/0x470 >>>> [ 394.476398] [<ffffffff8171d32a>] ? __schedule+0x34a/0x7f0 >>>> [ 394.476399] [<ffffffff81097b70>] ? rescuer_thread+0x310/0x310 >>>> [ 394.476400] [<ffffffff8109d7c8>] kthread+0xd8/0xf0 >>>> [ 394.476402] [<ffffffff8109d6f0>] ? kthread_park+0x60/0x60 >>>> [ 394.476403] [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70 >>>> [ 394.476404] [<ffffffff8109d6f0>] ? kthread_park+0x60/0x60 >>>> [ 405.716632] Unexpected ret: -104 send data 360 >>>> [ 405.721711] tx_data returned -32, expecting 360. >>>> ---------------- >>>> Robert LeBlanc >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 When you combine this trace with the newest one, it really makes me thing there is something of a bad interaction between the new drain cq API and the iser/isert implementation to use said API. Sagi, Christoph? -- Doug Ledford <dledford@xxxxxxxxxx> GPG Key ID: B826A3330E572FDD Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
Attachment:
signature.asc
Description: OpenPGP digital signature