On Sat, 2016-10-29 at 14:10 -0400, TomK wrote: > On 10/29/2016 3:50 AM, Nicholas A. Bellinger wrote: > > Hi TomK & Co, > > > > On Fri, 2016-10-28 at 02:01 -0400, TomK wrote: > >> On 10/26/2016 8:08 AM, TomK wrote: > >>> On 10/26/2016 3:20 AM, Nicholas A. Bellinger wrote: > > <SNIP> > >> > >> 1) As soon as I re-add the bad disk without the patch, I loose the LUN > >> off the ESXi hosts. Same thing happens with the patch. No change. The > >> disk is pulling things down. Worse, the kernel panics and locks me out > >> of the system (http://microdevsys.com/linux-lio/messages-oct-27-2016.txt) : > >> > > > > So after groking these logs, the point when ata6 failing scsi_device is > > holding outstanding I/O beyond ESX FC host side timeouts, manifests > > itself as ABORT_TASK tag=1122276: > > > > Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077e > > Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 077f > > Oct 28 00:42:56 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 0780 > > Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0 > > Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0) > > Oct 28 00:42:57 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80f:20: qla_target(0): task abort (tag=1122276) > > Oct 28 00:42:57 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1122276 > > > > and eventually: > > > > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: ABTS_RECV_24XX: instance 0 > > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: qla_target(0): task abort (s_id=1:5:0, tag=1122276, param=0) > > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f881:20: unable to find cmd in driver or LIO for tag 0x111fe4 > > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f854:20: qla_target(0): __qlt_24xx_handle_abts() failed: -2 > > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a40, status=4 > > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31 > > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000) > > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f123a80, status=0 > > Oct 28 00:43:18 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0 > > > > The outstanding se_cmd with tag=1122276 for ata6 completes back to > > target-core, allowing ABORT_TASK + TMR_FUNCTION_COMPLETE status to > > progress: > > > > Oct 28 00:44:25 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts > > Oct 28 00:44:29 mbpc-pc kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310) > > Oct 28 00:44:29 mbpc-pc kernel: ata6.00: configured for UDMA/133 > > Oct 28 00:44:29 mbpc-pc kernel: ata6.00: retrying FLUSH 0xea Emask 0x4 > > Oct 28 00:44:29 mbpc-pc kernel: ata6.00: device reported invalid CHS sector 0 > > Oct 28 00:44:29 mbpc-pc kernel: ata6: EH complete > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000) > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20: qla_target(0): terminating exchange for aborted cmd=ffff880099392fa8 (se_cmd=ffff880099392fa8, tag=1122276) > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000) > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20: qla_target(0): Terminating cmd ffff880099392fa8 with incorrect state 2 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 6) > > Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1122276 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: Intermediate CTIO received (status 8) > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f0f0) status 0x0 state 0x0 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f300, status=0 > > Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122996 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f340) status 0x2 state 0x0 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f550, status=2 > > Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122204 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f590) status 0x2 state 0x0 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f7a0, status=2 > > Oct 28 00:44:30 mbpc-pc kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1122240 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 0 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f813:20: TM response mcmd (ffff8800b1e2f7e0) status 0x2 state 0x0 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000) > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f124280, status=0 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: ABTS_RESP_24XX: compl_status 31 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending retry TERM EXCH CTIO7 (ha=ffff88010f110000) > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff88010f1242c0, status=0 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending task mgmt ABTS response (ha=ffff88010f110000, atio=ffff8800b1e2f9f0, status=2 > > > > and continues with TMR_TASK_DOES_NOT_EXIST for other oustanding tags.. > > > > Until, target-core session release for two tcm_qla2xxx host ports: > > > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff880099a80a80 / sess ffff880111665cc0 from port 21:03:00:1b:32:74:b6:cb loop_id 0x03 s_id 01:05:00 logout 1 keep 1 els_logo 0 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff8800993849f8] ox_id 0773 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f884:20: qlt_free_session_done: se_sess ffff88009e8f9040 / sess ffff8800abe95480 from port 50:01:43:80:16:77:99:38 loop_id 0x02 s_id 01:00:00 logout 1 keep 1 els_logo 0 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff880099199f38] ox_id 05f7 > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0. > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=2 loop-id=3 portid=010500. > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-2870:20: Async-logout - hdl=3 loop-id=2 portid=010000. > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping. > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000) > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM EXCH CTIO (ha=ffff88010f110000) > > > > From there, ELS with unexpected NOTIFY_ACK received start to occur: > > > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4801:20: DPC handler waking up, dpc_flags=0x0. > > Oct 28 00:44:30 mbpc-pc kernel: qla2xxx [0000:04:00.0]-4800:20: DPC handler sleeping. > > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000 > > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO > > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03 > > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000) > > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1 > > Oct 28 00:44:34 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received > > > > ELS packets for the same two host ports continue: > > > > Oct 28 00:46:40 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts > > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000 > > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO > > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:00:01 ELS opcode: 0x03 > > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000) > > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff8800abe95480 [0] wwn 50:01:43:80:16:77:99:38 with PLOGI ACK to wwn 50:01:43:80:16:77:99:38 s_id 01:00:00, ref=1 > > Oct 28 00:46:40 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received > > Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05b1 > > Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a9977d8] > > Oct 28 00:46:46 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a9977d8] ox_id 05b1 > > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type d ox_id 0000 > > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e82e:20: IMMED_NOTIFY ATIO > > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f826:20: qla_target(0): Port ID: 0x00:05:01 ELS opcode: 0x03 > > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending TERM ELS CTIO (ha=ffff88010f110000) > > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f897:20: Linking sess ffff880111665cc0 [0] wwn 21:03:00:1b:32:74:b6:cb with PLOGI ACK to wwn 21:03:00:1b:32:74:b6:cb s_id 01:05:00, ref=1 > > Oct 28 00:46:49 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e862:20: qla_target(0): Unexpected NOTIFY_ACK received > > > > And eventually, the 360 second hung task timeout warnings appear: > > > > Oct 28 00:49:48 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts > > Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05ba > > Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508] > > Oct 28 00:49:55 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05ba > > Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 05bb > > Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e818:20: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[ffff88009a988508] > > Oct 28 00:50:16 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: qlt_free_cmd: se_cmd[ffff88009a988508] ox_id 05bb > > Jan 3 15:34:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon successfully started > > Oct 28 00:50:33 mbpc-pc kernel: INFO: task kworker/0:2:31731 blocked for more than 360 seconds. > > Oct 28 00:50:33 mbpc-pc kernel: Not tainted 4.8.4 #2 > > Oct 28 00:50:33 mbpc-pc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > Oct 28 00:50:33 mbpc-pc kernel: kworker/0:2 D ffff88011affb968 0 31731 2 0x00000080 > > Oct 28 00:50:33 mbpc-pc kernel: Workqueue: events qlt_free_session_done [qla2xxx] > > Oct 28 00:50:33 mbpc-pc kernel: ffff88011affb968 ffff88011affb8d8 ffff880013514940 0000000000000006 > > Oct 28 00:50:33 mbpc-pc kernel: ffff8801140fe880 ffffffff81f998c2 0000000000000000 ffff880100000000 > > Oct 28 00:50:33 mbpc-pc kernel: ffffffff810bdaaa ffffffff00000000 ffffffff00000051 ffff880100000000 > > Oct 28 00:50:33 mbpc-pc kernel: Call Trace: > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810bdaaa>] ? vprintk_emit+0x27a/0x4d0 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8115cc5c>] ? printk+0x46/0x48 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162e7ec>] schedule_timeout+0x9c/0xe0 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162cfa0>] wait_for_completion+0xc0/0xf0 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810923e0>] ? try_to_wake_up+0x260/0x260 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81161d73>] ? mempool_free+0x33/0x90 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa08f76ad>] target_wait_for_sess_cmds+0x4d/0x1b0 [target_core_mod] > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa00e7188>] ? qla2x00_post_work+0x58/0x70 [qla2xxx] > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa0286f69>] tcm_qla2xxx_free_session+0x49/0x90 [tcm_qla2xxx] > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffffa01447e9>] qlt_free_session_done+0xf9/0x3d0 [qla2xxx] > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81080639>] process_one_work+0x189/0x4e0 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8107d915>] ? wq_worker_waking_up+0x15/0x70 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109eb59>] ? idle_balance+0x79/0x290 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8108150d>] worker_thread+0x16d/0x520 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162bb3d>] ? __schedule+0x2fd/0x6a0 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff810813a0>] ? maybe_create_worker+0x110/0x110 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8109130e>] ? schedule_tail+0x1e/0xc0 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40 > > Oct 28 00:50:33 mbpc-pc kernel: [<ffffffff81085f20>] ? kthread_freezable_should_stop+0x70/0x70 > > > >> > >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: > >> Intermediate CTIO received (status 6) > >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending > >> TERM EXCH CTIO (ha=ffff88010ecb0000) > >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f814:20: > >> qla_target(0): terminating exchange for aborted cmd=ffff88009af9f488 > >> (se_cmd=ffff88009af9f488, tag=1131312) > >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending > >> TERM EXCH CTIO (ha=ffff88010ecb0000) > >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e851:20: > >> qla_target(0): Terminating cmd ffff88009af9f488 with incorrect state 2 > >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending > >> TERM EXCH CTIO (ha=ffff88010ecb0000) > >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: > >> Intermediate CTIO received (status 6) > >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: > >> Intermediate CTIO received (status 8) > >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e81c:20: Sending > >> TERM EXCH CTIO (ha=ffff88010ecb0000) > >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f81d:20: > >> Intermediate CTIO received (status 6) > >> Oct 28 01:19:59 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e874:20: > >> qlt_free_cmd: se_cmd[ffff88009af9f488] ox_id 0673 > >> Oct 28 01:19:59 mbpc-pc kernel: ------------[ cut here ]------------ > >> Oct 28 01:19:59 mbpc-pc kernel: kernel BUG at > >> drivers/scsi/qla2xxx/qla_target.c:3319! > >> Oct 28 01:19:59 mbpc-pc kernel: invalid opcode: 0000 [#1] SMP > >> Oct 28 01:19:59 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc > >> tcm_loop target_core_file target_core_iblock target_core_pscsi > >> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables > >> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM > >> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87 > >> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev > >> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt > >> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse > >> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456 > >> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq > >> libcrc32c joydev sg serio_raw e1000 r8169 mii kvm_amd kvm > >> snd_hda_codec_realtek snd_hda_codec_generic irqbypass pcspkr k10temp > >> snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq > >> snd_seq_device snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core wmi > >> shpchp acpi_cpufreq ext4 mbcache jbd2 qla2xxx scsi_transport_fc floppy > >> firewire_ohci f > >> Oct 28 01:19:59 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi > >> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror > >> dm_region_hash dm_log dm_mod > >> Oct 28 01:19:59 mbpc-pc kernel: CPU: 0 PID: 296 Comm: kworker/u16:6 Not > >> tainted 4.8.4 #2 > >> Oct 28 01:19:59 mbpc-pc kernel: Hardware name: Gigabyte Technology Co., > >> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010 > >> Oct 28 01:19:59 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work > >> [target_core_mod] > >> Oct 28 01:19:59 mbpc-pc kernel: task: ffff8801109623c0 task.stack: > >> ffff880110968000 > >> Oct 28 01:19:59 mbpc-pc kernel: RIP: 0010:[<ffffffffa0143810>] > >> [<ffffffffa0143810>] qlt_free_cmd+0x160/0x180 [qla2xxx] > >> Oct 28 01:19:59 mbpc-pc kernel: RSP: 0018:ffff88011096bb18 EFLAGS: 00010202 > >> Oct 28 01:19:59 mbpc-pc kernel: RAX: 0000000000000051 RBX: > >> ffff88009af9f488 RCX: 0000000000000006 > >> Oct 28 01:19:59 mbpc-pc kernel: RDX: 0000000000000007 RSI: > >> 0000000000000007 RDI: ffff88011fc0cb40 > >> Oct 28 01:19:59 mbpc-pc kernel: RBP: ffff88011096bb48 R08: > >> 0000000000000000 R09: ffffffff81fa4765 > >> Oct 28 01:19:59 mbpc-pc kernel: R10: 0000000000000074 R11: > >> 0000000000000002 R12: ffff8801137770c0 > >> Oct 28 01:19:59 mbpc-pc kernel: R13: ffff8800a126eaf0 R14: > >> ffff88009af9f510 R15: 0000000000000296 > >> Oct 28 01:19:59 mbpc-pc kernel: FS: 0000000000000000(0000) > >> GS:ffff88011fc00000(0000) knlGS:0000000000000000 > >> Oct 28 01:19:59 mbpc-pc kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > >> 0000000080050033 > >> Oct 28 01:19:59 mbpc-pc kernel: CR2: 00007f8eef58d000 CR3: > >> 00000000cabad000 CR4: 00000000000006f0 > >> Oct 28 01:19:59 mbpc-pc kernel: Stack: > >> Oct 28 01:19:59 mbpc-pc kernel: ffff880000000673 ffff88009af9f488 > >> ffff8800a126eaf0 ffff88009af9f59c > >> Oct 28 01:19:59 mbpc-pc kernel: ffff88009af9f488 ffff8800a126eaf0 > >> ffff88011096bb58 ffffffffa027f7f4 > >> Oct 28 01:19:59 mbpc-pc kernel: ffff88011096bbb8 ffffffffa08f758c > >> ffff88009af9f510 ffff88009af9f488 > >> Oct 28 01:19:59 mbpc-pc kernel: Call Trace: > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa027f7f4>] > >> tcm_qla2xxx_release_cmd+0x14/0x30 [tcm_qla2xxx] > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f758c>] > >> target_release_cmd_kref+0xac/0x110 [target_core_mod] > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f7627>] > >> target_put_sess_cmd+0x37/0x70 [target_core_mod] > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f46f7>] > >> core_tmr_abort_task+0x107/0x160 [target_core_mod] > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffffa08f6aa4>] > >> target_tmr_work+0x154/0x160 [target_core_mod] > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81080639>] > >> process_one_work+0x189/0x4e0 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810d060c>] ? > >> del_timer_sync+0x4c/0x60 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108131e>] ? > >> maybe_create_worker+0x8e/0x110 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8108150d>] > >> worker_thread+0x16d/0x520 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810923f2>] ? > >> default_wake_function+0x12/0x20 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810a6f06>] ? > >> __wake_up_common+0x56/0x90 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ? > >> maybe_create_worker+0x110/0x110 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff810813a0>] ? > >> maybe_create_worker+0x110/0x110 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8109130e>] ? > >> schedule_tail+0x1e/0xc0 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff8162f60f>] ret_from_fork+0x1f/0x40 > >> Oct 28 01:19:59 mbpc-pc kernel: [<ffffffff81085f20>] ? > >> kthread_freezable_should_stop+0x70/0x70 > >> Oct 28 01:19:59 mbpc-pc kernel: Code: 0d 00 00 48 c7 c7 00 2a 16 a0 e8 > >> ac 32 f2 e0 48 83 c4 18 5b 41 5c 41 5d c9 c3 48 8b bb 90 02 00 00 e8 85 > >> da 07 e1 e9 30 ff ff ff <0f> 0b eb fe 0f 0b 66 2e 0f 1f 84 00 00 00 00 > >> 00 eb f4 66 66 66 > >> Oct 28 01:19:59 mbpc-pc kernel: RIP [<ffffffffa0143810>] > >> qlt_free_cmd+0x160/0x180 [qla2xxx] > >> Oct 28 01:19:59 mbpc-pc kernel: RSP <ffff88011096bb18> > >> Oct 28 01:19:59 mbpc-pc kernel: ---[ end trace 2551bf47a19dbe2e ]--- > >> > > > > Mmm. > > > > This BUG_ON is signaling a qla_tgt_cmd descriptor is being freed while > > qlt_handle_cmd_for_atio() has queued it for backend execution, but > > qla_tgt_cmd->work -> __qlt_do_work() has not executed. > > > >> > >> > >> 2) This works with a new disk that's just been inserted. No issues. > >> > >> > > > > Thanks for verifying this scenario. > > > >> > >> > >> The kernel had the patch in both scenarios. So appears we can't > >> function on a degraded array that looses 1 (RAID 5 / 6 ) or 2 ( RAID 6 ) > >> disks, at the moment, even though the array itself is fine. Perhaps > >> it's the nature of the failed disk. > >> > > > > AFAICT, the hung task involves ABORT_TASK across tcm_qla2xxx session > > reinstatement, when backend I/O latency is high enough to cause > > ABORT_TASK operations across FC fabric host side session reset. > > > > From logs alone, it's unclear if the failing backend ata6 is leaking I/O > > (indefinitely) when hung task warning happen, but the preceding ata6 > > failures + device resets seem to indicate I/O completions are just > > taking a really long time to complete. > > > > Also, it's unclear if the BUG_ON(cmd->cmd_in_wq) in qlt_free_cmd() is a > > side effect of the earlier hung task, or separate tcm_qla2xxx session > > reinstatement bug. > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > Thanks Nicholas. > > Is it possible the RAID 6 array, after attempting to write sectors to a > bad disk, received an error and returned a status message that the QLA > could not interpret? Thereby sending the QLA into a routine expecting a > recognized message and simply timing out? All the while the array > simply 'moved on' along it's merry way. (For example, a message akin > to "Unexpected xyz received, ignoring" type of message that should be > interpreted and actioned on.) Or perhaps the software RAID 6 isn't > returning anything meaningful to the QLA driver. (If it isn't though, > it can be argued that the QLA / Target Drivers shouldn't have to care if > the array is working.) So target-core waits for all outstanding backend I/O to complete during session reinstatement. The two hung task warnings observed here mean target-core I/O descriptors from each two tcm_qla2xxx ports are waiting to be completed during session reinstatement, but this never happens. Which means one of three things: 1) The backend below target-core is leaking I/O completions, which is a bug outside of target-core and needs to be identified. 2) There is a target-core bug leaking se_cmd->cmd_kref, preventing the final reference release to occur. 3) There is a tcm_qla2xxx bug leaking se_cmd->cmd_kref, preventing the final reference release to occur. > > The 100% util on /dev/sdf lasted for less then 30 seconds. Based on the above, between when tag=1122276 got ABORT_TASK and ata6 finally gave back the descriptor to allow TMR_FUNCTION_COMPLETE was at least 90 seconds. This does not include the extra time from initial I/O submission to when ESX SCSI timeouts fire to generate ABORT_TASK, which IIRC depends upon the FC host LLD. Btw, I don't recall libata device timeouts being 90+ seconds plus, which looks a little strange.. This timeout is in /sys/class/scsi_device/$HCTL/device/eh_timeout. So assuming MD is not holding onto I/O much beyond backend device eh_timeout during failure, a simple work-around is to keep the combined backend eh_timeout and MD consumer I/O timeout lower than ESX FC SCSI timeouts. >From the logs above, it would likely mask this specific bug if it's related to #2 or #3. > Well below > the 120s default timeout (I've upped this to 360 for the purpose of > testing this scenario) > > I do see these: > > http://microdevsys.com/linux-lio/messages-mailing-list > Oct 23 19:50:26 mbpc-pc kernel: qla2xxx [0000:04:00.0]-5811:20: > Asynchronous PORT UPDATE ignored 0000/0004/0600. > > but I don't think it's that one as it doesn't correlate well at all. > These are unrelated to the hung task warnings. > If there's another test that I can do to get more info, please feel free > to suggest. > Ideally, being able to generate a vmcore crashdump is the most helpful. This involves manually triggering a crash after you observe the hung tasks. It's reasonably easy to setup once CONFIG_KEXEC=y + CONFIG_CRASH_DUMP=y are enabled: https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes I don't know if you'll run into problems on v4.8.y as per Anil's earlier email, but getting a proper vmcore for analysis is usually the fastest path to root cause. Short of that, it would be helpful to identify the state of the se_cmd descriptors getting leaked. Here's a quick patch to add some more verbosity: diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index 7dfefd6..9b93a2c 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -2657,9 +2657,13 @@ void target_wait_for_sess_cmds(struct se_session *se_sess) list_for_each_entry_safe(se_cmd, tmp_cmd, &se_sess->sess_wait_list, se_cmd_list) { - pr_debug("Waiting for se_cmd: %p t_state: %d, fabric state:" - " %d\n", se_cmd, se_cmd->t_state, - se_cmd->se_tfo->get_cmd_state(se_cmd)); + printk("Waiting for se_cmd: %p t_state: %d, fabric state:" + " %d se_cmd_flags: 0x%08x transport_state: 0x%08x" + " CDB: 0x%02x\n", + se_cmd, se_cmd->t_state, + se_cmd->se_tfo->get_cmd_state(se_cmd), + se_cmd->se_cmd_flags, se_cmd->transport_state, + se_cmd->t_task_cdb[0]); spin_lock_irqsave(&se_cmd->t_state_lock, flags); tas = (se_cmd->transport_state & CMD_T_TAS); @@ -2671,9 +2675,13 @@ void target_wait_for_sess_cmds(struct se_session *se_sess) } wait_for_completion(&se_cmd->cmd_wait_comp); - pr_debug("After cmd_wait_comp: se_cmd: %p t_state: %d" - " fabric state: %d\n", se_cmd, se_cmd->t_state, - se_cmd->se_tfo->get_cmd_state(se_cmd)); + printk("After cmd_wait_comp: se_cmd: %p t_state: %d" + " fabric state: %d se_cmd_flags: 0x%08x transport_state:" + " 0x%08x CDB: 0x%02x\n", + se_cmd, se_cmd->t_state, + se_cmd->se_tfo->get_cmd_state(se_cmd), + se_cmd->se_cmd_flags, se_cmd->transport_state, + se_cmd->t_task_cdb[0]); se_cmd->se_tfo->release_cmd(se_cmd); } -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html