TMR / Commands Getting Stuck on Completion Path After Port Timeout

Hi,

I'm running into a problem where commands get stuck in the completion callback path. Even simple commands like TEST UNIT READY, which are constant-time and unconditionally return success status, are getting stuck. I can tell that the commands make it through tcm_qla2xxx_handle_cmd() down to sbc_parse_cdb(), because if I put invalid fields into the CDB I can see the log messages from target_cmd_size_check(). My suspicion is that when the port state goes down, we don't properly cancel/fail pending commands, or a wake-up event is missed, so the queue in tcm_qla2xxx is left holding commands that will never complete, and new incoming commands then get blocked behind these stuck ones.
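
To make the suspicion concrete, here's a minimal, purely illustrative C sketch of the pattern I think I'm seeing. It is not the actual target_core_mod code, and the fake_* names are made up; it just shows how a waiter on a struct completion blocks forever if the normal completion path that would have called complete() was torn down along with the port:

    /*
     * Illustrative sketch only -- not real target_core_mod code.
     * Shows the "missed complete()" pattern I suspect: the abort/TMR
     * path blocks on a per-command completion that the normal I/O
     * completion path would have signalled, but that path went away
     * with the port.
     */
    #include <linux/completion.h>

    struct fake_cmd {                       /* hypothetical stand-in for se_cmd */
            struct completion done;         /* signalled when the command finishes */
    };

    static void fake_cmd_init(struct fake_cmd *cmd)
    {
            init_completion(&cmd->done);
    }

    /* Normal completion path: runs when the HBA reports the command done. */
    static void fake_cmd_finish(struct fake_cmd *cmd)
    {
            complete(&cmd->done);           /* wakes up any waiter below */
    }

    /*
     * Abort/TMR path: waits for the command to finish.  If the port is
     * dead and fake_cmd_finish() is never called, this blocks forever --
     * which is what the hung-task trace in transport_wait_for_tasks()
     * looks like to me.
     */
    static void fake_abort_task(struct fake_cmd *cmd)
    {
            wait_for_completion(&cmd->done);
    }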

So far I have observed this only with LUN 0, which initially pointed me at some issue involving the virtual LUN 0. That might be a red herring, though, since LUN 0 is also hit more frequently than the others, especially during the initial discovery after a re-login like this, so it could just be a coincidence that it's LUN 0 and not another LU. Once it gets into this state, commands to the other LUs work fine, but LUN 0 times out even for fully emulated commands (TUR, READ CAPACITY, INQUIRY, etc.).

Here's what the messages look like:

Dec  1 13:30:45 localhost kernel: [940223.338746]  rport-4:0-4: blocked FC remote port time out: removing rport
Dec  1 13:30:45 localhost kernel: [940223.338751]  rport-2:0-1: blocked FC remote port time out: removing rport
Dec  1 13:30:45 localhost kernel: [940223.338757]  rport-2:0-2: blocked FC remote port time out: removing rport
Dec  1 13:30:45 localhost kernel: [940223.346097]  rport-4:0-1: blocked FC remote port time out: removing rport
Dec  1 13:30:45 localhost kernel: [940223.346144] qla2xxx [0000:07:00.1]-287d:2: FCPort state transitioned from LOST to DEAD - portid=010400.
Dec  1 13:30:45 localhost kernel: [940223.346212] qla2xxx [0000:07:00.1]-287d:2: FCPort state transitioned from LOST to DEAD - portid=010500.
Dec  1 13:30:45 localhost kernel: [940223.355124] qla2xxx [0000:08:00.1]-287d:4: FCPort state transitioned from LOST to DEAD - portid=010400.
Dec  1 13:30:45 localhost kernel: [940223.359424] qla2xxx [0000:08:00.1]-287d:4: FCPort state transitioned from LOST to DEAD - portid=010500.
Dec  1 13:30:50 localhost kernel: [940228.292849] qla2xxx [0000:07:00.1]-1820:2: **** Failed mbx[0]=4005, mb[1]=a, mb[2]=bc1c, mb[3]=0, cmd=75 ****.
Dec  1 13:30:50 localhost kernel: [940228.297732] qla2xxx [0000:07:00.1]-1820:2: **** Failed mbx[0]=4005, mb[1]=a, mb[2]=bc1c, mb[3]=0, cmd=75 ****.
Dec  1 13:30:50 localhost kernel: [940228.308853] qla2xxx [0000:08:00.1]-1820:4: **** Failed mbx[0]=4005, mb[1]=a, mb[2]=fe25, mb[3]=d000, cmd=75 ****.
Dec  1 13:30:50 localhost kernel: [940228.312826] qla2xxx [0000:08:00.1]-1820:4: **** Failed mbx[0]=4005, mb[1]=a, mb[2]=fe25, mb[3]=d000, cmd=75 ****.
# Note the roughly two-minute gap here, matching the 120 s hung-task timeout!
Dec  1 13:33:01 localhost kernel: [940359.153963] INFO: task kworker/u480:0:9358 blocked for more than 120 seconds.
Dec  1 13:33:01 localhost kernel: [940359.156093] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  1 13:33:01 localhost kernel: [940359.157164] kworker/u480:0  D ffff882f7d873680     0  9358      2 0x00000080
Dec  1 13:33:01 localhost kernel: [940359.158258] Workqueue: tmr-iblock target_tmr_work [target_core_mod]
Dec  1 13:33:01 localhost kernel: [940359.159338]  ffff881da77afc40 0000000000000046 ffff882f0f11ad80 ffff881da77affd8
Dec  1 13:33:01 localhost kernel: [940359.160441]  ffff881da77affd8 ffff881da77affd8 ffff882f0f11ad80 ffff8820bc3411c8
Dec  1 13:33:01 localhost kernel: [940359.161541]  ffff8820bc3411d0 7fffffffffffffff ffff882f0f11ad80 000000000012a370
Dec  1 13:33:01 localhost kernel: [940359.162635] Call Trace:
Dec  1 13:33:01 localhost kernel: [940359.163700]  [<ffffffff8160acf9>] schedule+0x29/0x70
Dec  1 13:33:01 localhost kernel: [940359.164768]  [<ffffffff81608c49>] schedule_timeout+0x209/0x2d0
Dec  1 13:33:01 localhost kernel: [940359.165832]  [<ffffffff81144239>] ? irq_work_queue+0x69/0xc0
Dec  1 13:33:01 localhost kernel: [940359.166892]  [<ffffffff810702c4>] ? wake_up_klogd+0x34/0x50
Dec  1 13:33:01 localhost kernel: [940359.167938]  [<ffffffff810704e8>] ? console_unlock+0x208/0x400
Dec  1 13:33:01 localhost kernel: [940359.168979]  [<ffffffff8160b1f6>] wait_for_completion+0x116/0x170
Dec  1 13:33:01 localhost kernel: [940359.170020]  [<ffffffff810a9500>] ? wake_up_state+0x20/0x20
Dec  1 13:33:01 localhost kernel: [940359.171067]  [<ffffffffa05e9cb7>] transport_wait_for_tasks+0xd7/0x200 [target_core_mod]
Dec  1 13:33:01 localhost kernel: [940359.172122]  [<ffffffffa05e6e8d>] core_tmr_abort_task+0x15d/0x230 [target_core_mod]
Dec  1 13:33:01 localhost kernel: [940359.173175]  [<ffffffffa05ea0c7>] target_tmr_work+0xb7/0xd0 [target_core_mod]
Dec  1 13:33:01 localhost kernel: [940359.174209]  [<ffffffff8108f0bb>] process_one_work+0x17b/0x470
Dec  1 13:33:01 localhost kernel: [940359.175220]  [<ffffffff8108fe8b>] worker_thread+0x11b/0x400
Dec  1 13:33:01 localhost kernel: [940359.176206]  [<ffffffff8108fd70>] ? rescuer_thread+0x400/0x400
Dec  1 13:33:01 localhost kernel: [940359.177169]  [<ffffffff8109726f>] kthread+0xcf/0xe0
Dec  1 13:33:01 localhost kernel: [940359.178104]  [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
Dec  1 13:33:01 localhost kernel: [940359.179027]  [<ffffffff81615798>] ret_from_fork+0x58/0x90
Dec  1 13:33:01 localhost kernel: [940359.179952]  [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140

It's waiting in transport_wait_for_tasks(), called from core_tmr_abort_task(), which in turn runs from a workqueue executing target_tmr_work(). If I understand correctly, these work items are enqueued via tcm_qla2xxx_handle_tmr()->target_submit_tmr() (I noticed the FIXME in there about what happens if memory allocation fails, but I don't think there's any memory pressure in my case). So I'm not sure whether the problem is on the qla2xxx/tcm_qla2xxx side not properly sending aborts down when the connection gets "reset" (not an FCP expert here, cut me some slack), or whether it's something in target_core_mod not calling complete() for these commands. In any event, it seems nobody is calling complete() for them, and I'm not sure whose responsibility that is; normally it should be something in target_core_transport.c, right? I would think that when the port goes offline, tcm_qla2xxx would want some way to purge all the pending TMR commands, failing all of them since the connection was "reset" (again, being vague here about FCP) -- something along the lines of the sketch below.
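
Here's a rough sketch of what I mean by "purging" on port loss. The names (fake_session, fake_purge_session_cmds, etc.) are hypothetical and not real tcm_qla2xxx/target_core APIs; the point is just that when the fabric driver learns the remote port is gone, every outstanding command on that session gets failed and its completion signalled, so nothing is left waiting forever:

    /*
     * Hypothetical sketch, not real driver code.  Fail and complete
     * every command still outstanding on a session once the remote
     * port has timed out and been removed.
     */
    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/completion.h>
    #include <linux/errno.h>

    struct fake_sess_cmd {                        /* hypothetical per-command state */
            struct list_head sess_entry;
            struct completion done;
            int status;
    };

    struct fake_session {                         /* hypothetical per-initiator session */
            spinlock_t lock;
            struct list_head cmd_list;
    };

    /* Called (hypothetically) when the rport times out and is removed. */
    static void fake_purge_session_cmds(struct fake_session *sess)
    {
            struct fake_sess_cmd *cmd, *tmp;
            unsigned long flags;

            spin_lock_irqsave(&sess->lock, flags);
            list_for_each_entry_safe(cmd, tmp, &sess->cmd_list, sess_entry) {
                    list_del_init(&cmd->sess_entry);
                    cmd->status = -ENODEV;        /* fail it: the port is gone */
                    complete(&cmd->done);         /* wake anyone in wait_for_completion() */
            }
            spin_unlock_irqrestore(&sess->lock, flags);
    }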

I'm running on kernel 3.10.0-229.20.1.el7.x86_64, which is "old" but is the latest RHEL7/CentOS7 kernel.

Has anyone else seen this, or does anyone have insight into what's happening? It may be fixed in a later kernel, but I'm trying to stay on supported RHEL7 kernel versions. If someone knows of a specific patch that addresses this, though, I could cherry-pick it.

Thanks,
Jamie