On 19-06-17 09:20, Pascal de Bruijn wrote:
Hi,
As I've illustrated in a previous thread:
ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag
We've been experiencing some kernel lockups while using LIO as an FC
target.
We've collected the kernel trace via serial, and this is the second
lockup where we've been able to get proper kernel messages. The trace
is very similar to the previous one, so the issue is practically
reproducible for us, though thus far not at will. This happens once or
twice a week for us, always at night:
Detected MISCOMPARE for addr: ffff91f3c55ac000 buf: ffff91f3d08dd600
Target/iblock: Send MISCOMPARE check condition and sense
ABORT_TASK: Found referenced qla2xxx task_tag: 1203652
ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1203652
ABORT_TASK: Found referenced qla2xxx task_tag: 1203700
------------[ cut here ]------------
kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319!
invalid opcode: 0000 [#1] SMP
Modules linked in: tcm_qla2xxx target_core_user uio target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
configfs raid456 async_raid6_recov async_memcpy async_pq async_xor xor
async_tx raid6_pq md_mod ipmi_devintf ses enclosure sb_edac
x86_pkg_temp_thermal intel_powerclamp coretemp qla2xxx
scsi_transport_fc mei_me mpt3sas igb mei ioatdma i2c_algo_bit dca wmi
ipmi_si ipmi_msghandler
CPU: 13 PID: 24122 Comm: kworker/u34:0 Not tainted 4.9.30-ul5 #3
Hardware name: Supermicro SYS-1028U-E1CRTP+/X10DRU-i+, BIOS 2.0b
08/09/2016
Workqueue: tmr-iblock target_tmr_work [target_core_mod]
task: ffff91fbd4e42340 task.stack: ffffaba461684000
RIP: 0010:[<ffffffffc0625ba8>] [<ffffffffc0625ba8>]
qlt_free_cmd+0x148/0x160 [qla2xxx]
RSP: 0018:ffffaba461687d30 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff91fbcbd7cd48 RCX: ffffffffc064d692
RDX: 000000000000e074 RSI: ffff91fbd34c65a0 RDI: 0000000000004000
RBP: ffffaba461687d50 R08: ffffffffc063a7d0 R09: ffff91fbcbd7cd48
R10: ffff91f3d9822bc0 R11: ffff91f3ffff9f80 R12: ffff91fbce03b680
R13: 0000000000000246 R14: 0000000000125df4 R15: ffff91fbcb5ffac0
FS: 0000000000000000(0000) GS:ffff91fbffd40000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9b9397ef60 CR3: 0000000366407000 CR4: 00000000001406e0
Stack:
ffff91fb0000048b ffff91fbcbd7cd48 ffff91fbcff9c870 0000000000000246
ffffaba461687d60 ffffffffc0a49f2f ffffaba461687d88 ffffffffc09bf417
ffff91fbcbd7cdd0 ffff91fbcbd7cd48 ffff91eee98b0000 ffffaba461687dd8
Call Trace:
[<ffffffffc0a49f2f>] tcm_qla2xxx_release_cmd+0xf/0x20 [tcm_qla2xxx]
[<ffffffffc09bf417>] target_put_sess_cmd+0xb7/0x120 [target_core_mod]
[<ffffffffc09baa8a>] core_tmr_abort_task+0x12a/0x180 [target_core_mod]
[<ffffffffc09bcfc1>] target_tmr_work+0x111/0x120 [target_core_mod]
[<ffffffff86c8fed1>] process_one_work+0x121/0x3c0
[<ffffffff86c901b3>] worker_thread+0x43/0x4d0
[<ffffffff86c90170>] ? process_one_work+0x3c0/0x3c0
[<ffffffff86c90170>] ? process_one_work+0x3c0/0x3c0
[<ffffffff86c959a4>] kthread+0xd4/0xf0
[<ffffffff86c958d0>] ? kthread_worker_fn+0x100/0x100
[<ffffffff8708f922>] ret_from_fork+0x22/0x30
Code: 99 a6 c6 0f b6 83 8c 02 00 00 e9 3f ff ff ff be 05 0d 00 00 48
c7 c7 e8 6e 64 c0 e8 53 69 65 c6 48 83 c4 08 5b 41 5c 41 5d 5d c3 <0f>
0b 0f 0b 48 8b bb 90 02 00 00 e8 58 14 71 c6 e9 2c ff ff ff
RIP [<ffffffffc0625ba8>] qlt_free_cmd+0x148/0x160 [qla2xxx]
RSP <ffffaba461687d30>
---[ end trace 82b53ac345806386 ]---
Given the trace above, it would seem this is an issue in the qla2xxx
driver's target mode code, where the following bit seems particularly
relevant:
linux-4.9.30/drivers/scsi/qla2xxx/qla_target.c
void qlt_free_cmd(struct qla_tgt_cmd *cmd)
{
struct qla_tgt_sess *sess = cmd->sess;
ql_dbg(ql_dbg_tgt, cmd->vha, 0xe074,
"%s: se_cmd[%p] ox_id %04x\n",
__func__, &cmd->se_cmd,
be16_to_cpu(cmd->atio.u.isp24.fcp_hdr.ox_id));
BUG_ON(cmd->cmd_in_wq);
Can anybody confirm this is indeed an issue with the qla2xxx driver?
Is there any more information you'd need to be able to work the issue?
Our issue seems to resemble this issue raised earlier quite well:
https://www.spinics.net/lists/target-devel/msg11721.html
Also, after some more digging through our logs, it seems the issue is
actually more consistenly reproducible than I previously thought:
Jun 11 05:17:20 stor01 kernel: [391797.675664] kernel BUG at
drivers/scsi/qla2xxx/qla_target.c:3319!
Jun 18 05:48:41 stor01 kernel: [585368.954773] kernel BUG at
drivers/scsi/qla2xxx/qla_target.c:3319!
Jun 2 23:12:58 stor02 kernel: [44741.553078] kernel BUG at
drivers/scsi/qla2xxx/qla_target.c:3319!
Jun 11 04:26:43 stor02 kernel: [389539.617330] kernel BUG at
drivers/scsi/qla2xxx/qla_target.c:3319!
Jun 18 04:23:21 stor02 kernel: [581918.471500] kernel BUG at
drivers/scsi/qla2xxx/qla_target.c:3319!
Given one exception, it always occurs Saturday night or early Sunday
morning, depending on how you want to look at it :)
So far we haven't been able to strongly correlate these times against
anything particular event yet.
Regards,
Pascal de Bruijn
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html