Hi Pascal, Nic, Sorry for delay in response. > On Jun 19, 2017, at 12:20 AM, Pascal de Bruijn <p.debruijn@xxxxxxxxxxx> wrote: > > Hi, > > As I've illustrated in a previous thread: > ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag > We've been experiencing some kernel lockups while using LIO as an FC target. > > We've collected the kernel trace via serial, and this is the second lockup where we've been able to get proper kernel messages. The trace is very similar to the previous one, so the issue is practically reproducible for us, though thus far not at will. This happens once or twice a week for us, always at night: > > Detected MISCOMPARE for addr: ffff91f3c55ac000 buf: ffff91f3d08dd600 > Target/iblock: Send MISCOMPARE check condition and sense > ABORT_TASK: Found referenced qla2xxx task_tag: 1203652 > ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1203652 > ABORT_TASK: Found referenced qla2xxx task_tag: 1203700 > ------------[ cut here ]------------ > kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319! > invalid opcode: 0000 [#1] SMP > Modules linked in: tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod configfs raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq md_mod ipmi_devintf ses enclosure sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp qla2xxx scsi_transport_fc mei_me mpt3sas igb mei ioatdma i2c_algo_bit dca wmi ipmi_si ipmi_msghandler > CPU: 13 PID: 24122 Comm: kworker/u34:0 Not tainted 4.9.30-ul5 #3 > Hardware name: Supermicro SYS-1028U-E1CRTP+/X10DRU-i+, BIOS 2.0b 08/09/2016 > Workqueue: tmr-iblock target_tmr_work [target_core_mod] > task: ffff91fbd4e42340 task.stack: ffffaba461684000 > RIP: 0010:[<ffffffffc0625ba8>] [<ffffffffc0625ba8>] qlt_free_cmd+0x148/0x160 [qla2xxx] > RSP: 0018:ffffaba461687d30 EFLAGS: 00010202 > RAX: 0000000000000000 RBX: ffff91fbcbd7cd48 RCX: ffffffffc064d692 > RDX: 000000000000e074 RSI: ffff91fbd34c65a0 RDI: 0000000000004000 > RBP: ffffaba461687d50 R08: ffffffffc063a7d0 R09: ffff91fbcbd7cd48 > R10: ffff91f3d9822bc0 R11: ffff91f3ffff9f80 R12: ffff91fbce03b680 > R13: 0000000000000246 R14: 0000000000125df4 R15: ffff91fbcb5ffac0 > FS: 0000000000000000(0000) GS:ffff91fbffd40000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f9b9397ef60 CR3: 0000000366407000 CR4: 00000000001406e0 > Stack: > ffff91fb0000048b ffff91fbcbd7cd48 ffff91fbcff9c870 0000000000000246 > ffffaba461687d60 ffffffffc0a49f2f ffffaba461687d88 ffffffffc09bf417 > ffff91fbcbd7cdd0 ffff91fbcbd7cd48 ffff91eee98b0000 ffffaba461687dd8 > Call Trace: > [<ffffffffc0a49f2f>] tcm_qla2xxx_release_cmd+0xf/0x20 [tcm_qla2xxx] > [<ffffffffc09bf417>] target_put_sess_cmd+0xb7/0x120 [target_core_mod] > [<ffffffffc09baa8a>] core_tmr_abort_task+0x12a/0x180 [target_core_mod] > [<ffffffffc09bcfc1>] target_tmr_work+0x111/0x120 [target_core_mod] > [<ffffffff86c8fed1>] process_one_work+0x121/0x3c0 > [<ffffffff86c901b3>] worker_thread+0x43/0x4d0 > [<ffffffff86c90170>] ? process_one_work+0x3c0/0x3c0 > [<ffffffff86c90170>] ? process_one_work+0x3c0/0x3c0 > [<ffffffff86c959a4>] kthread+0xd4/0xf0 > [<ffffffff86c958d0>] ? kthread_worker_fn+0x100/0x100 > [<ffffffff8708f922>] ret_from_fork+0x22/0x30 > Code: 99 a6 c6 0f b6 83 8c 02 00 00 e9 3f ff ff ff be 05 0d 00 00 48 c7 c7 e8 6e 64 c0 e8 53 69 65 c6 48 83 c4 08 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 0b 48 8b bb 90 02 00 00 e8 58 14 71 c6 e9 2c ff ff ff > RIP [<ffffffffc0625ba8>] qlt_free_cmd+0x148/0x160 [qla2xxx] > RSP <ffffaba461687d30> > ---[ end trace 82b53ac345806386 ]--- > > Given the trace above, it would seem this is an issue in the qla2xxx driver's target mode code, where the following bit seems particularly relevant: > > linux-4.9.30/drivers/scsi/qla2xxx/qla_target.c > > void qlt_free_cmd(struct qla_tgt_cmd *cmd) > { > struct qla_tgt_sess *sess = cmd->sess; > > ql_dbg(ql_dbg_tgt, cmd->vha, 0xe074, > "%s: se_cmd[%p] ox_id %04x\n", > __func__, &cmd->se_cmd, > be16_to_cpu(cmd->atio.u.isp24.fcp_hdr.ox_id)); > > BUG_ON(cmd->cmd_in_wq); > > > Can anybody confirm this is indeed an issue with the qla2xxx driver? > Is there any more information you'd need to be able to work the issue? > > Regards, > Pascal de Bruijn > > We are examining stack trace that was posted. However we’ll need help of Nicholas. Nic, This stack trace is very similar to the one we had reported last year as interaction issue between target stack and qla2xxx driver you had added patches following patches to resolve this crash. (https://www.spinics.net/lists/target-devel/msg11691.html) * | 310d3d314be7 target: Fix race with SCF_SEND_DELAYED_TAS handling * | 0f4a943168f3 target: Fix remote-port TMR ABORT + se_cmd fabric stop * | ebde1ca5a908 target: Fix TAS handling for multi-session se_node_acls * | a6d9bb1c9605 target: Fix LUN_RESET active TMR descriptor handling * | febe562c20df target: Fix LUN_RESET active I/O handling for ACK_KREF Looks like in this environment we are still hitting same issue. Does this diff makes sense to add check for TRANSPORT_COMPLETE state so that qlt_free_cmd is not called from normal command path? diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index 7a42bfd..35cecae 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -1815,7 +1815,13 @@ void __target_execute_cmd(struct se_cmd *cmd) ret = cmd->execute_cmd(cmd); if (ret) { spin_lock_irq(&cmd->t_state_lock); + cmd->t_state = TRANSPORT_COMPLETE; cmd->transport_state &= ~(CMD_T_BUSY|CMD_T_SENT); + cmd->transport_state |= CMD_T_COMPLETE; spin_unlock_irq(&cmd->t_state_lock); transport_generic_request_failure(cmd, ret); Pascal, Can you send us core dump along with modules (Target+qla2xxx) for further analysis. (I’ll send you location shortly where to upload the files) Also if you are willing to try diff posted here in your environment please let us know if it helps in resolving the issue. Thanks, - Himanshu��.n��������+%������w��{.n����j�����{ay�ʇڙ���f���h������_�(�階�ݢj"��������G����?���&��