Hi Pascal & Co, On Mon, 2017-06-26 at 15:20 +0200, Pascal de Bruijn wrote: <SNIP> > In the meanwhile I think we now know what's causing the circumstances > that lets this issue surface. > > We can reliably reproduce the issue when we run 'iozone -a -I' in a VM > that's placed on the LIO provided storage and kick off a md check: > > echo check > /sys/block/md0/md/sync_action > > This is also probably why we were experiencing these issues consistently > on the night between Saturday and Sunday, as an md check would be kicked > off from cron. > > Since we were experimenting with md tuning udev rules, particularly our > sync_speed_min change may have exacerbated things: > > SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", > TEST=="md/sync_speed_min", ATTR{md/sync_speed_min}="100000" > SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", > TEST=="md/sync_speed_max", ATTR{md/sync_speed_max}="1000000" > So it looks like the backend storage device provided to LIO is reaching sufficiently high enough latency to cause ESX ABORT_TASK timeouts to fire.. > > After staring at this, there are two scenarios during ABORT_TASK where > > this scenario could occur. > > > > Namely, there are two cases where tcm_qla2xxx.c calls > > tcm_qla2xxx_free_cmd() internally during ABORT_TASK: > > > > - During tcm_qla2xxx_handle_data_work() when cmd->aborted is detected. > > - During tcm_qla2xxx_aborted_task() when QLA_TGT_STATE_NEW or > > QLA_TGT_STATE_DATA_IN are set. > > > > My guess is the latter case is happening, that is resulting in > > tcm_qla2xxx_free_cmd() queuing up qla_tgt_cmd->work from > > transport_cmd_finish_abort() -> tcm_qla2xxx_aborted_task(), immediately > > before target_put_sess_cmd() is called from core_tmr_abort_task() to > > perform the final kref_put(&cmd->kref, ...). > > <SNIP> > > The second is the following patch to catch this scenario in progress > > during qlt_free_cmd(), and clear ->cmd_in-wq + dump the state before > > doing a cancel_work_sync(). > > > > diff --git a/drivers/scsi/qla2xxx/qla_target.h b/drivers/scsi/qla2xxx/qla_target.h > > index d644202..65752af 100644 > > --- a/drivers/scsi/qla2xxx/qla_target.h > > +++ b/drivers/scsi/qla2xxx/qla_target.h > > @@ -876,7 +876,7 @@ struct qla_tgt_cmd { > > unsigned int q_full:1; > > unsigned int term_exchg:1; > > unsigned int cmd_sent_to_fw:1; > > - unsigned int cmd_in_wq:1; > > + unsigned int cmd_in_wq; > > unsigned int aborted:1; > > unsigned int data_work:1; > > unsigned int data_work_free:1; > > diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c > > index 75aeb9f..3ceb371 100644 > > --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c > > +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c > > @@ -336,6 +336,17 @@ static void tcm_qla2xxx_release_cmd(struct se_cmd *se_cmd) > > } > > > > cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd); > > + > > + if (cmpxchg(&cmd->cmd_in_wq, 1, 0) == 1) { > > + printk("Canceling outstanding qla_tgt_cmd->cmd_in_wq = 1" > > + " state: %d data_work: %d data_work_free: %d trc_flags: 0x%08x" > > + " se_cmd->se_cmd_flags: 0x%08x se_cmd->transport_state: 0x%08x\n" > > + " se_cmd->t_task_cdb: 0x%02x\n", cmd->state, cmd->data_work, > > + cmd->data_work_free, cmd->trc_flags, cmd->se_cmd.se_cmd_flags, > > + cmd->se_cmd.transport_state, cmd->se_cmd.t_task_cdb[0]); > > + cancel_work_sync(&cmd->work); > > + } > > + > > qlt_free_cmd(cmd); > > } > > This patch does apply cleanly to v4.9.34, however it did not compile, as > data_work, data_work_free and trc_flags are not yet present. After > removing those references it did compile fine. > > Here's the relevant part of dmesg: > > [ 103.740391] md: data-check of RAID array md0 > [ 103.740395] md: minimum _guaranteed_ speed: 100000 KB/sec/disk. > [ 103.740397] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [ 103.740416] md: using 128k window, over a total of 405798912k. > [ 435.673247] ABORT_TASK: Found referenced qla2xxx task_tag: 1222180 > [ 435.927330] Canceling outstanding qla_tgt_cmd->cmd_in_wq = 1 state: 2 se_cmd->se_cmd_flags: 0x01c00109 se_cmd->transport_state: 0x00000091 se_cmd->t_task_cdb: 0x2a > [ 435.927351] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1222180 > [ 565.785280] qla2xxx/21:00:00:24:ff:08:c4:8a: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION. > [ 702.818402] ABORT_TASK: Found referenced qla2xxx task_tag: 1230964 > [ 703.090441] Canceling outstanding qla_tgt_cmd->cmd_in_wq = 1 state: 2 se_cmd->se_cmd_flags: 0x01c00109 se_cmd->transport_state: 0x00000091 se_cmd->t_task_cdb: 0x2a > [ 703.090457] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1230964 > [ 888.912452] ABORT_TASK: Found referenced qla2xxx task_tag: 1215316 > [ 889.133709] Canceling outstanding qla_tgt_cmd->cmd_in_wq = 1 state: 2 se_cmd->se_cmd_flags: 0x01c00109 se_cmd->transport_state: 0x00000091 se_cmd->t_task_cdb: 0x2a > [ 889.133713] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1215316 > [ 1741.356248] ABORT_TASK: Found referenced qla2xxx task_tag: 1222900 > [ 1741.609044] Canceling outstanding qla_tgt_cmd->cmd_in_wq = 1 state: 2 se_cmd->se_cmd_flags: 0x01c00109 se_cmd->transport_state: 0x00000091 se_cmd->t_task_cdb: 0x2a > [ 1741.609047] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1222900 > [ 2365.885579] qla2xxx/21:00:00:24:ff:01:fa:9d: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION. > [ 2425.729887] ABORT_TASK: Found referenced qla2xxx task_tag: 1242772 > [ 2425.901292] Canceling outstanding qla_tgt_cmd->cmd_in_wq = 1 state: 2 se_cmd->se_cmd_flags: 0x01c00109 se_cmd->transport_state: 0x00000091se_cmd->t_task_cdb: 0x2a > [ 2425.901296] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: > 1242772 > [ 2778.797321] md: md0: data-check done. > [ 4165.989290] qla2xxx/21:00:00:24:ff:01:fa:9d: Unsupported SCSI Opcode > 0x85, sending CHECK_CONDITION. > > Presumably every "Cancelling..." would have been a lockup? Fantastic, this is exactly the info I was looking for. :) So decoding the 'Canceling outstanding qla_tgt_cmd->cmd_in_wq = 1' bits: *) qla_tgt_cmd->state: 2 qla_tgt_cmd->state = 2 is QLA_TGT_STATE_DATA_IN, which means FCP data-out was successfully received, and the WRITE CDB has been submitted into LIO backend storage. *) se_cmd->se_cmd_flags: 0x01c00109 This bitmask means the following flags are set: SCF_SUPPORTED_SAM_OPCODE SCF_SCSI_DATA_CDB SCF_SE_LUN_CMD SCF_ACK_KREF SCF_USE_CPUID SCF_TASK_ATTR_SET Nothing out of the ordinarily here.. *) se_cmd->transport_state: 0x00000091 CMD_T_ABORTED CMD_T_SENT; CMD_T_DEV_ACTIVE; This means the WRITE command sent to the backend device, but did not complete before it was explicitly aborted by the ESX host. Based on what's in the logs above, this also looks as expected. *) se_cmd->t_task_cdb: 0x2a So this command is a WRITE_10 = 0x2a. This would indicate WRITE latency for the LIO backend storage is getting sufficiently high to trigger ESX host side ABORT_TASK timeouts, which are much more aggressive than normal Linux host I/O timeouts. So separate from the qla2xxx BUG_ON here discussed below, the high backend device latency is worth looking into, as usually once ESX host starts triggering ABORT_TASK timeouts it ends up taking the datastore offline. That said, I'm certain you're hitting the bogus tcm_qla2xxx_free_cmd() in tcm_qla2xxx_aborted_task(), which based on the initial analysis is definitely the wrong thing to do. So here's a patch against v4.12-rc that addresses the scenario you where able to trigger, and fixes the second scenario as well. As you noted above, it will take a little massaging to apply to v4.9.x due to qla_tgt_cmd->cmd->cmd_flags becoming qla_tgt_cmd->data_work and ->data_work_free. But that shouldn't be too hard. :) So please test without the initial debug patch on your v4.9.y based setup. Himanshu + Quinn, please review as after further groking of what Pascal has observed, these are two obvious bugs. >From 4db6a8145940d0bbd10265020d681961ce2d3238 Mon Sep 17 00:00:00 2001 From: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx> Date: Fri, 30 Jun 2017 00:08:13 -0700 Subject: [PATCH] qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT This patch drops two incorrect usages of tcm_qla2xxx_free_cmd() during TMR ABORT within tcm_qla2xxx_handle_data_work() and tcm_qla2xxx_aborted_task(), which where attempting to dispatch into workqueue context to do tcm_qla2xxx_complete_free() and subsequently invoke transport_generic_free_cmd(). This is incorrect because during TMR ABORT target-core will drop the outstanding se_cmd->cmd_kref references once it has quiesced the se_cmd via transport_wait_for_tasks(), and in the case of qla2xxx it should not attempt to do it's own transport_generic_free_cmd(). As reported by Pascal, this was originally manifesting as a BUG_ON(cmd->cmd_in_wq) in qlt_free_cmd() during TMR ABORT with a LIO backend with sufficently high WRITE latency to trigger a host side TMR ABORT_TASK to occur. In addition, for the case in tcm_qla2xxx_write_pending_status() and tcm_qla2xxx_handle_data_work() that waits for outstanding FCP WRITE data transfer to complete before preceedin with a TMR ABORT, avoid using se_cmd->t_transport_stop_comp that is already used by transport_wait_for_tasks() and use a qla2xxx internal struct completion instead. Reported-by: Pascal de Bruijn <p.debruijn@xxxxxxxxxxx> Cc: Himanshu Madhani <himanshu.madhani@xxxxxxxxxx> Cc: Quinn Tran <quinn.tran@xxxxxxxxxx> Cc: <stable@xxxxxxxxxxxxxxx> # 3.10+ Signed-off-by: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx> --- drivers/scsi/qla2xxx/qla_target.c | 4 ++++ drivers/scsi/qla2xxx/qla_target.h | 1 + drivers/scsi/qla2xxx/tcm_qla2xxx.c | 36 +++--------------------------------- 3 files changed, 8 insertions(+), 33 deletions(-) diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c index 401e245..8f8ece9 100644 --- a/drivers/scsi/qla2xxx/qla_target.c +++ b/drivers/scsi/qla2xxx/qla_target.c @@ -4079,6 +4079,8 @@ static struct qla_tgt_cmd *qlt_get_tag(scsi_qla_host_t *vha, cmd = &((struct qla_tgt_cmd *)se_sess->sess_cmd_map)[tag]; memset(cmd, 0, sizeof(struct qla_tgt_cmd)); + init_completion(&cmd->write_pending_abort_comp); + memcpy(&cmd->atio, atio, sizeof(*atio)); cmd->state = QLA_TGT_STATE_NEW; cmd->tgt = vha->vha_tgt.qla_tgt; @@ -5083,6 +5085,8 @@ static int __qlt_send_busy(struct scsi_qla_host *vha, qlt_incr_num_pend_cmds(vha); INIT_LIST_HEAD(&cmd->cmd_list); + init_completion(&cmd->write_pending_abort_comp); + memcpy(&cmd->atio, atio, sizeof(*atio)); cmd->tgt = vha->vha_tgt.qla_tgt; diff --git a/drivers/scsi/qla2xxx/qla_target.h b/drivers/scsi/qla2xxx/qla_target.h index d644202..939e93c5 100644 --- a/drivers/scsi/qla2xxx/qla_target.h +++ b/drivers/scsi/qla2xxx/qla_target.h @@ -868,6 +868,7 @@ struct qla_tgt_cmd { unsigned char sense_buffer[TRANSPORT_SENSE_BUFFER]; spinlock_t cmd_lock; + struct completion write_pending_abort_comp; /* to save extra sess dereferences */ unsigned int conf_compl_supported:1; unsigned int sg_mapped:1; diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c index 75aeb9f..8c1bf9b 100644 --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c @@ -415,6 +415,7 @@ static int tcm_qla2xxx_write_pending(struct se_cmd *se_cmd) static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd) { + struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd); unsigned long flags; /* * Check for WRITE_PENDING status to determine if we need to wait for @@ -424,8 +425,7 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd) if (se_cmd->t_state == TRANSPORT_WRITE_PENDING || se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) { spin_unlock_irqrestore(&se_cmd->t_state_lock, flags); - wait_for_completion_timeout(&se_cmd->t_transport_stop_comp, - 50); + wait_for_completion(&cmd->write_pending_abort_comp); return 0; } spin_unlock_irqrestore(&se_cmd->t_state_lock, flags); @@ -501,7 +501,6 @@ static int tcm_qla2xxx_handle_cmd(scsi_qla_host_t *vha, struct qla_tgt_cmd *cmd, static void tcm_qla2xxx_handle_data_work(struct work_struct *work) { struct qla_tgt_cmd *cmd = container_of(work, struct qla_tgt_cmd, work); - unsigned long flags; /* * Ensure that the complete FCP WRITE payload has been received. @@ -509,17 +508,6 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work) */ cmd->cmd_in_wq = 0; - spin_lock_irqsave(&cmd->cmd_lock, flags); - cmd->data_work = 1; - if (cmd->aborted) { - cmd->data_work_free = 1; - spin_unlock_irqrestore(&cmd->cmd_lock, flags); - - tcm_qla2xxx_free_cmd(cmd); - return; - } - spin_unlock_irqrestore(&cmd->cmd_lock, flags); - cmd->vha->tgt_counters.qla_core_ret_ctio++; if (!cmd->write_data_transferred) { /* @@ -527,7 +515,7 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work) * waiting upon completion in tcm_qla2xxx_write_pending_status() */ if (cmd->se_cmd.transport_state & CMD_T_ABORTED) { - complete(&cmd->se_cmd.t_transport_stop_comp); + complete(&cmd->write_pending_abort_comp); return; } @@ -753,31 +741,13 @@ static void tcm_qla2xxx_queue_tm_rsp(struct se_cmd *se_cmd) qlt_xmit_tm_rsp(mcmd); } -#define DATA_WORK_NOT_FREE(_cmd) (_cmd->data_work && !_cmd->data_work_free) static void tcm_qla2xxx_aborted_task(struct se_cmd *se_cmd) { struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd); - unsigned long flags; if (qlt_abort_cmd(cmd)) return; - - spin_lock_irqsave(&cmd->cmd_lock, flags); - if ((cmd->state == QLA_TGT_STATE_NEW)|| - ((cmd->state == QLA_TGT_STATE_DATA_IN) && - DATA_WORK_NOT_FREE(cmd))) { - cmd->data_work_free = 1; - spin_unlock_irqrestore(&cmd->cmd_lock, flags); - /* - * cmd has not reached fw, Use this trigger to free it. - */ - tcm_qla2xxx_free_cmd(cmd); - return; - } - spin_unlock_irqrestore(&cmd->cmd_lock, flags); - return; - } static void tcm_qla2xxx_clear_sess_lookup(struct tcm_qla2xxx_lport *, -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html