Re: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319!

Hi Pascal,

On Mon, 2017-07-10 at 20:52 -0700, Nicholas A. Bellinger wrote:
> On Mon, 2017-07-10 at 20:41 -0700, Nicholas A. Bellinger wrote:
> > On Mon, 2017-07-10 at 17:04 +0200, Pascal de Bruijn wrote:
> > > On 07-07-17 06:26, Nicholas A. Bellinger wrote:
> > > > On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote:
> > >> So ABORT_TASKs can still be observed, but they no longer result in a
> > >> non-functional, not-quite-PANICked machine.
> > > > Thanks a lot for the bug report and your continued testing to get this
> > > > resolved.  The patch is queued up in target-pending/for-next with your
> > > > Tested-by, and will be CC'd to stable so the older v4.x.y and v3.x.y
> > > > kernels get this fix as well.
> > > >
> > > > Thanks again.
> > > 
> > > I'm afraid we may not be quite there yet after all...
> > > 
> > > So we had the other two machines run an md check this weekend
> > > as well, again with a ridiculously high sync_speed_max:
> > > 
> > > Jul  9 04:00:01 myhost kernel: [661309.794774] md: data-check of RAID array md0
> > > Jul  9 04:00:01 myhost kernel: [661309.799173] md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
> > > Jul  9 04:00:01 myhost kernel: [661309.805219] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> > > Jul  9 04:00:01 myhost kernel: [661309.815194] md: using 128k window, over a total of 3252682752k.
> > > Jul  9 04:00:42 myhost kernel: [661351.076391] qla2xxx/21:00:00:24:ff:4b:8f:58: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:02:01 myhost kernel: [661429.985082] qla2xxx/21:00:00:24:ff:4b:9e:19: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:04:24 myhost kernel: [661573.395245] qla2xxx/50:01:43:80:28:ca:86:36: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:04:57 myhost kernel: [661605.837694] qla2xxx/50:01:43:80:28:ca:86:e6: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:09:19 myhost kernel: [661868.261211] qla2xxx/21:00:00:24:ff:54:9e:ab: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:13:17 myhost kernel: [662105.788459] ABORT_TASK: Found referenced qla2xxx task_tag: 1175332
> > > Jul  9 04:13:17 myhost kernel: [662105.794794] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1175332
> > > Jul  9 04:13:17 myhost kernel: [662105.990584] ABORT_TASK: Found referenced qla2xxx task_tag: 1175380
> > > Jul  9 04:13:18 myhost kernel: [662106.510403] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1175380
> > > Jul  9 04:13:20 myhost kernel: [662108.988526] ABORT_TASK: Found referenced qla2xxx task_tag: 1175620
> > > Jul  9 04:13:31 myhost kernel: [662119.684969] ABORT_TASK: Found referenced qla2xxx task_tag: 1211140
> > 
> > This means the outstanding I/O was located, but never completed back.
> > 
> > Or something else is blocked waiting for completion, such as an
> > outstanding qla2xxx WRITE transfer, before the abort can proceed.
> > 
> > > Jul  9 04:16:42 myhost kernel: [662310.617910] qla2xxx/21:00:00:24:ff:92:bf:43: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:18:00 myhost kernel: [662389.415853] qla2xxx/21:00:00:24:ff:54:a1:33: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:18:22 myhost kernel: [662411.066461] qla2xxx/21:00:00:24:ff:92:bf:59: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:20:23 myhost kernel: [662531.852833] qla2xxx/21:00:00:24:ff:3c:d0:94: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 07:00:28 myhost kernel: [672137.325166] md: md0: data-check done.
> > > 
> > 
> > Btw, were there any hung task warnings while this was happening?
> > 
> > > The machine in question was still responsive (it was accepting
> > > SSH logins); however, it seemed the VMware hosts weren't seeing
> > > the volume anymore (presumably due to the heavy IO on the backend).
> > > 
> > > Also, several (12?) kworkers seemed stuck in a D state.
> > > 
> > 
> > Next time that happens, do a 'cat /proc/$PID/stack' to see where those
> > kworkers are blocked in un-interruptible sleep.
> > 
> > > When my colleague tried to reboot the machine, it got
> > > (presumably) stuck on
> > > 
> > > /usr/bin/targetctl clear
> > > 
> > > After which it was forcefully rebooted :)
> > > 
> > > Sorry we don't have any more detailed info at this point.
> > > We haven't been able to reproduce this on a
> > > different machine yet :(
> > 
> > Ok, please try with the following debug patch to verify if it's blocked
> > on qla_tgt_cmd->write_pending_abort_comp.
> > 
> > diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> > index 8c1bf9b..1199969 100644
> > --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> > +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> > @@ -425,6 +425,7 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
> >  	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
> >  	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
> >  		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> > +		printk("Completing write_pending_abort_comp for tag: %llu\n", se_cmd->tag);
> >  		wait_for_completion(&cmd->write_pending_abort_comp);
> >  		return 0;
> >  	}
> > @@ -515,7 +516,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
> >  		 * waiting upon completion in tcm_qla2xxx_write_pending_status()
> >  		 */
> >  		if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
> > +			u64 tag = cmd->se_cmd.tag;
> > +
> > +			printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
> >  			complete(&cmd->write_pending_abort_comp);
> > +			printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
> >  			return;
> >  		}
> >  
> 
> Whoops, mixed up the printk() output above..
> 
> Please use this one instead:
> 
> diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> index 8c1bf9b..01ffb8a 100644
> --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> @@ -417,6 +417,8 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>  {
>  	struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd);
>  	unsigned long flags;
> +	u64 tag = se_cmd->tag;
> +
>  	/*
>  	 * Check for WRITE_PENDING status to determine if we need to wait for
>  	 * CTIO aborts to be posted via hardware in tcm_qla2xxx_handle_data().
> @@ -425,7 +427,9 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>  	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
>  	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
>  		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> +		printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
>  		wait_for_completion(&cmd->write_pending_abort_comp);
> +		printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
>  		return 0;
>  	}
>  	spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> @@ -508,6 +512,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>  	 */
>  	cmd->cmd_in_wq = 0;
>  
> +	if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
> +		printk("Detected CMD_T_ABORTED from handle_data_work tag: %llu %u\n",
> +			cmd->se_cmd.tag, cmd->write_data_transferred);
> +	}
> +
>  	cmd->vha->tgt_counters.qla_core_ret_ctio++;
>  	if (!cmd->write_data_transferred) {
>  		/*
> @@ -515,6 +524,8 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>  		 * waiting upon completion in tcm_qla2xxx_write_pending_status()
>  		 */
>  		if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
> +			printk("Completing write_pending_abort_comp for tag: %llu\n",
> +				cmd->se_cmd.tag);
>  			complete(&cmd->write_pending_abort_comp);
>  			return;
>  		}
> 

Btw, since you're still observing issues with the original change, it
has been reverted in target-pending/for-next for now until we get the
second issue sorted out.

I'll plan to merge an updated version post v4.13-rc1 once we get a
handle on what's going on.
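As an aside, to save walking the PIDs by hand next time, a loop along
these lines (a rough sketch; reading /proc/<pid>/stack needs root and a
kernel built with CONFIG_STACKTRACE) should dump every D-state task's
stack in one go:

```shell
#!/bin/sh
# Dump the kernel stack of every task currently in uninterruptible (D) sleep.
for pid in /proc/[0-9]*; do
    # Second field of the "State:" line in /proc/<pid>/status, e.g. "D".
    state=$(awk '/^State:/ {print $2}' "$pid/status" 2>/dev/null)
    if [ "$state" = "D" ]; then
        echo "=== $(cat "$pid/comm" 2>/dev/null) (pid ${pid#/proc/}) ==="
        cat "$pid/stack" 2>/dev/null
    fi
done
```

If sysrq is enabled, 'echo w > /proc/sysrq-trigger' gets you roughly the
same thing: it dumps all blocked (D-state) tasks to dmesg in one shot.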

Thanks again.
