On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote:
> On 30-06-17 09:22, Nicholas A. Bellinger wrote:

<SNIP>

> > Fantastic, this is exactly the info I was looking for. :)
> >
> > So decoding the 'Canceling outstanding qla_tgt_cmd->cmd_in_wq = 1' bits:
> >
> > *) qla_tgt_cmd->state: 2
> >
> > qla_tgt_cmd->state = 2 is QLA_TGT_STATE_DATA_IN, which means the FCP
> > data-out was successfully received, and the WRITE CDB has been submitted
> > into the LIO backend storage.
> >
> > *) se_cmd->se_cmd_flags: 0x01c00109
> >
> > This bitmask means the following flags are set:
> >
> > SCF_SUPPORTED_SAM_OPCODE
> > SCF_SCSI_DATA_CDB
> > SCF_SE_LUN_CMD
> > SCF_ACK_KREF
> > SCF_USE_CPUID
> > SCF_TASK_ATTR_SET
> >
> > Nothing out of the ordinary here..
> >
> > *) se_cmd->transport_state: 0x00000091
> >
> > CMD_T_ABORTED
> > CMD_T_SENT
> > CMD_T_DEV_ACTIVE
> >
> > This means the WRITE command was sent to the backend device, but did not
> > complete before it was explicitly aborted by the ESX host.
> >
> > Based on what's in the logs above, this also looks as expected.
> >
> > *) se_cmd->t_task_cdb: 0x2a
> >
> > So this command is a WRITE_10 = 0x2a.
> >
> > This would indicate that WRITE latency for the LIO backend storage is
> > getting sufficiently high to trigger ESX host side ABORT_TASK timeouts,
> > which are much more aggressive than normal Linux host I/O timeouts.
>
> Would you happen to know whether the ESX I/O timeouts depend on the disk
> type (SSD vs Non-SSD) advertised by LIO?
>
> And if so, would it be possible to ask LIO to lie about this, or would
> that be inadvisable for other reasons?
>

Unfortunately, this timeout is completely host OS dependent, and there is
nothing defined in SCSI for a target to explicitly modify it.

With ESX + iSCSI for example, the I/O timeout before a TMR ABORT_TASK gets
generated is hard-coded (eg: not user configurable in ESX 5.x+) to only 5
seconds.

Vendors who ship commercial products usually end up doing a bunch of tricks
in their backend driver to generate a SAM_STAT_BUSY (eg: a retriable status)
to avoid TMR ABORT_TASK if the I/O is taking longer than a few seconds to
complete.

IIRC with ESX + FCP, this value is configurable, but it is specific to the
individual SCSI host LLDs.  Here's the doc for qla2xxx:

http://filedownloads.qlogic.com/files/manual/69771/FC0054606-00.pdf

Himanshu + Quinn, do you recall what this tunable is called..?

> > So separate from the qla2xxx BUG_ON discussed below, the high backend
> > device latency is worth looking into, as usually once an ESX host starts
> > triggering ABORT_TASK timeouts it ends up taking the datastore offline.
>
> Presumably this is due to the md check running without any practical
> limit to sync_speed_max, and having a noop scheduler on the ssds, where
> ionice-ing the resync kernel thread might not be particularly helpful.

That would most certainly (with spinning media) be driving up backend
latency sufficiently high to trigger ESX's default I/O timeouts.

> > So please test without the initial debug patch on your v4.9.y based
> > setup.
> >
> > Himanshu + Quinn, please review, as after further grokking of what
> > Pascal has observed, these are two obvious bugs.
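Btw, for anyone who wants to decode se_cmd_flags / transport_state hex
values like the ones above from future logs themselves, a minimal userspace
sketch along the lines below will do it.  The bit values are assumed to
mirror the enum definitions in include/target/target_core_base.h on recent
v4.x kernels and are not a stable ABI, so double-check them against the
exact tree you are running; only the flags that actually showed up in this
thread are included.

/*
 * lio-decode.c -- decode se_cmd->se_cmd_flags and se_cmd->transport_state
 * hex values from LIO debug output.
 *
 * NOTE: the bit values below are assumed to match the enum definitions in
 * include/target/target_core_base.h on recent v4.x kernels; they are not a
 * stable ABI, so verify them against the tree you are actually running.
 * Only the flags seen in this thread are listed.
 *
 * Build:  gcc -Wall -o lio-decode lio-decode.c
 * Usage:  ./lio-decode 0x01c00109 0x00000091
 */
#include <stdio.h>
#include <stdlib.h>

struct flag {
	unsigned int bit;
	const char *name;
};

static const struct flag se_cmd_flags[] = {
	{ 0x00000001, "SCF_SUPPORTED_SAM_OPCODE" },
	{ 0x00000008, "SCF_SCSI_DATA_CDB" },
	{ 0x00000100, "SCF_SE_LUN_CMD" },
	{ 0x00400000, "SCF_ACK_KREF" },
	{ 0x00800000, "SCF_USE_CPUID" },
	{ 0x01000000, "SCF_TASK_ATTR_SET" },
	{ 0, NULL },
};

static const struct flag transport_state[] = {
	{ 1 << 0, "CMD_T_ABORTED" },
	{ 1 << 4, "CMD_T_SENT" },
	{ 1 << 7, "CMD_T_DEV_ACTIVE" },
	{ 0, NULL },
};

/* Print the name of every known flag set in val. */
static void decode(const char *label, unsigned long val, const struct flag *f)
{
	printf("%s = 0x%08lx\n", label, val);
	for (; f->name; f++)
		if (val & f->bit)
			printf("  %s\n", f->name);
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <se_cmd_flags> <transport_state>\n",
			argv[0]);
		return 1;
	}
	decode("se_cmd->se_cmd_flags", strtoul(argv[1], NULL, 16),
	       se_cmd_flags);
	decode("se_cmd->transport_state", strtoul(argv[2], NULL, 16),
	       transport_state);
	return 0;
}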
> >
> > From 4db6a8145940d0bbd10265020d681961ce2d3238 Mon Sep 17 00:00:00 2001
> > From: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx>
> > Date: Fri, 30 Jun 2017 00:08:13 -0700
> > Subject: [PATCH] qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR
> >  ABORT
> >

<SNIP>

> So we've had 4.9.35+1eb42f965c+4db6a81459 running on three machines for
> a few days now, all quite stable thus far.

Great.  Thanks for confirming.

> Most of the kernel messages below can be observed multiple times, I've
> only included each one once for the sake of brevity.
>
> On the first machine (connected to two ESX hosts, via two single switch
> fabrics), with md check disabled for the time being:
>
> [ 476.568636] qla2xxx/21:00:00:24:ff:23:6b:83: Unsupported SCSI Opcode
> 0x85, sending CHECK_CONDITION.

Yep, that's a normal warning for ATA_PASSTHROUGH being unsupported, coming
from ESX's smartd.

> On the second machine (connected to nine ESX hosts, via two multi switch
> fabrics), with md check disabled for the time being:
>
> [ 102.436426] qla2xxx/21:00:00:24:ff:3c:d0:95: Unsupported SCSI Opcode
> 0x85, sending CHECK_CONDITION.
> [ 438.710613] Detected MISCOMPARE for addr: ffffa0d49425d000 buf:
> ffffa0d496c63600
> [ 438.718139] Target/iblock: Send MISCOMPARE check condition and sense
> [154986.514515] TARGET_CORE[qla2xxx]: Detected NON_EXISTENT_LUN Access
> for 0x00000001

Normal COMPARE_AND_WRITE warnings when multiple ESX hosts are attempting
to get exclusive WRITE access via VAAI AtomicTestandSet for a particular
VMFS extent.

> On the third machine (connected to a single ESX host, via two multi
> switch fabrics), with md check forced many many times (purposefully with
> a practically unlimited sync_speed_max):
>
> [218406.984078] md: data-check of RAID array md0
> [218406.984082] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
> [218406.984083] md: using maximum available idle IO bandwidth (but not
> more than 1000000 KB/sec) for data-check.
> [218406.984102] md: using 128k window, over a total of 405798912k.
> [219111.385372] qla2xxx/21:00:00:24:ff:01:fa:9d: Unsupported SCSI Opcode
> 0x85, sending CHECK_CONDITION.
> [219497.980914] ABORT_TASK: Found referenced qla2xxx task_tag: 1195684
> [219497.986460] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 1195684
> [219502.807141] ABORT_TASK: Found referenced qla2xxx task_tag: 1196116
> [219503.092716] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 1196116
> [219503.092725] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag:
> 1196068
> [219508.154863] Detected MISCOMPARE for addr: ffff880f86e98000 buf:
> ffff880f84fce000
> [219508.154865] Target/iblock: Send MISCOMPARE check condition and sense
> [220140.288970] ABORT_TASK: Found referenced qla2xxx task_tag: 1243972
> [220140.520201] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 1243972
> [220161.302883] ABORT_TASK: Found referenced qla2xxx task_tag: 1245124
> [220161.505205] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 1245124
> [220176.303282] ABORT_TASK: Found referenced qla2xxx task_tag: 1245700
> [220176.462535] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 1245700
> [220528.353205] md: md0: data-check done
>

Btw, the difference between "Found referenced qla2xxx task_tag" and
"Sending TMR_FUNCTION_COMPLETE" tells you how much time passed between when
the ABORT_TASK was received by LIO and when your backend storage completed
the I/O back to LIO.
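If you want to pull those deltas out of an entire dmesg automatically, a
rough sketch like the one below works (not an official tool; it assumes the
raw [seconds.usec] timestamp format exactly as pasted above).  It pairs the
two messages per tag and prints the latency:

/*
 * abort-latency.c -- pair "ABORT_TASK: Found referenced ... task_tag: N"
 * lines with their "Sending TMR_FUNCTION_COMPLETE for ref_tag: N" lines
 * and print how long each aborted I/O took to complete back to LIO.
 *
 * NOTE: this is only a sketch; it assumes the raw [seconds.usec] dmesg
 * timestamp format exactly as pasted in this thread.
 *
 * Build:  gcc -Wall -o abort-latency abort-latency.c
 * Usage:  dmesg | ./abort-latency
 */
#include <stdio.h>

#define MAX_PENDING 256

struct pending {
	unsigned long tag;
	double ts;
};

int main(void)
{
	static struct pending pend[MAX_PENDING];
	int npend = 0;
	char line[512];

	while (fgets(line, sizeof(line), stdin)) {
		unsigned long tag;
		double ts;
		int i;

		/* Remember when each ABORT_TASK found its referenced command. */
		if (sscanf(line, "[%lf] ABORT_TASK: Found referenced "
			   "qla2xxx task_tag: %lu", &ts, &tag) == 2) {
			if (npend < MAX_PENDING) {
				pend[npend].tag = tag;
				pend[npend].ts = ts;
				npend++;
			}
			continue;
		}

		if (sscanf(line, "[%lf] ABORT_TASK: Sending "
			   "TMR_FUNCTION_COMPLETE for ref_tag: %lu",
			   &ts, &tag) != 2)
			continue;

		/* Match the completion back to its "Found referenced" entry. */
		for (i = 0; i < npend; i++) {
			if (pend[i].tag != tag)
				continue;
			printf("ref_tag %lu: %.1f ms from ABORT_TASK arrival "
			       "to backend completion\n",
			       tag, (ts - pend[i].ts) * 1000.0);
			pend[i] = pend[--npend];
			break;
		}
	}
	return 0;
}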
For example, with:

[220161.302883] ABORT_TASK: Found referenced qla2xxx task_tag: 1245124
[220161.505205] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1245124

that was ~200 milliseconds.  So the total amount of time for this particular
I/O to your MD device would be the default ESX FCP LLD host timeout (assuming
5 seconds), plus ~200 milliseconds.

> So abort_task can still be observed, but they do not result in a
> non-functional not-quite-PANICked machine anymore.
>

Thanks a lot for the bug report and your continuous testing to get this
resolved.

The patch is queued up in target-pending/for-next with your Tested-by, and
will be CC'ed to stable so the older v4.x.y and v3.x.y kernels get this fix
as well.

Thanks again.

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html