On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote:
> On 30-06-17 09:22, Nicholas A. Bellinger wrote:

<SNIP>

> > Fantastic, this is exactly the info I was looking for. :)
> >
> > So decoding the 'Canceling outstanding qla_tgt_cmd->cmd_in_wq = 1' bits:
> >
> > *) qla_tgt_cmd->state: 2
> >
> > qla_tgt_cmd->state = 2 is QLA_TGT_STATE_DATA_IN, which means the FCP
> > data-out was successfully received, and the WRITE CDB has been submitted
> > into the LIO backend storage.
> >
> > *) se_cmd->se_cmd_flags: 0x01c00109
> >
> > This bitmask means the following flags are set:
> >
> > SCF_SUPPORTED_SAM_OPCODE
> > SCF_SCSI_DATA_CDB
> > SCF_SE_LUN_CMD
> > SCF_ACK_KREF
> > SCF_USE_CPUID
> > SCF_TASK_ATTR_SET
> >
> > Nothing out of the ordinary here..
> >
> > *) se_cmd->transport_state: 0x00000091
> >
> > CMD_T_ABORTED
> > CMD_T_SENT
> > CMD_T_DEV_ACTIVE
> >
> > This means the WRITE command was sent to the backend device, but did not
> > complete before it was explicitly aborted by the ESX host.
> >
> > Based on what's in the logs above, this also looks as expected.
> >
> > *) se_cmd->t_task_cdb: 0x2a
> >
> > So this command is a WRITE_10 = 0x2a.
> >
> > This would indicate that WRITE latency for the LIO backend storage is
> > getting sufficiently high to trigger ESX host side ABORT_TASK timeouts,
> > which are much more aggressive than normal Linux host I/O timeouts.
>
> Would you happen to know whether the ESX I/O timeouts depend on the disk
> type (SSD vs Non-SSD) advertised by LIO?
>
> And if so, would it be possible to ask LIO to lie about this, or would
> that be inadvisable for other reasons?
>

Unfortunately, this timeout is completely host OS dependent, and there is
nothing defined in SCSI for a target to explicitly modify it.

With ESX + iSCSI for example, the I/O timeout before a TMR ABORT_TASK gets
generated is hard-coded (eg: not user configurable in ESX 5.x+) to only 5
seconds.

Vendors who ship commercial products usually end up doing a bunch of tricks
in their backend driver to generate a SAM_STAT_BUSY (eg: a retriable status)
to avoid TMR ABORT_TASK if the I/O is taking longer than a few seconds to
complete.

IIRC with ESX + FCP, this value is configurable, but it is specific to the
individual SCSI host LLDs.  Here's the doc for qla2xxx:

http://filedownloads.qlogic.com/files/manual/69771/FC0054606-00.pdf

Himanshu + Quinn, do you recall what this tunable is called..?

> > So separate from the qla2xxx BUG_ON discussed below, the high backend
> > device latency is worth looking into, as usually once an ESX host starts
> > triggering ABORT_TASK timeouts it ends up taking the datastore offline.
>
> Presumably this is due to the md check running without any practical
> limit to sync_speed_max, and having a noop scheduler on the ssds, where
> ionice-ing the resync kernel thread might not be particularly helpful.

That would most certainly (with spinning media) be driving up backend
latency sufficiently high to trigger ESX's default I/O timeouts.

> > So please test without the initial debug patch on your v4.9.y based
> > setup.
> >
> > Himanshu + Quinn, please review, as after further grokking of what
> > Pascal has observed, these are two obvious bugs.
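Btw, for anyone who wants to decode se_cmd_flags / transport_state hex
values like the ones above from future logs themselves, a minimal userspace
sketch along the lines below will do it.  The bit values are assumed to
mirror the enum definitions in include/target/target_core_base.h on recent
v4.x kernels and are not a stable ABI, so double-check them against the
exact tree you are running; only the flags that actually showed up in this
thread are included.

/*
 * lio-decode.c -- decode se_cmd->se_cmd_flags and se_cmd->transport_state
 * hex values from LIO debug output.
 *
 * NOTE: the bit values below are assumed to match the enum definitions in
 * include/target/target_core_base.h on recent v4.x kernels; they are not a
 * stable ABI, so verify them against the tree you are actually running.
 * Only the flags seen in this thread are listed.
 *
 * Build:  gcc -Wall -o lio-decode lio-decode.c
 * Usage:  ./lio-decode 0x01c00109 0x00000091
 */
#include <stdio.h>
#include <stdlib.h>

struct flag {
	unsigned int bit;
	const char *name;
};

static const struct flag se_cmd_flags[] = {
	{ 0x00000001, "SCF_SUPPORTED_SAM_OPCODE" },
	{ 0x00000008, "SCF_SCSI_DATA_CDB" },
	{ 0x00000100, "SCF_SE_LUN_CMD" },
	{ 0x00400000, "SCF_ACK_KREF" },
	{ 0x00800000, "SCF_USE_CPUID" },
	{ 0x01000000, "SCF_TASK_ATTR_SET" },
	{ 0, NULL },
};

static const struct flag transport_state[] = {
	{ 1 << 0, "CMD_T_ABORTED" },
	{ 1 << 4, "CMD_T_SENT" },
	{ 1 << 7, "CMD_T_DEV_ACTIVE" },
	{ 0, NULL },
};

/* Print the name of every known flag set in val. */
static void decode(const char *label, unsigned long val, const struct flag *f)
{
	printf("%s = 0x%08lx\n", label, val);
	for (; f->name; f++)
		if (val & f->bit)
			printf("  %s\n", f->name);
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <se_cmd_flags> <transport_state>\n",
			argv[0]);
		return 1;
	}
	decode("se_cmd->se_cmd_flags", strtoul(argv[1], NULL, 16),
	       se_cmd_flags);
	decode("se_cmd->transport_state", strtoul(argv[2], NULL, 16),
	       transport_state);
	return 0;
}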
> >
> > From 4db6a8145940d0bbd10265020d681961ce2d3238 Mon Sep 17 00:00:00 2001
> > From: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx>
> > Date: Fri, 30 Jun 2017 00:08:13 -0700
> > Subject: [PATCH] qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR
> >  ABORT
> >

<SNIP>

> So we've had 4.9.35+1eb42f965c+4db6a81459 running on three machines for
> a few days now, all quite stable thus far.

Great.  Thanks for confirming.

> Most of the kernel messages below can be observed multiple times, I've
> only included each one once for the sake of brevity.
>
> On the first machine (connected to two ESX hosts, via two single switch
> fabrics), with md check disabled for the time being:
>
> [ 476.568636] qla2xxx/21:00:00:24:ff:23:6b:83: Unsupported SCSI Opcode
> 0x85, sending CHECK_CONDITION.

Yep, that's a normal warning for ATA_PASSTHROUGH being unsupported, coming
from ESX's smartd.

> On the second machine (connected to nine ESX hosts, via two multi switch
> fabrics), with md check disabled for the time being:
>
> [ 102.436426] qla2xxx/21:00:00:24:ff:3c:d0:95: Unsupported SCSI Opcode
> 0x85, sending CHECK_CONDITION.
> [ 438.710613] Detected MISCOMPARE for addr: ffffa0d49425d000 buf:
> ffffa0d496c63600
> [ 438.718139] Target/iblock: Send MISCOMPARE check condition and sense
> [154986.514515] TARGET_CORE[qla2xxx]: Detected NON_EXISTENT_LUN Access
> for 0x00000001

Normal COMPARE_AND_WRITE warnings when multiple ESX hosts are attempting
to get exclusive WRITE access via VAAI AtomicTestandSet for a particular
VMFS extent.

> On the third machine (connected to a single ESX host, via two multi
> switch fabrics), with md check forced many many times (purposefully with
> a practically unlimited sync_speed_max):
>
> [218406.984078] md: data-check of RAID array md0
> [218406.984082] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
> [218406.984083] md: using maximum available idle IO bandwidth (but not
> more than 1000000 KB/sec) for data-check.
> [218406.984102] md: using 128k window, over a total of 405798912k.
> [219111.385372] qla2xxx/21:00:00:24:ff:01:fa:9d: Unsupported SCSI Opcode
> 0x85, sending CHECK_CONDITION.
> [219497.980914] ABORT_TASK: Found referenced qla2xxx task_tag: 1195684
> [219497.986460] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 1195684
> [219502.807141] ABORT_TASK: Found referenced qla2xxx task_tag: 1196116
> [219503.092716] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 1196116
> [219503.092725] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag:
> 1196068
> [219508.154863] Detected MISCOMPARE for addr: ffff880f86e98000 buf:
> ffff880f84fce000
> [219508.154865] Target/iblock: Send MISCOMPARE check condition and sense
> [220140.288970] ABORT_TASK: Found referenced qla2xxx task_tag: 1243972
> [220140.520201] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 1243972
> [220161.302883] ABORT_TASK: Found referenced qla2xxx task_tag: 1245124
> [220161.505205] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 1245124
> [220176.303282] ABORT_TASK: Found referenced qla2xxx task_tag: 1245700
> [220176.462535] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
> 1245700
> [220528.353205] md: md0: data-check done
>

Btw, the difference between "Found referenced qla2xxx task_tag" and
"Sending TMR_FUNCTION_COMPLETE" tells you how much time passed between when
the ABORT_TASK was received by LIO and when your backend storage completed
the I/O back to LIO.
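If you want to pull those deltas out of an entire dmesg automatically, a
rough sketch like the one below works (not an official tool; it assumes the
raw [seconds.usec] timestamp format exactly as pasted above).  It pairs the
two messages per tag and prints the latency:

/*
 * abort-latency.c -- pair "ABORT_TASK: Found referenced ... task_tag: N"
 * lines with their "Sending TMR_FUNCTION_COMPLETE for ref_tag: N" lines
 * and print how long each aborted I/O took to complete back to LIO.
 *
 * NOTE: this is only a sketch; it assumes the raw [seconds.usec] dmesg
 * timestamp format exactly as pasted in this thread.
 *
 * Build:  gcc -Wall -o abort-latency abort-latency.c
 * Usage:  dmesg | ./abort-latency
 */
#include <stdio.h>

#define MAX_PENDING 256

struct pending {
	unsigned long tag;
	double ts;
};

int main(void)
{
	static struct pending pend[MAX_PENDING];
	int npend = 0;
	char line[512];

	while (fgets(line, sizeof(line), stdin)) {
		unsigned long tag;
		double ts;
		int i;

		/* Remember when each ABORT_TASK found its referenced command. */
		if (sscanf(line, "[%lf] ABORT_TASK: Found referenced "
			   "qla2xxx task_tag: %lu", &ts, &tag) == 2) {
			if (npend < MAX_PENDING) {
				pend[npend].tag = tag;
				pend[npend].ts = ts;
				npend++;
			}
			continue;
		}

		if (sscanf(line, "[%lf] ABORT_TASK: Sending "
			   "TMR_FUNCTION_COMPLETE for ref_tag: %lu",
			   &ts, &tag) != 2)
			continue;

		/* Match the completion back to its "Found referenced" entry. */
		for (i = 0; i < npend; i++) {
			if (pend[i].tag != tag)
				continue;
			printf("ref_tag %lu: %.1f ms from ABORT_TASK arrival "
			       "to backend completion\n",
			       tag, (ts - pend[i].ts) * 1000.0);
			pend[i] = pend[--npend];
			break;
		}
	}
	return 0;
}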
For example, with:

[220161.302883] ABORT_TASK: Found referenced qla2xxx task_tag: 1245124
[220161.505205] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1245124

that was ~200 milliseconds.  So the total amount of time for this particular
I/O to your MD device would be the default ESX FCP LLD host timeout (assuming
5 seconds), plus ~200 milliseconds.

> So abort_task can still be observed, but they do not result in a
> non-functional not-quite-PANICked machine anymore.
>

Thanks a lot for the bug report and your continuous testing to get this
resolved.

The patch is queued up in target-pending/for-next with your Tested-by, and
will be CC'ed to stable so the older v4.x.y and v3.x.y kernels get this fix
as well.

Thanks again.

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html