Hi Martin,

On Mon, 2017-06-05 at 18:05 +0200, Martin Svec wrote:
> Hello Nic,
>
> Today, three of our vSphere VMs running on iSCSI LIO 4.9.30 failed to create a backup snapshot and
> hung with errors like "Create virtual machine snapshot xxxxx. Unable to close the
> '/vmfs/volumes/.../xxxxx-000001-ctk.vmdk' file: 5 (Input/output error)." or other more general I/O
> errors. It always happened during snapshot creation and there were multiple "Detected MISCOMPARE +
> Target/iblock: Send MISCOMPARE check condition and sense" in target log at the same time.
> Subsequently, virtual machines lost access to their virtual disks and required VM reset. The
> failures seem to be independent of each other and VMs ran on different hosts.
>

So nothing else in the target logs of interest..?

I assume the MISCOMPARE warnings occur at the normal rate..?

> The storage was upgraded to 4.9.30 only two days ago. However, we have an identical iSCSI LIO
> storage running 4.9.27 more than three weeks without any issue in the same vSphere cluster. So I'm
> wondering if this could be caused by a stable target patch between 4.9.27 and 4.9.30. Quick look
> into changelog shows "target: Fix compare_and_write_callback handling for non GOOD status" as the
> only fix related to CAW since 4.9.27. What do you think?
>
> We have ESXi 5.5.0 rev. 5230635 on all ESXi nodes.

Note the 'target: Fix compare_and_write_callback handling for non GOOD
status' change only affects COMPARE_AND_WRITE related I/Os that actually
fail.
That is, unless the underlying backend target device was actually
generating hard I/O errors (e.g. something like the following, where
'sdc' is your target backend device):

Buffer I/O error on dev sdc, logical block 0, async page read
blk_update_request: I/O error, dev sdc, sector 2097144
blk_update_request: I/O error, dev sdc, sector 2097144
Buffer I/O error on dev sdc, logical block 262143, async page read
blk_update_request: I/O error, dev sdc, sector 0
Buffer I/O error on dev sdc, logical block 0, async page read
blk_update_request: I/O error, dev sdc, sector 0

then the CAW change above in v4.9.30 won't have any effect.

If the issue is reproducible, you can verify by re-enabling the debug
message for a hard I/O error in compare_and_write_callback():

diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index ca42fba..a0de5ab 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -479,7 +479,7 @@ static sense_reason_t compare_and_write_callback(struct se_cmd *cmd, bool succes
 	 * been failed with a non-zero SCSI status.
 	 */
 	if (cmd->scsi_status) {
-		pr_debug("compare_and_write_callback: non zero scsi_status:"
+		printk_ratelimited("compare_and_write_callback: non zero scsi_status:"
 			" 0x%02x\n", cmd->scsi_status);
 		*post_ret = 1;
 		if (cmd->scsi_status == SAM_STAT_CHECK_CONDITION)

That said, if you can confirm the backend device is not generating hard
I/O errors for COMPARE_AND_WRITE I/O up to target-core, I'd wager the
ESX host failures observed aren't specific to the change.