Re: ESXi snapshot I/O error after upgrade to 4.9.30

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Thu, 08 Jun 2017 22:21:52 -0700

Hi Martin,

On Mon, 2017-06-05 at 18:05 +0200, Martin Svec wrote:
> Hello Nic,
> 
> Today, three of our vSphere VMs running on iSCSI LIO 4.9.30 failed to create a backup snapshot and
> hung with errors like "Create virtual machine snapshot xxxxx. Unable to close the 
> '/vmfs/volumes/.../xxxxx-000001-ctk.vmdk' file: 5 (Input/output error)." or other more general I/O
> errors. It always happened during snapshot creation and there were multiple "Detected MISCOMPARE +
> Target/iblock: Send MISCOMPARE check condition and sense" in target log at the same time.
> Subsequently, virtual machines lost access to their virtual disks and required VM reset. The
> failures seem to be independent of each other and VMs ran on different hosts.
> 

So nothing else in the target logs of interest..?

I assume the MISCOMPARE warnings occur at the normal rate..?

> The storage was upgraded to 4.9.30 only two days ago. However, we have an identical iSCSI LIO
> storage running 4.9.27 more than three weeks without any issue in the same vSphere cluster. So I'm
> wondering if this could be caused by a stable target patch between 4.9.27 and 4.9.30. Quick look
> into changelog shows "target: Fix compare_and_write_callback handling for non GOOD status" as the
> only fix related to CAW since 4.9.27. What do you think?
> 
> We have ESXi 5.5.0 rev. 5230635 on all ESXi nodes.

Note the 'target: Fix compare_and_write_callback handling for non GOOD
status' change only affects COMPARE_AND_WRITE related I/Os that actually
fail.

That is, unless the underlying backend target device was actually
generating hard I/O errors (e.g. something like the following, where 'sdc'
is your target backend device):

   Buffer I/O error on dev sdc, logical block 0, async page read
   blk_update_request: I/O error, dev sdc, sector 2097144
   blk_update_request: I/O error, dev sdc, sector 2097144
   Buffer I/O error on dev sdc, logical block 262143, async page read
   blk_update_request: I/O error, dev sdc, sector 0
   Buffer I/O error on dev sdc, logical block 0, async page read
   blk_update_request: I/O error, dev sdc, sector 0

then the CAW change above in v4.9.30 won't have any effect.
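
A rough way to check a saved kernel log for the kind of backend errors shown
above. This is only a sketch: the function name find_backend_io_errors and the
idea of feeding it lines from a saved dmesg dump are assumptions for
illustration, and the patterns cover only the two message formats quoted in
this mail.

```python
import re

# Patterns for the two message formats quoted above, e.g.:
#   blk_update_request: I/O error, dev sdc, sector 2097144
#   Buffer I/O error on dev sdc, logical block 0, async page read
BLK_ERR = re.compile(r"blk_update_request: I/O error, dev ([^\s,]+), sector (\d+)")
BUF_ERR = re.compile(r"Buffer I/O error on dev ([^\s,]+), logical block (\d+)")


def find_backend_io_errors(log_lines, device):
    """Return (kind, number) tuples for hard I/O errors logged against `device`.

    kind is "blk" (number is a sector) or "buffer" (number is a logical block).
    Hypothetical helper, not part of any kernel or LIO tooling.
    """
    hits = []
    for line in log_lines:
        for kind, pattern in (("blk", BLK_ERR), ("buffer", BUF_ERR)):
            match = pattern.search(line)
            if match and match.group(1) == device:
                hits.append((kind, int(match.group(2))))
    return hits


# Sample input taken from the log excerpt above.
sample_log = [
    "Buffer I/O error on dev sdc, logical block 0, async page read",
    "blk_update_request: I/O error, dev sdc, sector 2097144",
    "Buffer I/O error on dev sdc, logical block 262143, async page read",
    "some unrelated kernel message",
]

errors = find_backend_io_errors(sample_log, "sdc")
print(f"{len(errors)} hard I/O error line(s) for sdc")
```

An empty result for the backend device over the time window of the snapshot
failures would suggest the backend was not returning hard I/O errors.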

If the issue is reproducible, you can verify by re-enabling the debug
message for a hard I/O error in compare_and_write_callback():

diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index ca42fba..a0de5ab 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -479,7 +479,7 @@ static sense_reason_t compare_and_write_callback(struct se_cmd *cmd, bool succes
         * been failed with a non-zero SCSI status.
         */
        if (cmd->scsi_status) {
-               pr_debug("compare_and_write_callback: non zero scsi_status:"
+               printk_ratelimited("compare_and_write_callback: non zero scsi_status:"
                        " 0x%02x\n", cmd->scsi_status);
                *post_ret = 1;
                if (cmd->scsi_status == SAM_STAT_CHECK_CONDITION)

That said, if you can confirm the backend device is not generating hard
I/O errors for COMPARE_AND_WRITE I/O up to target-core, I'd wager the
ESX host failures observed aren't specific to the change.
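
With the debug message re-enabled, something along these lines could tally how
often compare_and_write_callback() saw a non-zero status. Again only a sketch:
count_caw_failures is a made-up helper name, and the pattern assumes the
message text is exactly the format string in the diff above.

```python
import re
from collections import Counter

# Matches the message emitted by the patched callback, e.g.:
#   compare_and_write_callback: non zero scsi_status: 0x02
CAW_MSG = re.compile(
    r"compare_and_write_callback: non zero scsi_status: 0x([0-9a-fA-F]{2})"
)

# SAM status code for CHECK CONDITION (SAM_STAT_CHECK_CONDITION in the kernel).
CHECK_CONDITION = 0x02


def count_caw_failures(log_lines):
    """Count re-enabled debug messages per SCSI status value.

    Hypothetical helper for illustration; returns a Counter keyed by the
    integer status code.
    """
    counts = Counter()
    for line in log_lines:
        match = CAW_MSG.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


# Made-up sample lines in the format the patched printk would produce.
sample_log = [
    "compare_and_write_callback: non zero scsi_status: 0x02",
    "unrelated line",
    "compare_and_write_callback: non zero scsi_status: 0x02",
    "compare_and_write_callback: non zero scsi_status: 0x08",
]

counts = count_caw_failures(sample_log)
print(f"CHECK CONDITION count: {counts[CHECK_CONDITION]}")
```

If no such messages show up around the time of the snapshot failures, the
changed code path was not being hit.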
