Re: ESX FC host connectivity issues

On Sat, 2016-03-05 at 13:26 -0500, Dan Lane wrote:
> On Sat, Mar 5, 2016 at 1:01 AM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
> > On Fri, 2016-03-04 at 01:49 -0500, Dan Lane wrote:
> >> On Sun, Feb 28, 2016 at 4:02 PM, Nicholas A. Bellinger
> >> <nab@xxxxxxxxxxxxxxx> wrote:
> >> > On Sun, 2016-02-28 at 12:55 -0800, Nicholas A. Bellinger wrote:
> >
> > <SNIP>
> >
> >> > To reiterate again from:
> >> >
> >> > https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956
> >> >
> >> > Symptoms:
> >> >
> >> > "An ESXi 5.5 Update 2 or ESXi 6.0 host loses connectivity to a VMFS5
> >> > datastore."
> >> >
> >> > "Note: These symptoms are seen in connection with the use of VAAI ATS
> >> > heartbeat with storage arrays supplied by several different vendors."
> >> >
> >> > Cause:
> >> >
> >> > "A change in the VMFS heartbeat update method was introduced in ESXi 5.5
> >> > Update 2, to help optimize the VMFS heartbeat process. Whereas the
> >> > legacy method involves plain SCSI reads and writes with the VMware ESXi
> >> > kernel handling validation, the new method offloads the validation step
> >> > to the storage system. This is similar to other VAAI-related offloads.
> >> >
> >> > This optimization results in a significant increase in the volume of ATS
> >> > commands the ESXi kernel issues to the storage system and resulting
> >> > increased load on the storage system. Under certain circumstances, VMFS
> >> > heartbeat using ATS may fail with false ATS miscompare which causes the
> >> > ESXi kernel to reverify its access to VMFS datastores. This leads to the
> >> > Lost access to datastore messages."
> >> >
> >>
> >> Nicholas: The problem isn't with the ATS "bug"; in fact, I don't have
> >> any mention of ATS anywhere in my vmkernel.log
> >>
> >
> > Yes, you are most certainly hitting ATS heartbeat timeouts.
> >
> > There is a reason why every major storage vendor says you MUST disable
> > ATS heartbeat for all VMFS5 mounts on ESX v5.5u2+, or disable
> > COMPARE_AND_WRITE altogether at the target level.
> >
> > In a perfect world it would be nice to not have to make ESX host
> > changes, but we don't live in a perfect world!
> >
> >> [root@labhost4:/tmp/scratch/log] grep ATS vmkernel.log
> >> [root@labhost4:/tmp/scratch/log] zcat vmkernel.0.gz | grep ATS
> >> [root@labhost4:/tmp/scratch/log] zcat vmkernel.1.gz | grep ATS
> >> [root@labhost4:/tmp/scratch/log] zcat vmkernel.2.gz | grep ATS
> >> [root@labhost4:/tmp/scratch/log] zcat vmkernel.3.gz | grep ATS
> >> [root@labhost4:/tmp/scratch/log] zcat vmkernel.4.gz | grep ATS
> >> [root@labhost4:/tmp/scratch/log]
> >>
> >
> > That is not how you check for ATS heartbeats.  See below.
> >
> >> Also, my friend David did disable ATS on his target server and the
> >> crash still occurred.  I just got home a couple of hours ago so I
> >> haven't had a chance to try it myself, but the above tells me that the
> >> problem is not related to ATS.  Also, during this testing I only had
> >> one ESXi host turned on, which is where the logs are from.
> >>
> >> I just restarted the target server, and with pretty much zero load on
> >> the server I got this in messages on the target server:
> >> [  275.145225] ABORT_TASK: Found referenced qla2xxx task_tag: 1184312
> >> [  275.145274] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1184312
> >> [  312.412465] ABORT_TASK: Found referenced qla2xxx task_tag: 1176128
> >> [  312.412511] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1176128
> >> [  313.413499] ABORT_TASK: Found referenced qla2xxx task_tag: 1219556
> >> [  318.729670] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1219556
> >> [  318.730244] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1194652
> >> [  318.730737] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1196720
> >> [  318.731215] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1217708
> >> [  318.731658] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1218896
> >> [  318.732111] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182024
> >> [  318.732531] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1168032
> >> [  327.528277] ABORT_TASK: Found referenced qla2xxx task_tag: 1139300
> >>
> >> See the attachment for the vmkernel.log from the exact same time
> >> period; it was too big to include here.
> >
> > ATS is using SCSI command COMPARE_AND_WRITE = 0x89:
> >
> > # cat vmkernel-snip.log | grep 0x89
> > 2016-03-04T06:29:01.597Z cpu15:33312)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x89 (0x43a185bfa5c0, 32790) to dev "naa.6001405a5ce03d8529a4341a454252a8" on path "vmhba1:C0:T1:L0" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0. Act:EVAL
> > 2016-03-04T06:29:01.597Z cpu15:33312)ScsiDeviceIO: 2645: Cmd(0x43a185bfa5c0) 0x89, CmdSN 0x21f93 from world 32790 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
> > 2016-03-04T06:29:02.597Z cpu14:33312)ScsiDeviceIO: 2645: Cmd(0x43a186506240) 0x89, CmdSN 0x21f95 from world 32790 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
> > 2016-03-04T06:29:16.711Z cpu8:33312)ScsiDeviceIO: 2645: Cmd(0x439d8034b200) 0x89, CmdSN 0x21f97 from world 32825 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
> > 2016-03-04T06:29:29.744Z cpu8:33312)ScsiDeviceIO: 2645: Cmd(0x439d8550b3c0) 0x89, CmdSN 0x21f99 from world 32826 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
> > 2016-03-04T06:29:42.775Z cpu8:33312)ScsiDeviceIO: 2645: Cmd(0x43a180333700) 0x89, CmdSN 0x21f9b from world 32827 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
> > 2016-03-04T06:29:55.789Z cpu8:33312)ScsiDeviceIO: 2645: Cmd(0x439d87290fc0) 0x89, CmdSN 0x21f9d from world 32825 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
> > 2016-03-04T06:30:08.794Z cpu8:33312)ScsiDeviceIO: 2645: Cmd(0x439d8036eb80) 0x89, CmdSN 0x21f9f from world 32826 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
> > 2016-03-04T06:30:10.229Z cpu8:33312)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x89 (0x43a186f55480, 32827) to dev "naa.6001405a5ce03d8529a4341a454252a8" on path "vmhba1:C0:T1:L0" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0. Act:EVAL
> > 2016-03-04T06:30:10.229Z cpu8:33312)ScsiDeviceIO: 2607: Cmd(0x43a186f55480) 0x89, CmdSN 0x21fa1 from world 32827 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
> > 2016-03-04T06:30:10.243Z cpu8:33312)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x89 (0x43a186f55480, 32827) to dev "naa.6001405a5ce03d8529a4341a454252a8" on path "vmhba1:C0:T1:L0" Failed: H:0xc D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0. Act:NONE
> > 2016-03-04T06:30:10.243Z cpu8:33312)ScsiDeviceIO: 2607: Cmd(0x43a186f55480) 0x89, CmdSN 0x21fa1 from world 32827 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0xc D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
> > 2016-03-04T06:30:19.854Z cpu10:33312)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x89 (0x43a186f55480, 32827) to dev "naa.6001405a5ce03d8529a4341a454252a8" on path "vmhba1:C0:T1:L0" Failed: H:0x1 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0. Act:FAILOVER
> > 2016-03-04T06:30:22.029Z cpu15:33313)WARNING: NMP: nmpCompleteRetryForPath:352: Retry cmd 0x89 (0x43a186f55480) to dev "naa.6001405a5ce03d8529a4341a454252a8" failed on path "vmhba2:C0:T1:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
> > 2016-03-04T06:30:22.029Z cpu15:33313)ScsiDeviceIO: 2645: Cmd(0x43a186f55480) 0x89, CmdSN 0x21fa1 from world 32827 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
> > 2016-03-04T06:30:35.032Z cpu15:33313)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x89 (0x439d85181d40, 32825) to dev "naa.6001405a5ce03d8529a4341a454252a8" on path "vmhba2:C0:T1:L0" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0. Act:EVAL
> > 2016-03-04T06:30:35.032Z cpu15:33313)ScsiDeviceIO: 2645: Cmd(0x439d85181d40) 0x89, CmdSN 0x21fa3 from world 32825 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
> > 2016-03-04T06:30:48.237Z cpu15:33313)ScsiDeviceIO: 2645: Cmd(0x439d855c33c0) 0x89, CmdSN 0x21fa5 from world 32826 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
> > 2016-03-04T06:31:01.442Z cpu15:33313)ScsiDeviceIO: 2645: Cmd(0x43a186a693c0) 0x89, CmdSN 0x21fa7 from world 32827 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
> > 2016-03-04T06:31:02.468Z cpu15:33313)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x89 (0x439d85181d40, 32825) to dev "naa.6001405a5ce03d8529a4341a454252a8" on path "vmhba2:C0:T1:L0" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0. Act:EVAL
> > 2016-03-04T06:31:02.468Z cpu15:33313)ScsiDeviceIO: 2607: Cmd(0x439d85181d40) 0x89, CmdSN 0x21fa9 from world 32825 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
> > 2016-03-04T06:31:02.483Z cpu15:33313)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x89 (0x439d85181d40, 32825) to dev "naa.6001405a5ce03d8529a4341a454252a8" on path "vmhba2:C0:T1:L0" Failed: H:0xc D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0. Act:NONE
> > 2016-03-04T06:31:02.483Z cpu15:33313)ScsiDeviceIO: 2607: Cmd(0x439d85181d40) 0x89, CmdSN 0x21fa9 from world 32825 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0xc D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
> > 2016-03-04T06:31:14.444Z cpu6:33061)ScsiDeviceIO: 2645: Cmd(0x439d85181d40) 0x89, CmdSN 0x21fa9 from world 32825 to dev "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
> >
> > And yes, the ATS heartbeat timeout corresponds to internal
> > COMPARE_AND_WRITE failures:
> >
> > # cat vmkernel-snip.log | grep " HB"
> > 2016-03-04T06:27:59.950Z cpu11:32827)HBX: 276: 'dracofiler-lun0': HB at offset 3956736 - Reclaimed heartbeat [Timeout]:
> > 2016-03-04T06:30:21.781Z cpu2:33924)HBX: 2801: 'dracofiler-lun0': HB at offset 3956736 - Waiting for timed out HB:
> > 2016-03-04T06:30:22.029Z cpu2:33924)HBX: 2801: 'dracofiler-lun0': HB at offset 3956736 - Waiting for timed out HB:
> > 2016-03-04T06:30:35.234Z cpu2:33924)HBX: 2801: 'dracofiler-lun0': HB at offset 3956736 - Waiting for timed out HB:
> > 2016-03-04T06:30:48.439Z cpu2:33924)HBX: 2801: 'dracofiler-lun0': HB at offset 3956736 - Waiting for timed out HB:
> > 2016-03-04T06:31:01.442Z cpu2:33924)HBX: 2801: 'dracofiler-lun0': HB at offset 3956736 - Waiting for timed out HB:
> > 2016-03-04T06:31:14.444Z cpu2:33924)HBX: 2801: 'dracofiler-lun0': HB at offset 3956736 - Waiting for timed out HB:
> >
> > The sense ASC codes above for the failed COMPARE_AND_WRITE = 0x89 commands are:
> >
> >    0x20 = /* INVALID COMMAND OPERATION CODE */
> >    0x24 = /* INVALID FIELD IN CDB */
> >
> > which confirms ATS commands are being internally aborted on your ESX
> > v5.5u2+ host side config, resulting in repeated ATS heartbeat timeouts
> > until ESX takes the device offline.
> >
> >>
> >> This is the point where I can no longer control the service on the
> >> target: running "service target stop" results in the aforementioned
> >> hung task.  Here is the output of "cat /proc/$PID/stack" for that
> >> hung task:
> >> [root@dracofiler ~]# cat /proc/1911/stack
> >> [<ffffffffa053c0ee>] tcm_qla2xxx_tpg_enable_store+0xde/0x1a0 [tcm_qla2xxx]
> >> [<ffffffff812b8b7a>] configfs_write_file+0x9a/0x100
> >> [<ffffffff81234967>] __vfs_write+0x37/0x120
> >> [<ffffffff81235289>] vfs_write+0xa9/0x1a0
> >> [<ffffffff812361b5>] SyS_write+0x55/0xc0
> >> [<ffffffff817aa56e>] entry_SYSCALL_64_fastpath+0x12/0x71
> >> [<ffffffffffffffff>] 0xffffffffffffffff
> >>
> >> It's been almost a week since I worked on this, so please forgive me
> >> if I missed one of your suggestions for something to try or a request
> >> for information.  Just let me know what it is and I'll do it.
> >
> > To reiterate, the hung task shutdown bug you are triggering with Fedora
> > -fb is separate from the ESX ATS heartbeat issues detailed above.
> >
> > So first things first for ESX.  Let's finally see how ATS w/o ATS
> > heartbeat works for your VMFS5 mounts to avoid generating constant ATS
> > heartbeat timeouts + ABORT_TASKs.
> >
> > Otherwise you need to explicitly set emulate_caw=0 to enforce pre-ATS
> > operation for all backend devices in order to get a working ESX v5.5u2+
> > environment, if you can't change ESX host configs.
> >
> 
> SUCCESS!!!
> 
> It's limited success, and I don't think we're totally out of the woods
> yet, but you were right about the ATS stuff: disabling it has made my
> storage stable.
> 

To confirm, did you verify this with option #1 (host side ATS heartbeat
disabled), or option #2 (target side emulate_caw=0)..?
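
For reference, a minimal sketch of both options.  The VMFS setting is the
one from the VMware KB cited above; the configfs backstore path below
(iblock_0/lun0) is only a placeholder for whatever your target layout
actually uses:

Option #1, per ESX host (disable the ATS-only VMFS5 heartbeat):

  # esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5
  # esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5

Option #2, per target backend device (disable COMPARE_AND_WRITE emulation):

  # echo 0 > /sys/kernel/config/target/core/iblock_0/lun0/attrib/emulate_caw

or the equivalent "set attribute emulate_caw=0" on the backstore object
in targetcli.  Keep in mind option #2 drops ATS support for that LUN
entirely, so ESX falls back to pre-ATS locking as described earlier.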

> I think the critical thing for me was understanding that the ATS bug
> and the hung task shutdown bug are two separate problems.  In the past
> the two seemed directly linked, since the service only seemed to stop
> responding to control commands after the host had dropped the path, but
> that might just be a coincidence.
> 
> I do still have some errors that I would like to discuss:
> 
> In the target server messages log:
> Mar  5 12:45:11 dracofiler kernel: ABORT_TASK: Found referenced
> qla2xxx task_tag: 1194916
> Mar  5 12:45:11 dracofiler kernel: ABORT_TASK: Sending
> TMR_TASK_DOES_NOT_EXIST for ref_tag: 1194916
> (I get a group of 3-5 of these errors about every 5 minutes)
> 

As we verified earlier, your qla2xxx FC stats counters look as expected
on the target side.  Have you looked at the ESX host side FC stats
counters yet..?

If those look OK too, then I'd recommend following the previous
discussion: reduce the ESX host side qla2xxx LLD queue depth and/or
increase I/O timeouts, and see if that has any effect on avoiding these
constant ABORT_TASKs.
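
If it helps, here is a rough sketch of the host side knobs I have in
mind.  Module and parameter names differ between the legacy qla2xxx and
the native qlnativefc ESX drivers, so double check them against your
build.  To lower the LLD queue depth (example value, takes effect after
a host reboot):

  # esxcli system module parameters set -m qlnativefc -p "ql2xmaxqdepth=32"
  # esxcli system module parameters list -m qlnativefc | grep ql2xmaxqdepth

and to lower the number of outstanding I/Os ESX schedules per device:

  # esxcli storage core device set -d naa.6001405a5ce03d8529a4341a454252a8 -O 32

Guest I/O timeouts are normally raised inside the VM itself (e.g. the
/sys/block/sdX/device/timeout value on Linux guests) rather than on the
ESX host.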

> in the ESXi vmkernel.log:
> 2016-03-05T17:44:30.838Z cpu9:33312)WARNING: NMP:
> nmp_DeviceRequestFastDeviceProbe:237: NMP device
> "naa.6001405a5ce03d8529a4341a454252a8" state in doubt; requested fast
> path state update...
> 2016-03-05T17:44:30.838Z cpu9:33312)ScsiDeviceIO: 2645:
> Cmd(0x439d8746d7c0) 0x12, CmdSN 0x29181 from world 0 to dev
> "naa.6001405a5ce03d8529a4341a454252a8" failed H:0x8 D:0x0 P:0x0
> Possible sense data: 0x5 0x24 0x0.
> 2016-03-05T17:44:51.278Z cpu9:33312)NMP:
> nmp_ThrottleLogForDevice:3178: Cmd 0x12 (0x439d8746d7c0, 0) to dev
> "naa.6001405a5ce03d8529a4341a454252a8" on path "vmhba1:C0:T1:L0"
> Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0. Act:EVAL
> (These errors seem to come at about the same interval as the ABORT_TASK errors)
> 2016-03-05T16:15:55.803Z cpu15:33057)NMP:
> nmp_ThrottleLogForDevice:3178: Cmd 0x9e (0x43a180365300, 0) to dev
> "naa.500000e015072570" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2
> P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
> (these are much less common than the 0x12 command failures)
> 
> Most importantly of course is that the storage is NOT becoming
> unavailable during these events!
> 

This indicates the number of internally failed SCSI commands for ESX
is now lower without all the extra COMPARE_AND_WRITE commands for ATS
heartbeat, and below ESX's internal threshold for retries before taking
the device offline.

Can the device still be taken offline (eventually) due to constant
ABORT_TASKs, even with ATS heartbeat disabled or COMPARE_AND_WRITE
disabled entirely..?  Most certainly, if ESX's SCSI I/O retry limit is
reached for the same command.
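
As a rough gauge of how hard that retry path is being exercised, you can
tally failed commands by opcode and host status straight out of
vmkernel.log.  A quick sketch, assuming the ScsiDeviceIO log format
shown above:

  # grep 'ScsiDeviceIO.*failed H:' vmkernel.log | \
      sed 's/.*Cmd(0x[0-9a-f]*) \(0x[0-9a-f]*\),.*failed \(H:0x[0-9a-f]*\).*/\1 \2/' | \
      sort | uniq -c | sort -rn

That prints a count per (opcode, host status) pair, so you can watch
which opcodes and host statuses dominate over time.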

> If I understand correctly, these map to INQUIRY and SERVICE ACTION
> IN(16).  I don't know if these are critical or not, but seeing the
> "state in doubt; requested fast path state update" message concerns
> me.
> 

INQUIRY is for querying LUN metadata, and SERVICE_ACTION_IN(16) is most
likely a SAI_READ_CAPACITY_16 query.

Neither of these commands is affected by backend I/O performance from
the target's perspective, which further points to some manner of ESX
host side SCSI issue to debug.
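
If you happen to have a plain Linux initiator zoned to the same LUN, you
can double check that cheaply with sg3_utils, which issues exactly those
two commands (the /dev/sdX node below is a placeholder):

  # sg_inq /dev/sdX          (INQUIRY, opcode 0x12)
  # sg_readcap --16 /dev/sdX (READ CAPACITY(16) via SERVICE ACTION IN(16), 0x9e)

Both should return immediately regardless of backend load, which would
point the finger further at the ESX host side.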



