On Sun, Feb 28, 2016 at 4:02 PM, Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx> wrote:
> On Sun, 2016-02-28 at 12:55 -0800, Nicholas A. Bellinger wrote:
>> On Sun, 2016-02-28 at 14:13 -0500, Dan Lane wrote:
>
> <SNIP>
>
>> > Unfortunately I'm about to leave town for a few weeks, so I have very
>> > little time to look at this. That said, let's talk about this... I
>> > built the latest kernel last night using linux-next as well as the
>> > Torvalds git tree. Here are the commands I used (in case you see any
>> > problems):
>> >
>> > git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> > cd linux
>> > git remote add linux-next git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>> > git fetch linux-next
>> > git fetch --tags linux-next
>> > cp /boot/config-4.3.4-300.fc23.x86_64 .config
>> > make oldconfig
>> > make -j8 bzImage; make -j8 modules; make -j8 modules_install; make -j8 install
>> >
>> > This resulted in a functioning 4.5-rc5+ kernel. A matter of hours
>> > later, the storage once again disappeared from my ESXi hosts. I
>> > understand there may be things I need to tweak on my hosts, but should
>> > those things cause LIO to stop responding on the target server?
>> > It's back to acting the same exact way as before (with
>> > target-pending/4.4-stable from a month ago): I can't stop the service
>> > or kill the process.
>> >
>> > # uname -a
>> > Linux dracofiler.home.lan 4.5.0-rc5+ #2 SMP Sat Feb 27 15:22:25 EST
>> > 2016 x86_64 x86_64 x86_64 GNU/Linux
>> >
>>
>> You don't need to keep updating the kernel.
>>
>> As per the reply to David, you'll need to either explicitly disable the
>> ESX side ATS heartbeat to avoid this well-known ESX bug that affects
>> every target with VAAI, as per VMware's own KB, or set emulate_caw=0 to
>> disable AtomicTestandSet altogether:
>>
>> http://permalink.gmane.org/gmane.linux.scsi.target.devel/11574
>>
>> To repeat, there are no target side changes that avoid this well-known
>> ESX 5.5u2+ host bug.
>>
>> You need to either disable the ATS heartbeat on the ESX 5.5u2+ host
>> side, or disable COMPARE_AND_WRITE altogether.
>>
>
> To reiterate again from:
>
> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956
>
> Symptoms:
>
> "An ESXi 5.5 Update 2 or ESXi 6.0 host loses connectivity to a VMFS5
> datastore."
>
> "Note: These symptoms are seen in connection with the use of VAAI ATS
> heartbeat with storage arrays supplied by several different vendors."
>
> Cause:
>
> "A change in the VMFS heartbeat update method was introduced in ESXi 5.5
> Update 2, to help optimize the VMFS heartbeat process. Whereas the
> legacy method involves plain SCSI reads and writes with the VMware ESXi
> kernel handling validation, the new method offloads the validation step
> to the storage system. This is similar to other VAAI-related offloads.
>
> This optimization results in a significant increase in the volume of ATS
> commands the ESXi kernel issues to the storage system and resulting
> increased load on the storage system. Under certain circumstances, VMFS
> heartbeat using ATS may fail with false ATS miscompare which causes the
> ESXi kernel to reverify its access to VMFS datastores. This leads to the
> 'Lost access to datastore' messages."
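For reference, as I understand it, the two workarounds described above
amount to roughly the following commands. This is a sketch only: the
esxcli option name is taken from the VMware KB, and the configfs path
assumes an iblock backstore named "block0" under HBA "iblock_0", which
are placeholder names; substitute the actual backstore on the target.

# On the ESXi 5.5u2+/6.0 host: disable the VAAI ATS heartbeat (per KB 2113956)
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

# Confirm the new value (Int Value should now be 0)
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5

# Alternatively, on the LIO target: disable COMPARE_AND_WRITE emulation
# for the backstore ("iblock_0/block0" is an example name)
echo 0 > /sys/kernel/config/target/core/iblock_0/block0/attrib/emulate_caw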
Nicholas:

The problem isn't the ATS "bug"; in fact, there is no mention of ATS
anywhere in my vmkernel.log:

[root@labhost4:/tmp/scratch/log] grep ATS vmkernel.log
[root@labhost4:/tmp/scratch/log] zcat vmkernel.0.gz | grep ATS
[root@labhost4:/tmp/scratch/log] zcat vmkernel.1.gz | grep ATS
[root@labhost4:/tmp/scratch/log] zcat vmkernel.2.gz | grep ATS
[root@labhost4:/tmp/scratch/log] zcat vmkernel.3.gz | grep ATS
[root@labhost4:/tmp/scratch/log] zcat vmkernel.4.gz | grep ATS
[root@labhost4:/tmp/scratch/log]

Also, my friend David did disable ATS on his target server and the crash
still occurred. I only got home a couple of hours ago, so I haven't had a
chance to test further, but the output above tells me the problem is not
related to ATS. Note that during this testing I had only one ESXi host
powered on, which is the host these logs are from.

I just restarted the target server, and with essentially zero load on it
I got this in /var/log/messages on the target:

[  275.145225] ABORT_TASK: Found referenced qla2xxx task_tag: 1184312
[  275.145274] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1184312
[  312.412465] ABORT_TASK: Found referenced qla2xxx task_tag: 1176128
[  312.412511] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1176128
[  313.413499] ABORT_TASK: Found referenced qla2xxx task_tag: 1219556
[  318.729670] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1219556
[  318.730244] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1194652
[  318.730737] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1196720
[  318.731215] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1217708
[  318.731658] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1218896
[  318.732111] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182024
[  318.732531] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1168032
[  327.528277] ABORT_TASK: Found referenced qla2xxx task_tag: 1139300

See the attachment for the vmkernel.log from the same time period; it was
too big to include inline. This is the point where I can no longer control
the target service: running "service target stop" results in the
aforementioned hung task. Here is the output of "cat /proc/$PID/stack" for
the hung process after I try to stop the service:

[root@dracofiler ~]# cat /proc/1911/stack
[<ffffffffa053c0ee>] tcm_qla2xxx_tpg_enable_store+0xde/0x1a0 [tcm_qla2xxx]
[<ffffffff812b8b7a>] configfs_write_file+0x9a/0x100
[<ffffffff81234967>] __vfs_write+0x37/0x120
[<ffffffff81235289>] vfs_write+0xa9/0x1a0
[<ffffffff812361b5>] SyS_write+0x55/0xc0
[<ffffffff817aa56e>] entry_SYSCALL_64_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

It's been almost a week since I worked on this, so please forgive me if I
missed one of your suggestions or requests for information. Just let me
know what you need and I'll do it.

Thanks,
Dan
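For completeness, one way to capture the state of every blocked task on
the target the next time the hang occurs (a minimal sketch; it assumes
CONFIG_MAGIC_SYSRQ is built into the kernel and requires root):

# Make sure the sysrq interface is enabled
echo 1 > /proc/sys/kernel/sysrq

# Dump stack traces of all uninterruptible (D-state) tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# List any other processes stuck in D state, with the kernel function
# they are sleeping in
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'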
Attachment:
vmkernel-snip.log
Description: Binary data
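As a cross-check from the ESXi side, the VAAI/ATS status of the backing
device can also be queried directly rather than grepping vmkernel.log.
This is a sketch: the device identifier and datastore name below are
placeholders for the actual values on the host.

# Show which VAAI primitives (including ATS) the device reports; get the
# naa identifier from "esxcli storage core device list"
esxcli storage core device vaai status get -d naa.<device-id>

# Show whether the VMFS volume is in ATS-only locking mode
vmkfstools -Ph /vmfs/volumes/<datastore-name> | grep -i mode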