Re: ESX FC host connectivity issues

On Fri, Mar 4, 2016 at 1:49 AM, Dan Lane <dracodan@xxxxxxxxx> wrote:
> On Sun, Feb 28, 2016 at 4:02 PM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
>> On Sun, 2016-02-28 at 12:55 -0800, Nicholas A. Bellinger wrote:
>>> On Sun, 2016-02-28 at 14:13 -0500, Dan Lane wrote:
>>
>> <SNIP>
>>
>>> > Unfortunately I'm about to leave town for a few weeks, so I have very
>>> > little time to look at this.  That said, let's talk about it... Last
>>> > night I built the latest kernel from Torvalds' tree as well as
>>> > linux-next.  Here are the commands I used (in case you see any
>>> > problems):
>>> >
>>> > git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>> > cd linux
>>> > git remote add linux-next git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>>> > git fetch linux-next
>>> > git fetch --tags linux-next
>>> > cp /boot/config-4.3.4-300.fc23.x86_64 .config
>>> > make oldconfig
>>> > make -j8 bzImage; make -j8 modules; make -j8 modules_install; make -j8 install
>>> >
>>> > This resulted in a functioning 4.5-rc5+ kernel.  A matter of hours
>>> > later the storage once again disappeared from my ESXi hosts.  I
>>> > understand there may be things I need to tweak on my hosts, but should
>>> > those things cause LIO to stop responding on the target server?
>>> > It's back to acting the exact same way as before (with
>>> > target-pending/4.4-stable from a month ago): I can't stop the service
>>> > or kill the process.
>>> >
>>> > # uname -a
>>> > Linux dracofiler.home.lan 4.5.0-rc5+ #2 SMP Sat Feb 27 15:22:25 EST
>>> > 2016 x86_64 x86_64 x86_64 GNU/Linux
>>> >
>>>
>>> You don't need to keep updating the kernel.
>>>
>>> As per the reply to David, you'll need to either explicitly disable the
>>> ESX-side ATS heartbeat to avoid this well-known ESX bug that affects
>>> every target with VAAI, as per VMware's own KB, or set emulate_caw=0 to
>>> disable AtomicTestAndSet altogether:
>>>
>>> http://permalink.gmane.org/gmane.linux.scsi.target.devel/11574
>>>
>>> To repeat, there are no target-side changes to avoid this well-known
>>> ESX 5.5u2+ host bug.
>>>
>>> You need to either disable the ATS heartbeat on the ESX 5.5u2+ host
>>> side, or disable COMPARE_AND_WRITE altogether.
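>>>
>>> As a rough sketch (the backstore name below is a placeholder, so
>>> adjust it for your setup), the host-side workaround from the KB is:
>>>
>>>   # on each ESXi 5.5u2+ host: revert VMFS heartbeating from ATS
>>>   # back to plain SCSI reads/writes
>>>   esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5
>>>
>>> and the target-side alternative is to clear emulate_caw on each
>>> backstore, e.g. directly via configfs:
>>>
>>>   # disable COMPARE_AND_WRITE emulation for one iblock backstore
>>>   echo 0 > /sys/kernel/config/target/core/iblock_0/<backstore>/attrib/emulate_caw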
>>>
>>
>> To reiterate, from:
>>
>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956
>>
>> Symptoms:
>>
>> "An ESXi 5.5 Update 2 or ESXi 6.0 host loses connectivity to a VMFS5
>> datastore."
>>
>> "Note: These symptoms are seen in connection with the use of VAAI ATS
>> heartbeat with storage arrays supplied by several different vendors."
>>
>> Cause:
>>
>> "A change in the VMFS heartbeat update method was introduced in ESXi 5.5
>> Update 2, to help optimize the VMFS heartbeat process. Whereas the
>> legacy method involves plain SCSI reads and writes with the VMware ESXi
>> kernel handling validation, the new method offloads the validation step
>> to the storage system. This is similar to other VAAI-related offloads.
>>
>> This optimization results in a significant increase in the volume of ATS
>> commands the ESXi kernel issues to the storage system and resulting
>> increased load on the storage system. Under certain circumstances, VMFS
>> heartbeat using ATS may fail with false ATS miscompare which causes the
>> ESXi kernel to reverify its access to VMFS datastores. This leads to the
>> Lost access to datastore messages."
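>>
>> If it's not clear whether a host is still using ATS for heartbeating,
>> something like this on the ESXi shell should show the current setting
>> (an Int Value of 1 means ATS, 0 means plain reads/writes):
>>
>>   esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5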
>>
>
> Nicholas: The problem isn't the ATS "bug"; in fact, I don't have any
> mention of ATS anywhere in my vmkernel.log:
>
> [root@labhost4:/tmp/scratch/log] grep ATS vmkernel.log
> [root@labhost4:/tmp/scratch/log] zcat vmkernel.0.gz | grep ATS
> [root@labhost4:/tmp/scratch/log] zcat vmkernel.1.gz | grep ATS
> [root@labhost4:/tmp/scratch/log] zcat vmkernel.2.gz | grep ATS
> [root@labhost4:/tmp/scratch/log] zcat vmkernel.3.gz | grep ATS
> [root@labhost4:/tmp/scratch/log] zcat vmkernel.4.gz | grep ATS
> [root@labhost4:/tmp/scratch/log]
>
> Also, my friend David did disable ATS on his target server and the
> crash still occurred.  I just got home a couple of hours ago, so I
> haven't had a chance to try that myself, but the above tells me that
> the problem is not related to ATS.  Note that during this testing only
> one ESXi host was powered on, which is where the logs are from.
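>
> For what it's worth, per-device ATS/VAAI support can also be queried
> from the ESXi shell (the naa identifier is a placeholder for one of my
> LIO-backed LUNs):
>
>   esxcli storage core device vaai status get -d naa.<device-id>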
>
> I just restarted the target server, and with practically zero load I
> got this in /var/log/messages on the target server:
> [  275.145225] ABORT_TASK: Found referenced qla2xxx task_tag: 1184312
> [  275.145274] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1184312
> [  312.412465] ABORT_TASK: Found referenced qla2xxx task_tag: 1176128
> [  312.412511] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1176128
> [  313.413499] ABORT_TASK: Found referenced qla2xxx task_tag: 1219556
> [  318.729670] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1219556
> [  318.730244] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1194652
> [  318.730737] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1196720
> [  318.731215] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1217708
> [  318.731658] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1218896
> [  318.732111] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182024
> [  318.732531] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1168032
> [  327.528277] ABORT_TASK: Found referenced qla2xxx task_tag: 1139300
>
> See the attachment for the vmkernel.log covering the exact same time
> period; it was too big to include inline.
>
> This is the point where I can no longer control the service on the
> target: running "service target stop" results in the aforementioned
> hung task.  Here is the output of "cat /proc/$PID/stack" for that
> hung task:
> [root@dracofiler ~]# cat /proc/1911/stack
> [<ffffffffa053c0ee>] tcm_qla2xxx_tpg_enable_store+0xde/0x1a0 [tcm_qla2xxx]
> [<ffffffff812b8b7a>] configfs_write_file+0x9a/0x100
> [<ffffffff81234967>] __vfs_write+0x37/0x120
> [<ffffffff81235289>] vfs_write+0xa9/0x1a0
> [<ffffffff812361b5>] SyS_write+0x55/0xc0
> [<ffffffff817aa56e>] entry_SYSCALL_64_fastpath+0x12/0x71
> [<ffffffffffffffff>] 0xffffffffffffffff
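>
> If it would help, next time it wedges I can also dump the stacks of
> all blocked tasks, along the lines of:
>
>   # assumes sysrq is enabled (echo 1 > /proc/sys/kernel/sysrq)
>   echo w > /proc/sysrq-trigger
>   dmesg | tail -n 200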
>
> It's been almost a week since I worked on this, so please forgive me
> if I missed one of your suggestions for something to try or a request
> for information.  Just let me know what it is and I'll do it.
>
> Thanks
> Dan

Nicholas,
    Have you had a chance to look into this yet?  I don't mean to rush
you, but my time at home between jobs is very short and I really want
to find an answer to this!

Thanks


