On Fri, Mar 4, 2016 at 1:49 AM, Dan Lane <dracodan@xxxxxxxxx> wrote:
> On Sun, Feb 28, 2016 at 4:02 PM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
>> On Sun, 2016-02-28 at 12:55 -0800, Nicholas A. Bellinger wrote:
>>> On Sun, 2016-02-28 at 14:13 -0500, Dan Lane wrote:
>>
>> <SNIP>
>>
>>> > Unfortunately I'm about to leave town for a few weeks, so I have
>>> > very little time to look at this. That said, let's talk about
>>> > this... I built the latest kernel last night from linux-next as
>>> > well as the torvalds tree. Here are the commands I used (in case
>>> > you see any problems):
>>> >
>>> > git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>> > cd linux
>>> > git remote add linux-next git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>>> > git fetch linux-next
>>> > git fetch --tags linux-next
>>> > cp /boot/config-4.3.4-300.fc23.x86_64 .config
>>> > make oldconfig
>>> > make -j8 bzImage; make -j8 modules; make -j8 modules_install; make -j8 install
>>> >
>>> > This resulted in a functioning 4.5.0-rc5+ kernel. A matter of hours
>>> > later, the storage once again disappeared from my ESXi hosts. I
>>> > understand there may be things I need to tweak on my hosts, but
>>> > should those things cause LIO to stop responding on the target
>>> > server? It's back to acting exactly the same way as before (with
>>> > target-pending/4.4-stable from a month ago); I can't stop the
>>> > service or kill the process.
>>> >
>>> > # uname -a
>>> > Linux dracofiler.home.lan 4.5.0-rc5+ #2 SMP Sat Feb 27 15:22:25 EST
>>> > 2016 x86_64 x86_64 x86_64 GNU/Linux
>>> >
>>>
>>> You don't need to keep updating the kernel.
>>>
>>> As per the reply to David, you'll need to either explicitly disable
>>> the ESX-side ATS heartbeat to avoid this well-known ESX bug that
>>> affects every target with VAAI, as per VMware's own KB, or set
>>> emulate_caw=0 to disable Atomic Test and Set altogether:
>>>
>>> http://permalink.gmane.org/gmane.linux.scsi.target.devel/11574
>>>
>>> To repeat: there are no target-side changes that avoid this
>>> well-known ESX 5.5u2+ host bug.
>>>
>>> You need to either disable the ATS heartbeat on the ESX 5.5u2+ host
>>> side, or disable COMPARE_AND_WRITE altogether.
>>>
>>
>> To reiterate again from:
>>
>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956
>>
>> Symptoms:
>>
>> "An ESXi 5.5 Update 2 or ESXi 6.0 host loses connectivity to a VMFS5
>> datastore."
>>
>> "Note: These symptoms are seen in connection with the use of VAAI ATS
>> heartbeat with storage arrays supplied by several different vendors."
>>
>> Cause:
>>
>> "A change in the VMFS heartbeat update method was introduced in ESXi
>> 5.5 Update 2, to help optimize the VMFS heartbeat process. Whereas the
>> legacy method involves plain SCSI reads and writes with the VMware
>> ESXi kernel handling validation, the new method offloads the
>> validation step to the storage system. This is similar to other
>> VAAI-related offloads.
>>
>> This optimization results in a significant increase in the volume of
>> ATS commands the ESXi kernel issues to the storage system and
>> resulting increased load on the storage system. Under certain
>> circumstances, VMFS heartbeat using ATS may fail with false ATS
>> miscompare which causes the ESXi kernel to reverify its access to
>> VMFS datastores. This leads to the Lost access to datastore messages."
>>
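For anyone else following this thread: the two workarounds Nicholas
describes above amount to roughly the following. Treat this as a sketch
rather than copy-paste: double-check the exact option name against the
KB above, and the configfs path depends on your backstore type and name
(iblock_0/my_disk here is just an example).

# ESXi side: disable the VAAI ATS heartbeat for VMFS5 (per KB 2113956)
esxcli system settings advanced set -i 0 -o /VMFS3/useATSForHBOnVMFS5

# LIO side: disable COMPARE_AND_WRITE emulation for one backstore
echo 0 > /sys/kernel/config/target/core/iblock_0/my_disk/attrib/emulate_caw

The same attribute can also be set from targetcli with "set attribute
emulate_caw=0" in the backstore's context. That said, as I explained at
the time, I don't think ATS is what's biting us here:
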
> Nicholas: The problem isn't the ATS "bug"; in fact, I don't have any
> mention of ATS anywhere in my vmkernel.log:
>
> [root@labhost4:/tmp/scratch/log] grep ATS vmkernel.log
> [root@labhost4:/tmp/scratch/log] zcat vmkernel.0.gz | grep ATS
> [root@labhost4:/tmp/scratch/log] zcat vmkernel.1.gz | grep ATS
> [root@labhost4:/tmp/scratch/log] zcat vmkernel.2.gz | grep ATS
> [root@labhost4:/tmp/scratch/log] zcat vmkernel.3.gz | grep ATS
> [root@labhost4:/tmp/scratch/log] zcat vmkernel.4.gz | grep ATS
> [root@labhost4:/tmp/scratch/log]
>
> Also, my friend David did disable ATS on his target server and the
> crash still occurred. I just got home a couple of hours ago, so I
> haven't had a chance to try it myself, but the above tells me that the
> problem is not related to ATS. Note that during this testing I only
> had one ESXi host turned on, which is where the logs are from.
>
> I just restarted the target server, and with pretty much zero load on
> the server I got this in messages on the target server:
>
> [  275.145225] ABORT_TASK: Found referenced qla2xxx task_tag: 1184312
> [  275.145274] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1184312
> [  312.412465] ABORT_TASK: Found referenced qla2xxx task_tag: 1176128
> [  312.412511] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1176128
> [  313.413499] ABORT_TASK: Found referenced qla2xxx task_tag: 1219556
> [  318.729670] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1219556
> [  318.730244] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1194652
> [  318.730737] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1196720
> [  318.731215] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1217708
> [  318.731658] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1218896
> [  318.732111] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182024
> [  318.732531] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1168032
> [  327.528277] ABORT_TASK: Found referenced qla2xxx task_tag: 1139300
>
> See the attachment for the vmkernel.log from the exact same time
> period; it was too big to include here.
>
> This is the point where I can no longer control the service on the
> target: running "service target stop" results in the aforementioned
> hung task. Here is the output of "cat /proc/$PID/stack" after I try
> stopping the hung task:
>
> [root@dracofiler ~]# cat /proc/1911/stack
> [<ffffffffa053c0ee>] tcm_qla2xxx_tpg_enable_store+0xde/0x1a0 [tcm_qla2xxx]
> [<ffffffff812b8b7a>] configfs_write_file+0x9a/0x100
> [<ffffffff81234967>] __vfs_write+0x37/0x120
> [<ffffffff81235289>] vfs_write+0xa9/0x1a0
> [<ffffffff812361b5>] SyS_write+0x55/0xc0
> [<ffffffff817aa56e>] entry_SYSCALL_64_fastpath+0x12/0x71
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> It's been almost a week since I worked on this, so please forgive me
> if I missed one of your suggestions for something to try or a request
> for information. Just let me know what it is and I'll do it.
>
> Thanks
> Dan

Nicholas,

Have you had a chance to look into this yet? I don't mean to rush you,
but my time at home between jobs is very short and I really want to
find an answer to this!

Thanks
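
P.S. Here is roughly what I plan to capture the next time the target
wedges, in case it helps. This assumes sysrq is enabled (kernel.sysrq),
and <PID> is a placeholder for whatever process is actually stuck (the
"service target stop" writer, in my case):

# stack of the stuck configfs writer
cat /proc/<PID>/stack

# dump all tasks in uninterruptible (D) state to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 100

# plus the ABORT_TASK history from the same window
dmesg | grep ABORT_TASK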