On Sun, Feb 28, 2016 at 4:02 PM, Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx> wrote:
> On Sun, 2016-02-28 at 12:55 -0800, Nicholas A. Bellinger wrote:
>> On Sun, 2016-02-28 at 14:13 -0500, Dan Lane wrote:
>
> <SNIP>
>
>> > Unfortunately I'm about to leave town for a few weeks, so I have very
>> > little time to look at this. That said, let's talk about this... I
>> > built the latest kernel last night using linux-next as well as the
>> > Torvalds git tree. Here are the commands I used (in case you see any
>> > problems):
>> >
>> > git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> > cd linux
>> > git remote add linux-next git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>> > git fetch linux-next
>> > git fetch --tags linux-next
>> > cp /boot/config-4.3.4-300.fc23.x86_64 .config
>> > make oldconfig
>> > make -j8 bzImage; make -j8 modules; make -j8 modules_install; make -j8 install
>> >
>> > This resulted in a functioning 4.5-rc5+ kernel. A matter of hours
>> > later, the storage once again disappeared from my ESXi hosts. I
>> > understand there may be things I need to tweak on my hosts, but should
>> > those things cause LIO to stop responding on the target server?
>> > It's back to acting the same exact way as before (with
>> > target-pending/4.4-stable from a month ago): I can't stop the service
>> > or kill the process.
>> >
>> > # uname -a
>> > Linux dracofiler.home.lan 4.5.0-rc5+ #2 SMP Sat Feb 27 15:22:25 EST
>> > 2016 x86_64 x86_64 x86_64 GNU/Linux
>> >
>>
>> You don't need to keep updating the kernel.
>>
>> As per the reply to David, you'll need to either explicitly disable the
>> ESX side ATS heartbeat to avoid this well-known ESX bug that affects
>> every target with VAAI, as per VMware's own KB, or set emulate_caw=0 to
>> disable AtomicTestandSet altogether:
>>
>> http://permalink.gmane.org/gmane.linux.scsi.target.devel/11574
>>
>> To repeat, there are no target side changes that avoid this well-known
>> ESX 5.5u2+ host bug.
>>
>> You need to either disable the ATS heartbeat on the ESX 5.5u2+ host
>> side, or disable COMPARE_AND_WRITE altogether.
>>
>
> To reiterate again from:
>
> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956
>
> Symptoms:
>
> "An ESXi 5.5 Update 2 or ESXi 6.0 host loses connectivity to a VMFS5
> datastore."
>
> "Note: These symptoms are seen in connection with the use of VAAI ATS
> heartbeat with storage arrays supplied by several different vendors."
>
> Cause:
>
> "A change in the VMFS heartbeat update method was introduced in ESXi 5.5
> Update 2, to help optimize the VMFS heartbeat process. Whereas the
> legacy method involves plain SCSI reads and writes with the VMware ESXi
> kernel handling validation, the new method offloads the validation step
> to the storage system. This is similar to other VAAI-related offloads.
>
> This optimization results in a significant increase in the volume of ATS
> commands the ESXi kernel issues to the storage system and resulting
> increased load on the storage system. Under certain circumstances, VMFS
> heartbeat using ATS may fail with false ATS miscompare which causes the
> ESXi kernel to reverify its access to VMFS datastores. This leads to the
> 'Lost access to datastore' messages."
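For reference, as I understand it, the two workarounds described above
amount to roughly the following commands. This is a sketch only: the
esxcli option name is taken from the VMware KB, and the configfs path
assumes an iblock backstore named "block0" under HBA "iblock_0", which
are placeholder names; substitute the actual backstore on the target.

# On the ESXi 5.5u2+/6.0 host: disable the VAAI ATS heartbeat (per KB 2113956)
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

# Confirm the new value (Int Value should now be 0)
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5

# Alternatively, on the LIO target: disable COMPARE_AND_WRITE emulation
# for the backstore ("iblock_0/block0" is an example name)
echo 0 > /sys/kernel/config/target/core/iblock_0/block0/attrib/emulate_caw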
Nicholas:

The problem isn't the ATS "bug"; in fact, there is no mention of ATS
anywhere in my vmkernel.log:

[root@labhost4:/tmp/scratch/log] grep ATS vmkernel.log
[root@labhost4:/tmp/scratch/log] zcat vmkernel.0.gz | grep ATS
[root@labhost4:/tmp/scratch/log] zcat vmkernel.1.gz | grep ATS
[root@labhost4:/tmp/scratch/log] zcat vmkernel.2.gz | grep ATS
[root@labhost4:/tmp/scratch/log] zcat vmkernel.3.gz | grep ATS
[root@labhost4:/tmp/scratch/log] zcat vmkernel.4.gz | grep ATS
[root@labhost4:/tmp/scratch/log]

Also, my friend David did disable ATS on his target server and the crash
still occurred. I only got home a couple of hours ago, so I haven't had a
chance to test further, but the output above tells me the problem is not
related to ATS. Note that during this testing I had only one ESXi host
powered on, which is the host these logs are from.

I just restarted the target server, and with essentially zero load on it
I got this in /var/log/messages on the target:

[  275.145225] ABORT_TASK: Found referenced qla2xxx task_tag: 1184312
[  275.145274] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1184312
[  312.412465] ABORT_TASK: Found referenced qla2xxx task_tag: 1176128
[  312.412511] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1176128
[  313.413499] ABORT_TASK: Found referenced qla2xxx task_tag: 1219556
[  318.729670] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1219556
[  318.730244] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1194652
[  318.730737] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1196720
[  318.731215] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1217708
[  318.731658] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1218896
[  318.732111] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182024
[  318.732531] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1168032
[  327.528277] ABORT_TASK: Found referenced qla2xxx task_tag: 1139300

See the attachment for the vmkernel.log from the same time period; it was
too big to include inline. This is the point where I can no longer control
the target service: running "service target stop" results in the
aforementioned hung task. Here is the output of "cat /proc/$PID/stack" for
the hung process after I try to stop the service:

[root@dracofiler ~]# cat /proc/1911/stack
[<ffffffffa053c0ee>] tcm_qla2xxx_tpg_enable_store+0xde/0x1a0 [tcm_qla2xxx]
[<ffffffff812b8b7a>] configfs_write_file+0x9a/0x100
[<ffffffff81234967>] __vfs_write+0x37/0x120
[<ffffffff81235289>] vfs_write+0xa9/0x1a0
[<ffffffff812361b5>] SyS_write+0x55/0xc0
[<ffffffff817aa56e>] entry_SYSCALL_64_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

It's been almost a week since I worked on this, so please forgive me if I
missed one of your suggestions or requests for information. Just let me
know what you need and I'll do it.

Thanks,
Dan
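For completeness, one way to capture the state of every blocked task on
the target the next time the hang occurs (a minimal sketch; it assumes
CONFIG_MAGIC_SYSRQ is built into the kernel and requires root):

# Make sure the sysrq interface is enabled
echo 1 > /proc/sys/kernel/sysrq

# Dump stack traces of all uninterruptible (D-state) tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# List any other processes stuck in D state, with the kernel function
# they are sleeping in
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'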
Attachment:
vmkernel-snip.log
Description: Binary data
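As a cross-check from the ESXi side, the VAAI/ATS status of the backing
device can also be queried directly rather than grepping vmkernel.log.
This is a sketch: the device identifier and datastore name below are
placeholders for the actual values on the host.

# Show which VAAI primitives (including ATS) the device reports; get the
# naa identifier from "esxcli storage core device list"
esxcli storage core device vaai status get -d naa.<device-id>

# Show whether the VMFS volume is in ATS-only locking mode
vmkfstools -Ph /vmfs/volumes/<datastore-name> | grep -i mode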