Re: ESX FC host connectivity issues

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Sun, 28 Feb 2016 12:55:30 -0800

On Sun, 2016-02-28 at 14:13 -0500, Dan Lane wrote:
> >> Runs fine for about a day, then simply stops responding.  There is
> >> absolutely nothing in the /var/log/messages when this happens, and the
> >> service seems to still be running, but no servers can see the storage.
> >> Is there anywhere else I can look at logs?  Is there a way to enable
> >> more verbose logging?  Additionally, once this happens it is
> >> impossible to stop the service, even running a kill -9 on the process
> >> never succeeds, the only thing that can be done at this point is to
> >> reboot the target server.
> >> Oddly, my FC switch still sees the target server.
> >> Here is what "ps aux | grep target" shows for the process after it
> >> crashes and I try to stop it, note the "D", i.e. "uninterruptible sleep":
> >> root     17055  0.0  0.0 214848 15444 ?        Ds   19:35   0:00
> >> /usr/bin/python3 /usr/bin/targetctl clear
> >>
> >
> > Not sure what you mean by 'crashing' here, but if you're not using
> > v4.5-rc4+ or the target-pending/4.4-stable I prepared for you a month
> > ago, then you're going to keep hitting the same original bug.
> >
> 
> Unfortunately I'm about to leave town for a few weeks, so I have very
> little time to look at this.  That said, let's talk about this... I
> built the latest kernel using linux-next as well as the torvalds build
> git last night.  Here are the commands I used (in case you see any
> problems).
> 
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git remote add linux-next
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> cd linux
> git remote add linux-next
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> git fetch linux-next
> git fetch --tags linux-next
> cp /boot/config-4.3.4-300.fc23.x86_64 .config
> make oldconfig
> make -j8 bzImage; make -j8 modules; make -j8 modules_install; make -j8 install
> 
> This resulted in a functioning 4.5-rc5+ kernel.  A matter of hours
> later the storage once again disappeared from my ESXi hosts.  I
> understand there may be things I need to tweak on my hosts, but should
> those things cause LIO to stop responding from the target server?
> It's back to acting the same exact way as before (with the
> target-pending/4.4-stable from a month ago), I can't stop the service
> or kill the process.
> 
> # uname -a
> Linux dracofiler.home.lan 4.5.0-rc5+ #2 SMP Sat Feb 27 15:22:25 EST
> 2016 x86_64 x86_64 x86_64 GNU/Linux
> 

You don't need to keep updating the kernel.

As per the reply to David, you'll need to either explicitly disable the
ESX-side ATS heartbeat to avoid this well-known ESX bug, which affects
every target w/ VAAI as per VMware's own KB, or set emulate_caw=0 to
disable AtomicTestAndSet altogether:

http://permalink.gmane.org/gmane.linux.scsi.target.devel/11574

To repeat, there are no target-side changes that avoid this well-known
ESX 5.5u2+ host bug.

You need to either disable ATS heartbeat on the ESX 5.5u2+ host side, or
disable COMPARE_AND_WRITE altogether.
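
For the target-side option, here is a minimal sketch, assuming configfs
is mounted at the usual /sys/kernel/config and that COMPARE_AND_WRITE
emulation should be turned off on every backstore. The function name
disable_caw and its optional root argument are illustrative only and are
not part of any existing tool:

```shell
# Sketch: write 0 to emulate_caw for every backstore device found under
# a target-core configfs root. Pass a different root to try it against
# a scratch directory first.
disable_caw() {
    root="${1:-/sys/kernel/config/target/core}"
    for attr in "$root"/*/*/attrib/emulate_caw; do
        # Skip the unexpanded glob when no devices are configured.
        [ -e "$attr" ] || continue
        echo 0 > "$attr"
        printf '%s = %s\n' "$attr" "$(cat "$attr")"
    done
}

# ESX-side alternative, run on each ESXi host instead of the above.
# This is the setting described in VMware's KB; verify it against your
# ESXi release before relying on it:
#   esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5
```

Run disable_caw as root on the target. The attribute write only changes
the running configuration, so save it with whatever tool you normally use
if it has to survive a reboot.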

> # ps aux | grep target
> root      2531  0.2  0.0 214848 15500 ?        Ds   11:48   0:00
> /usr/bin/python3 /usr/bin/targetctl clear
> 
> This is what I mean by "crashes", targetctl hangs here and there is
> absolutely nothing I've been able to do to get it to start responding
> again other than rebooting.
> 

So it looks like targetcli-fb is able to reach uninterruptible sleep.

Send along your 'cat /proc/$PID/stack' output for the hung targetcli
task, and let's find out what -fb user-space is doing differently for
qla2xxx.
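
If it is easier than looking up the PID by hand, something along these
lines should collect the stacks of every task currently in state D. It
is only a sketch: the function name dump_d_state_stacks and its optional
proc-root argument are made up for illustration, and reading
/proc/$PID/stack generally requires root:

```shell
# Sketch: print the kernel stack of every task whose State is "D"
# (uninterruptible sleep), using only the /proc/<pid>/status and
# /proc/<pid>/stack files.
dump_d_state_stacks() {
    proc="${1:-/proc}"
    for dir in "$proc"/[0-9]*; do
        # Tasks can exit while we iterate, so skip anything unreadable.
        [ -r "$dir/status" ] || continue
        state=$(awk '/^State:/ {print $2; exit}' "$dir/status" 2>/dev/null)
        [ "$state" = "D" ] || continue
        name=$(awk '/^Name:/ {print $2; exit}' "$dir/status" 2>/dev/null)
        echo "== pid ${dir##*/} ($name) =="
        cat "$dir/stack" 2>/dev/null || echo "(stack not readable; try as root)"
    done
}
```

Running "dump_d_state_stacks > stacks.txt" as root while targetctl is
hung and attaching the file would be enough.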
