On Fri, 2016-02-26 at 21:25 -0500, Dan Lane wrote:
> Okay, despite my ongoing efforts to resolve this, I am no closer to
> having a stable storage solution.  Here is what I know so far; let's
> figure this out.
>
> First of all, I am NOT alone with my problems: I have a friend who is
> experiencing these same problems using FC.  In fact, I would like to
> know if anyone is actually running this successfully - surely one of
> the developers must have a lab set up to validate the code, right?
>
> Kernel 4.3.4 symptoms:
> Kernel panics randomly (as reported in the past).  Nicholas
> identified a bug that he believed was causing this and created a
> patch.
>
> Kernel 4.5rc? - built from the latest kernel source code with the
> patch from Nicholas about 3 weeks ago

The patch series below was merged for v4.5-rc4, but has not made it
back to the stable tree as of yet:

[PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
http://www.spinics.net/lists/target-devel/msg11822.html

That means you'll still need to use target-pending/4.4-stable until
these patches make their way into the stable kernel tree.

> Runs fine for about a day, then simply stops responding.  There is
> absolutely nothing in /var/log/messages when this happens, and the
> service seems to still be running, but no servers can see the
> storage.  Is there anywhere else I can look at logs?  Is there a way
> to enable more verbose logging?  Additionally, once this happens it
> is impossible to stop the service; even running a kill -9 on the
> process never succeeds.  The only thing that can be done at this
> point is to reboot the target server.
> Oddly, my FC switch still sees the target server.
> Here is what "ps aux | grep target" shows for the process after it
> crashes and I try to stop it; note the "D", i.e. "uninterruptible
> sleep":
> root     17055  0.0  0.0 214848 15444 ?   Ds   19:35   0:00
>   /usr/bin/python3 /usr/bin/targetctl clear

Not sure what you mean by 'crashing' here, but if you're not using
v4.5-rc4+ or the target-pending/4.4-stable branch I prepared for you a
month ago, then you're going to keep hitting the same original bug.

When a task is stuck in uninterruptible sleep like that, the kernel
stack trace of the blocked task is the most useful thing to capture
(see the SysRq example further below).

> Kernel 3.5.0 (old target server, unpatched because I'm afraid of
> breaking target!)
> Runs flawlessly all day long, never fails for any reason, no matter
> what version of ESXi is used

Good, it's helpful to know that the same setup works pre-VAAI.

> I believe the problem is related to the LIO implementation of VAAI
> for the following reason:
> when I used ESXi 5 without VAAI enabled and the 4.3.4 target, I
> didn't have any problems.  When I tried ESXi 5.5 and 6 (with VAAI
> enabled), LIO crashed.  I also don't have any problems with ESXi 6
> against my old target based on kernel 3.5, which is prior to the
> implementation of VAAI.

That's what I've been waiting to hear.  You're probably hitting this
well-known ATS heartbeat bug in ESX v5.5u2 and above:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956

You'll need to manually disable ATS heartbeat from the ESX host side
on all volumes, as recommended by every major storage vendor (see the
esxcli example below).

> I'm really tired of having my equipment blamed for the problem, or
> the idea that I'm using an old firmware on my FC cards or FC switch.

We don't know what your hardware environment is, and getting all makes
+ models + firmware revs for all host + target HBAs helps us
understand what we're dealing with.  I've still never seen anything
about your host side FC hardware or switch setup, after asking
multiple times.
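For the record, those details can be pulled straight from sysfs on the
Linux target side (paths assume the standard fc_host class, which
qla2xxx and lpfc both register):

   lspci | grep -i "fibre channel"
   cat /sys/class/fc_host/host*/symbolic_name
   cat /sys/class/fc_host/host*/port_name
   cat /sys/class/fc_host/host*/speed

symbolic_name normally includes the HBA model plus driver + firmware
versions, which is exactly the information that's been missing so far.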
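For the ATS heartbeat workaround mentioned above, the KB boils down to
one advanced setting per ESXi 5.5u2+ host (double-check the exact
option name against the KB for your ESXi version):

   # Disable ATS-only heartbeating on VMFS5 datastores
   esxcli system settings advanced set -i 0 -s /VMFS3/useATSForHBOnVMFS5

   # Verify the current value
   esxcli system settings advanced list -o /VMFS3/useATSForHBOnVMFS5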
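And for the hung targetctl process, assuming CONFIG_MAGIC_SYSRQ is
enabled in your kernel config, a blocked-task dump will show where in
the kernel it is stuck:

   # Dump stack traces of all D state tasks into the kernel log
   echo w > /proc/sysrq-trigger
   dmesg

   # Or grab the stack of the specific PID from the ps output above
   # (requires CONFIG_STACKTRACE)
   cat /proc/17055/stack

Posting those traces here would tell us definitively whether you're
still hitting the known LUN_RESET bug or something new.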
> I actually spent a large amount of money building the server that
> I'm using for LIO because of the suggestion in the past that maybe
> the backend wasn't keeping up.  All firmwares have been updated to
> their latest, and I have seen the problem even when using a direct
> connection (no switch).  I've also used three different physical
> servers as the target.  In addition, a friend of mine has had the
> exact same issues, as I mentioned earlier.  One thing that has been
> suggested is that my backend disks can't keep up.  My backend is 20x
> 10k SAS drives in RAID 6 with an Intel 730 SSD acting as read and
> write cache using LSI CacheCade 2.0.  Testing with this setup prior
> to a crash has shown 400MB/s reads and writes (likely being limited
> by the single 4Gb FC connection) and 1-10ms latency; needless to
> say, I don't think the problem is my back end.  Additionally, my old
> target server that never crashes only has 6 old Seagate 7200 RPM
> drives in RAID 6, which are good for about 200MB/s reads and writes.

Like I've explained many times before, ESX is particularly sensitive
to backend device latency.  That is, if ESX keeps repeatedly
generating ABORT_TASKs, it will eventually take the LUN offline.

I've still never seen 'iostat -xm 2' output from your backend while
workloads are running, which would be useful in order to confirm the
average read + write latency of the backend itself (see the example
below).

If the backend device latency is above the default ql2xcmdtimeout, as
previously discussed here:

http://www.spinics.net/lists/target-devel/msg11844.html

then you'll keep seeing ABORT_TASKs, and eventually the ESX host will
take the device offline.

> I'm open to doing just about any troubleshooting that could help.
> Also, as mentioned in the past, I have access to absolutely any
> version of ESXi and I have multiple available hosts, so if you would
> like to test anything related to that I would be happy to assist.

Let's see the full vmkernel.log output from the ESX hosts.
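For the iostat capture mentioned above, something like the following,
recorded while an ESX workload is actually hitting the LUN, is what's
needed (the await columns are average I/O latency in milliseconds;
exact column names vary by sysstat version):

   # Per-device latency + throughput, 2 second intervals, MB units
   iostat -xm 2

   # For comparison, if your qla2xxx build exposes it as a module
   # parameter, the command timeout (in seconds) shows up here:
   cat /sys/module/qla2xxx/parameters/ql2xcmdtimeout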
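On each ESXi host the live log is /var/log/vmkernel.log.  A quick
first pass for the usual failure signatures looks something like this
(adjust the patterns as needed):

   grep -iE 'abort|ats miscompare' /var/log/vmkernel.log

but please attach the full, uncut log rather than just the grep hits.

--nab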