Re: ESX FC host connectivity issues

Dan Lane <dracodan@xxxxxxxxx> · Fri, 26 Feb 2016 21:25:04 -0500

Okay, despite my ongoing efforts to resolve this, I am no closer to
having a stable storage solution.  Here is what I know so far; let's
figure this out.

First of all, I am NOT alone with these problems: a friend of mine is
experiencing the same problems using FC.  In fact, I would like to
know if anyone is actually running this successfully.  Surely one of
the developers must have a lab set up to validate the code, right?

Kernel 4.3.4 symptoms:
Kernel panics randomly (as reported in the past).  Nicholas identified
a bug that he believed was causing this and created a patch.

Kernel 4.5rc? (built about 3 weeks ago from the latest kernel source
with the patch from Nicholas) symptoms:
Runs fine for about a day, then simply stops responding.  There is
absolutely nothing in the /var/log/messages when this happens, and the
service seems to still be running, but no servers can see the storage.
Is there anywhere else I can look at logs?  Is there a way to enable
more verbose logging?  Additionally, once this happens it is
impossible to stop the service; even running kill -9 on the process
never succeeds.  The only thing that can be done at this point is to
reboot the target server.
Oddly, my FC switch still sees the target server.
Here is what "ps aux | grep target" shows for the process after it
crashes and I try to stop it.  Note the "D", i.e. "uninterruptible
sleep":
root     17055  0.0  0.0 214848 15444 ?        Ds   19:35   0:00 /usr/bin/python3 /usr/bin/targetctl clear
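
One way to get more detail on a task stuck in "D" state is to read
its kernel stack from /proc, which shows where in the kernel it is
blocked.  The following is a minimal sketch, not a tested diagnostic
for this particular hang.  It assumes a Linux /proc layout, root
privileges, and a kernel built with CONFIG_STACKTRACE (otherwise
/proc/<pid>/stack is not readable); the function names are
hypothetical.

```python
import os


def parse_stat_state(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    left, _, rest = stat_text.rpartition(")")
    comm = left.split("(", 1)[1]
    state = rest.split()[0]
    return comm, state


def find_blocked(proc_root="/proc"):
    """Yield (pid, comm, kernel_stack) for every process in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return  # no /proc here (not Linux, or not mounted)
    for entry in sorted(entries):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat"), errors="replace") as f:
                comm, state = parse_stat_state(f.read())
        except OSError:
            continue  # process exited while we were scanning
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "stack"), errors="replace") as f:
                stack = f.read()
        except OSError:
            stack = "(stack unavailable: needs root and CONFIG_STACKTRACE)\n"
        yield int(entry), comm, stack


if __name__ == "__main__":
    for pid, comm, stack in find_blocked():
        print(f"{pid} {comm}\n{stack}")
```

Run as root on the target server once the hang has occurred.  The
kernel log (dmesg) is also worth checking for hung-task warnings,
since those do not always reach /var/log/messages.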

Kernel 3.5.0 (old target server, unpatched because I'm afraid of
breaking target!)
Runs flawlessly all day long and never fails for any reason, no matter
what version of ESXi is used.

I believe the problem is related to the LIO implementation of VAAI for
the following reason:
When I used ESXi 5 without VAAI enabled and the 4.3.4 target, I didn't
have any problems.  When I tried ESXi 5.5 and 6 (with VAAI enabled),
LIO crashed.  I also don't have any problems with ESXi 6 against my
old target based on kernel 3.5, which predates the implementation
of VAAI.
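
To help test the VAAI theory, it may be useful to see which
VAAI-related SCSI emulations each LIO backstore currently advertises.
The sketch below only reads the per-device attributes under configfs.
The attribute names and the /sys/kernel/config/target/core path are
assumptions based on recent kernels; older kernels such as 3.5 may
lack some or all of these files, which are reported as None.  The
function name is hypothetical.

```python
import glob
import os

# Assumed attribute names from recent kernels, with the SCSI command
# (and matching VAAI primitive) that each one controls.
VAAI_ATTRIBS = {
    "emulate_3pc": "EXTENDED COPY (VAAI Full Copy)",
    "emulate_caw": "COMPARE AND WRITE (VAAI ATS)",
    "emulate_tpu": "UNMAP (VAAI space reclaim)",
    "emulate_tpws": "WRITE SAME with the UNMAP bit",
}


def read_vaai_attribs(core_root="/sys/kernel/config/target/core"):
    """Return {backstore_path: {attrib_name: value or None}}."""
    result = {}
    pattern = os.path.join(core_root, "*", "*", "attrib")
    for attrib_dir in sorted(glob.glob(pattern)):
        values = {}
        for name in VAAI_ATTRIBS:
            try:
                with open(os.path.join(attrib_dir, name)) as f:
                    values[name] = f.read().strip()
            except OSError:
                values[name] = None  # attribute not present on this kernel
        result[os.path.dirname(attrib_dir)] = values
    return result


if __name__ == "__main__":
    for backstore, values in read_vaai_attribs().items():
        print(backstore)
        for name, value in values.items():
            print(f"  {name} = {value}  # {VAAI_ATTRIBS[name]}")
```

If one of these reads 1, writing 0 to the same file should turn that
emulation off for the device.  Disabling them one at a time and
watching whether the hang still occurs could narrow down which
primitive, if any, is involved.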

I'm really tired of having my equipment blamed for the problem, or
being told that I'm using old firmware on my FC cards or FC switch.  I
actually spent a large amount of money building the server that I'm
using for LIO because of a past suggestion that maybe the backend
wasn't keeping up.  All firmware has been updated to the latest
version, and I have seen the problem even when using a direct
connection (no switch).  I've also used three different physical
servers as the target.  In addition, as I mentioned earlier, a friend
of mine has had the exact same issues.  One thing that has been
suggested is that my backend disks can't keep up.  My backend is 20x
10k SAS drives in RAID 6 with an Intel 730 SSD acting as read and
write cache using LSI CacheCade 2.0.  Testing with this setup prior to
a crash has shown 400MB/s reads and writes (likely limited by the
single 4Gb FC connection) and 1-10ms latency.  Needless to say, I
don't think the problem is my backend.  Additionally, my old target
server that never crashes only has 6 old Seagate 7200 RPM drives in
RAID 6, which are good for about 200MB/s reads and writes.
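
As a rough check that 400MB/s really is about the ceiling of a single
4Gb link: the arithmetic below assumes the standard 4GFC figures (a
4.25 GBaud line rate with 8b/10b encoding), which are not taken from
the measurements above, and it ignores frame and protocol overhead.

```python
def fc_payload_mb_per_s(line_rate_mbaud, data_bits=8, coded_bits=10):
    """One-way data rate in MB/s after line encoding, before framing."""
    # 8b/10b encoding carries 8 data bits in every 10 bits on the wire.
    data_mbit_per_s = line_rate_mbaud * data_bits / coded_bits
    return data_mbit_per_s / 8  # bits to bytes


# 4GFC signals at 4250 MBaud (assumed standard value).
print(fc_payload_mb_per_s(4250))  # 425.0
```

Framing overhead brings that down to roughly the 400MB/s usually
quoted for 4Gb FC, so the measured throughput is consistent with the
link, not the disks, being the limit.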

I'm open to doing just about any troubleshooting that could help.
Also, as mentioned in the past, I have access to absolutely any
version of ESXi and I have multiple available hosts, so if you would
like to test anything related to that I would be happy to assist.

Thanks
Dan