Re: ESX FC host connectivity issues

On Tue, Apr 26, 2016 at 11:27 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
> On Tue, Apr 26, 2016 at 11:24 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>>
>> On Apr 24, 2016 11:50 PM, "Dan Lane" <dracodan@xxxxxxxxx> wrote:
>>>
>>> Nick,
>>> I tried raising the timeouts (to double the original) and also
>>> increasing/decreasing the queue depth; neither seemed to make a
>>> difference, although at this point the infrequency of the crashes
>>> has made troubleshooting quite challenging.
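>>>
>>> (For reference, the queue depth change was along these lines -
>>> assuming the qlnativefc driver on the ESXi side; older releases use
>>> qla2xxx, and the timeout knobs have different names depending on the
>>> driver version:
>>>
>>>   # show the current QLogic HBA module parameters on the ESXi host
>>>   esxcli system module parameters list -m qlnativefc
>>>   # halve the per-LUN queue depth; takes effect after a host reboot
>>>   esxcli system module parameters set -m qlnativefc -p "ql2xmaxqdepth=32"
>>> )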
>>>
>>> I recently had an idea though - I've always felt like it was the
>>> target giving up, not the ESXi hosts, but I couldn't think of a good
>>> way to prove this before.  It occurred to me that I could test with a
>>> subset of hosts powered on until the target became unavailable, then
>>> power those hosts off before powering on a different host that would
>>> be unaware of the other hosts deciding the link was unavailable.  The
>>> result was that the host that I powered on alone did NOT see the
>>> storage!  Once I rebooted the target server the host in question could
>>> see the storage.  If you can think of a way that the information about
>>> the target being unusable would be passed to the host that was only
>>> powered on later (other than vCenter, which was powered off) I would
>>> love to hear it.  I honestly may be missing something here, but I
>>> can't think of anything that would cause this other than the storage
>>> server failing.
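>>>
>>> (If there is anything specific you want captured from the target
>>> side the next time it gets into this state, let me know - so far
>>> the only things I know to grab there are along the lines of:
>>>
>>>   # incoming aborts as logged by LIO on the target server
>>>   dmesg | grep -i abort_task
>>>   # confirm the ports/LUNs are still configured and exported
>>>   targetcli ls
>>> )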
>>>
>>> Thanks,
>>> Dan
>>>
>>> On Thu, Mar 31, 2016 at 2:48 AM, Nicholas A. Bellinger
>>> <nab@xxxxxxxxxxxxxxx> wrote:
>>> > On Wed, 2016-03-30 at 15:01 -0400, Dan Lane wrote:
>>> >> Nicholas,
>>> >>      Can you please take another look at the Fibre Channel code and
>>> >> see if you can find the cause of the problems with ESXi/VAAI?
>>> >
>>> > There are two options at this point to get to the root cause of why
>>> > your ESX hosts are continuously generating I/O timeouts and subsequent
>>> > ABORT_TASKs, even when the host is idle.
>>> >
>>> > First, have you looked into changing the ESX host FC LLD side timeout
>>> > and queue depth that I asked about earlier here..?
>>> >
>>> > http://www.spinics.net/lists/target-devel/msg11844.html
>>> >
>>> > What effect does reducing the queue depth and increasing the timeout
>>> > have on the rate at which ABORT_TASKs are being generated..?
>>> >
>>> > Beyond that, you'll need to engage the QLogic folks (CC'ed) directly
>>> > to generate a firmware dump on both the host and target sides for them
>>> > to analyze and figure out what's going on.
>>> >
>>> > Are you using QLogic HBAs on the ESX host side as well..?
>>> >
>>
>> Resending this in case it was ignored because I got excited and forgot
>> to correct Google's automatic top-posting... Yep, totally blaming
>> Google there!
>>
>> Nick,
>> I tried raising the timeouts (to double the original) and also
>> increasing/decreasing the queue depth; neither seemed to make a
>> difference, although at this point the infrequency of the crashes has
>> made troubleshooting quite challenging.
>>
>> I recently had an idea though - I've always felt like it was the target
>> giving up, not the ESXi hosts, but I couldn't think of a good way to prove
>> this before.  It occurred to me that I could test with a subset of hosts
>> powered on until the target became unavailable, then power those hosts off
>> before powering on a different host that would be unaware of the other hosts
>> deciding the link was unavailable.  The result was that the host that I
>> powered on alone did NOT see the storage!  Once I rebooted the target server
>> the host in question could see the storage.  If you can think of a way that
>> the information about the target being unusable would be passed to the host
>> that was only powered on later (other than vCenter, which was powered off) I
>> would love to hear it.  I honestly may be missing something here, but I
>> can't think of anything that would cause this other than the storage server
>> failing.
>>
>> Thanks,
>> Dan
>
> And then my phone decides to convert it to HTML, causing it to get
> dropped... ugh

What does it take to get a response to the messages I send about this?
 I felt like I had a pretty concrete idea about getting to the bottom
of the problem, and yet I'm being completely ignored...  Do you simply
have no intention of resolving this issue?  Is this the same way your
paying customers are treated?  I have literally been trying to get
the Fibre Channel target working reliably for YEARS!  Everything went
from reliable and great to horribly broken in November 2013, when
kernel 3.12 was released with an attempt at implementing VAAI.
Since then I have heard everything blamed for my problems - you said
it was my hosts, my storage server, my RAID cards, my disk drives, my
FC cards, my FC switches, ESXi... so I replaced it all, even to the
point of building an entirely new storage server and buying new
switches and hosts - yes, I've spent a LOT on this.  I then sat here
and beta tested kernel after kernel, updated firmware and drivers,
spent countless hours trying different settings, and worked with
friends to get them to try this and see if they had the same results
(and they did).  There has been a ton of finger pointing, something
that I, as a consultant in the field, do everything to avoid... it
never resolves anything; it just makes people lose confidence in the
person pointing the finger.

My biggest question is - do the developers of this stuff even test it?
 And I don't mean in simulations; I mean do you build out a test
system with Fibre Channel cards and ESXi (the only thing that uses
VAAI) and rigorously test this?  I know the answer: you couldn't
possibly be doing this, otherwise you would see the problems
firsthand.  I've even offered the developers access to my system so
you could truly test this, since I know hardware isn't cheap
(actually that's not true - Fibre Channel IS cheap and ESXi is free),
but you never took me up on that offer.  I have also mentioned that I
work for VMware, so I have access to internal-only documentation and
the VMware engineering team; I could literally be your best resource
for getting this resolved, but instead I'm being snubbed.

This leads me to believe one thing - you don't care about users of the
open source LIO, only about the paying users of your Datera products.
This is exactly the situation everyone feared when it was decided to
have a commercial company be responsible for an open source component
of the Linux kernel.  If you don't like my conclusion, do something to
change it - even something as simple as replying to my email to say
you're looking at it would at least stop me feeling like I'm being
ignored.

Dan


