Re-sending in plain text mode.

On Sat, Apr 30, 2016 at 12:34 AM, DAVID S <shraderdm@xxxxxxxxx> wrote:
> Hi,
>
> I would like to add that I have also seen these same issues, and I can
> replicate the exact same problems by repeating the steps that Dan noted
> in his original email.
>
> I also have a similar setup (multiple hosts running ESXi with fibre
> channel HBAs, and a storage server running the LIO target on kernel
> 4.5). I too have tested multiple configurations, e.g. with fibre channel
> switches between the hosts and storage, direct connections between the
> storage server and hosts, and every possible mix of the two. I have
> implemented all the changes that the developers have noted in these
> emails, such as disabling the VAAI heartbeat (see the command sketch at
> the end of this thread), and my target still crashes regularly.
>
> I am also a consultant (for Red Hat, not VMware), and I use this system
> to test architecture designs for customers. I have moved to using only
> local storage on my hosts because the unpredictable LIO crashes were
> hindering my ability to do anything in my environment.
>
> Even if the answer is to remove the support for VAAI in the name of
> reliability, I would definitely prefer having a working target over an
> unreliable one that supports VAAI.
>
> I am very willing to assist with testing this as well, and I can provide
> logs, etc. Please let me know what I can do to help; I think everyone
> would appreciate this being fixed.
>
> Thanks,
> David
>
> TL;DR Can we work together to get this fixed? Can we get a fork for
> testing without VAAI?
>
> On Fri, Apr 29, 2016 at 10:49 AM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>>
>> On Tue, Apr 26, 2016 at 11:27 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>> > On Tue, Apr 26, 2016 at 11:24 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>> >>
>> >> On Apr 24, 2016 11:50 PM, "Dan Lane" <dracodan@xxxxxxxxx> wrote:
>> >>>
>> >>> Nick,
>> >>> I tried raising the timeouts (to double the original) and also
>> >>> increasing/decreasing the queue depth; neither seemed to make a
>> >>> difference, although at this point the infrequency of the crashes
>> >>> has made troubleshooting quite challenging.
>> >>>
>> >>> I recently had an idea though - I've always felt like it was the
>> >>> target giving up, not the ESXi hosts, but I couldn't think of a
>> >>> good way to prove this before. It occurred to me that I could test
>> >>> with a subset of hosts powered on until the target became
>> >>> unavailable, then power those hosts off before powering on a
>> >>> different host that would be unaware of the other hosts deciding
>> >>> the link was unavailable. The result was that the host I powered on
>> >>> alone did NOT see the storage! Once I rebooted the target server,
>> >>> the host in question could see the storage. If you can think of a
>> >>> way that the information about the target being unusable would be
>> >>> passed to the host that was only powered on later (other than
>> >>> vCenter, which was powered off), I would love to hear it. I
>> >>> honestly may be missing something here, but I can't think of
>> >>> anything that would cause this other than the storage server
>> >>> failing.
>> >>>
>> >>> Thanks,
>> >>> Dan
>> >>>
>> >>> On Thu, Mar 31, 2016 at 2:48 AM, Nicholas A. Bellinger
>> >>> <nab@xxxxxxxxxxxxxxx> wrote:
>> >>> > On Wed, 2016-03-30 at 15:01 -0400, Dan Lane wrote:
>> >>> >> Nicholas,
>> >>> >> Can you please take another look at the fiber channel code and
>> >>> >> see if you can find the cause of the problems with ESXi/VAAI?
>> >>> >
>> >>> > There are two options at this point to get to the root cause of
>> >>> > why your ESX hosts are continuously generating I/O timeouts and
>> >>> > subsequent ABORT_TASKs, even when the host is idle.
>> >>> >
>> >>> > First, have you looked into changing the ESX host FC LLD side
>> >>> > timeout and queue depth that I asked about earlier here..?
>> >>> >
>> >>> > http://www.spinics.net/lists/target-devel/msg11844.html
>> >>> >
>> >>> > What effect does reducing the queue depth and increasing the
>> >>> > timeout have on the rate at which ABORT_TASKs are being
>> >>> > generated..?
>> >>> >
>> >>> > Beyond that, you'll need to engage the QLogic folks (CC'ed)
>> >>> > directly to generate a firmware dump on both the host and target
>> >>> > sides for them to analyze and figure out what's going on.
>> >>> >
>> >>> > Are you using QLogic HBAs on the ESX host side as well..?
>> >>> >
>> >>
>> >> Resending this in case it was ignored because I got excited and
>> >> forgot to correct the automatic top posting that Google does... Yep,
>> >> totally blaming Google there!
>> >>
>> >> Nick,
>> >> I tried raising the timeouts (to double the original) and also
>> >> increasing/decreasing the queue depth; neither seemed to make a
>> >> difference, although at this point the infrequency of the crashes has
>> >> made troubleshooting quite challenging.
>> >>
>> >> I recently had an idea though - I've always felt like it was the
>> >> target giving up, not the ESXi hosts, but I couldn't think of a good
>> >> way to prove this before. It occurred to me that I could test with a
>> >> subset of hosts powered on until the target became unavailable, then
>> >> power those hosts off before powering on a different host that would
>> >> be unaware of the other hosts deciding the link was unavailable. The
>> >> result was that the host I powered on alone did NOT see the storage!
>> >> Once I rebooted the target server, the host in question could see the
>> >> storage. If you can think of a way that the information about the
>> >> target being unusable would be passed to the host that was only
>> >> powered on later (other than vCenter, which was powered off), I would
>> >> love to hear it. I honestly may be missing something here, but I
>> >> can't think of anything that would cause this other than the storage
>> >> server failing.
>> >>
>> >> Thanks,
>> >> Dan
>> >
>> > And then my phone decides to convert it to HTML, causing it to get
>> > dropped... ugh
>>
>> What does it take to get a response to the messages I send about this?
>> I felt like I had a pretty concrete idea about getting to the bottom of
>> the problem, and yet I'm being completely ignored... Do you simply have
>> no intention of resolving this issue? Is this the same way your paying
>> customers are treated? I have literally been trying to get the Fiber
>> Channel target working reliably for YEARS! It was November 2013, when
>> kernel 3.12 was released with an attempt at implementing VAAI, that
>> everything went from reliable and great to horribly broken. Since then
>> I have heard everything blamed for my problems - you said it was my
>> hosts, my storage server, my RAID cards, my disk drives, my FC cards,
>> my FC switches, ESXi... and so I replaced it all, even to the point of
>> building an entirely new storage server and buying new switches and
>> hosts - yeah, I've spent a LOT on this.
>> I then sat here and beta tested kernel after kernel, updated firmwares,
>> drivers, etc., spent countless hours trying different settings, and
>> worked with friends to get them trying this to see if they had the same
>> results (and they did). There has been a ton of finger pointing,
>> something that I, as a consultant in the field, do everything to
>> avoid... It never resolves anything, it just makes people lose
>> confidence in the person pointing the finger.
>>
>> My biggest question is - do the developers of this stuff even test it?
>> And I don't mean in simulations, I mean do you build out a test system
>> with Fiber Channel cards and ESXi (the only thing that uses VAAI) and
>> rigorously test this? I know the answer: you couldn't possibly be doing
>> this, otherwise you would see the problems first hand. I've even tried
>> offering access to my system to the developers so you could truly test
>> this, since I know hardware isn't cheap (actually that's not true,
>> Fiber Channel IS cheap and ESXi is free), but you never took me up on
>> that offer. I have also mentioned that I work for VMware, so I have
>> access to internal-only documentation and the VMware engineering team;
>> I could literally be your best resource to get this resolved, but
>> instead I'm being snubbed.
>>
>> This leads me to believe one thing - you don't care about users of the
>> open source LIO, you only care about the paying users of your Datera
>> products. This is exactly the situation everyone feared when it was
>> decided to have a commercial company be responsible for an open source
>> component of the Linux kernel. If you don't like my conclusion, do
>> something to change it; even something as simple as replying to my
>> email saying that you're looking at it would at least ease my feeling
>> of being ignored.
>>
>> Dan
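
For reference, the host- and target-side knobs mentioned in this thread
(disabling the VAAI/ATS heartbeat that David refers to, lowering the QLogic
queue depth that Nicholas asks about, and gathering extra logging ahead of a
firmware dump) map roughly to the commands below. This is only a hedged
sketch based on the stock ESXi advanced settings and qla2xxx module
parameters; it has not been verified against the exact setups described
above, option names can differ between ESXi builds and drivers, and the
queue-depth value of 32 is just an example.

    # On each ESXi host (ESXi shell): disable the VAAI offload primitives and
    # the ATS-only heartbeat (the "VAAI heartbeat"). Verify the exact option
    # names on your build with "esxcli system settings advanced list" first.
    esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove
    esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedInit
    esxcli system settings advanced set --int-value 0 --option /VMFS3/HardwareAcceleratedLocking
    esxcli system settings advanced set --int-value 0 --option /VMFS3/useATSForHBOnVMFS5

    # On each ESXi host: lower the QLogic HBA queue depth. This form is for
    # the legacy qla2xxx driver; with the newer qlnativefc driver, check the
    # equivalent parameter name with "esxcli system module parameters list
    # -m qlnativefc". A reboot is required for the change to take effect.
    esxcli system module parameters set -m qla2xxx -p "ql2xmaxqdepth=32"

    # On the LIO target server: turn up qla2xxx extended error logging while
    # reproducing, so there is kernel-log context to go with any firmware
    # dump. If the parameter is not writable at runtime on your kernel, set
    # "options qla2xxx ql2xextended_error_logging=0x7fffffff" in
    # /etc/modprobe.d/ and reload the driver instead.
    echo 0x7fffffff > /sys/module/qla2xxx/parameters/ql2xextended_error_logging

The ESXi advanced options take effect immediately and can be confirmed with
"esxcli system settings advanced list -o <option name>".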