Re: ESX FC host connectivity issues

On Sat, Apr 30, 2016 at 12:36 AM, DAVID S <shraderdm@xxxxxxxxx> wrote:
> Re-sending in plain text mode.
>
> On Sat, Apr 30, 2016 at 12:34 AM, DAVID S <shraderdm@xxxxxxxxx> wrote:
>> Hi,
>>
>> I would like to add that I have also seen these same issues, and I can
>> replicate the exact same problems by repeating the steps that Dan noted in
>> his original email.
>>
>> I also have a similar setup (multiple hosts running ESXi with fibre channel
>> HBAs, and a storage server running LIO target on kernel 4.5). I too have
>> tested multiple configurations, e.g. with fibre channel switches in between
>> the hosts and storage, direct connections between the storage server and
>> hosts, and every possible mix of the two. I have implemented all the changes
>> that the developers have noted in these emails, such as disabling VAAI
>> heartbeat, etc., and my target still crashes regularly.
>>
>> I am also a consultant (for Red Hat, not VMware), and I use this system to
>> test architecture designs for customers, and I have moved to using only
>> local storage on my hosts because the unpredictable LIO crashes were
>> hindering my ability to do anything in my environment.
>>
>> Even if the answer is to remove support for VAAI in the name of
>> reliability, I would definitely prefer having a working target over an
>> unreliable one that supports VAAI.
>>
>> I am very willing to help with testing this as well, and I can provide
>> logs, etc. Please let me know what I can do to assist; I think everyone
>> would appreciate seeing this fixed.
>>
>> Thanks,
>> David
>>
>> TL;DR Can we work together to get this fixed? Can we get a fork for testing
>> without VAAI?
>>
>> On Fri, Apr 29, 2016 at 10:49 AM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>>>
>>> On Tue, Apr 26, 2016 at 11:27 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>>> > On Tue, Apr 26, 2016 at 11:24 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>>> >>
>>> >> On Apr 24, 2016 11:50 PM, "Dan Lane" <dracodan@xxxxxxxxx> wrote:
>>> >>>
>>> >>> Nick,
>>> >>> I tried raising the timeouts (to double the original) and also
>>> >>> increasing/decreasing the queue depth; neither seemed to make a
>>> >>> difference, although at this point the infrequency of the crashes has
>>> >>> made troubleshooting quite challenging.
>>> >>>
>>> >>> I recently had an idea though - I've always felt like it was the
>>> >>> target giving up, not the ESXi hosts, but I couldn't think of a good
>>> >>> way to prove this before.  It occurred to me that I could test with a
>>> >>> subset of hosts powered on until the target became unavailable, then
>>> >>> power those hosts off before powering on a different host that would
>>> >>> be unaware of the other hosts deciding the link was unavailable.  The
>>> >>> result was that the host that I powered on alone did NOT see the
>>> >>> storage!  Once I rebooted the target server the host in question could
>>> >>> see the storage.  If you can think of a way that the information about
>>> >>> the target being unusable would be passed to the host that was only
>>> >>> powered on later (other than vCenter, which was powered off) I would
>>> >>> love to hear it.  I honestly may be missing something here, but I
>>> >>> can't think of anything that would cause this other than the storage
>>> >>> server failing.
>>> >>>
>>> >>> Thanks,
>>> >>> Dan
>>> >>>
>>> >>> On Thu, Mar 31, 2016 at 2:48 AM, Nicholas A. Bellinger
>>> >>> <nab@xxxxxxxxxxxxxxx> wrote:
>>> >>> > On Wed, 2016-03-30 at 15:01 -0400, Dan Lane wrote:
>>> >>> >> Nicholas,
>>> >>> >>      Can you please take another look at the fiber channel code and
>>> >>> >> see if you can find the cause of the problems with ESXi/VAAI?
>>> >>> >
>>> >>> > There are two options at this point to get to root cause of why your
>>> >>> > ESX hosts are continuously generating I/O timeouts and subsequent
>>> >>> > ABORT_TASKs, even when the host is idle.
>>> >>> >
>>> >>> > First, have you looked into changing the ESX host FC LLD side timeout
>>> >>> > and queuedepth that I asked about earlier here..?
>>> >>> >
>>> >>> > http://www.spinics.net/lists/target-devel/msg11844.html
>>> >>> >
>>> >>> > What effect does reducing the queue depth and increasing the timeout
>>> >>> > have on the rate at which ABORT_TASKs are being generated..?
>>> >>> >
>>> >>> > Beyond that, you'll need to engage the QLogic folks (CC'ed) directly
>>> >>> > to generate a firmware dump on both host and target sides for them
>>> >>> > to analyze to figure out what's going on.
>>> >>> >
>>> >>> > Are you using QLogic HBAs on the ESX host side as well..?
>>> >>> >
>>> >>
>>> >> Resending this in case it was ignored because I got excited and forgot to
>>> >> correct the automatic top posting that Google does... Yep, totally
>>> >> blaming Google there!
>>> >>
>>> >> Nick,
>>> >> I tried raising the timeouts (to double the original) and also
>>> >> increasing/decreasing the queue depth; neither seemed to make a
>>> >> difference, although at this point the infrequency of the crashes has
>>> >> made troubleshooting quite challenging.
>>> >>
>>> >> I recently had an idea though - I've always felt like it was the target
>>> >> giving up, not the ESXi hosts, but I couldn't think of a good way to
>>> >> prove this before.  It occurred to me that I could test with a subset of
>>> >> hosts powered on until the target became unavailable, then power those
>>> >> hosts off before powering on a different host that would be unaware of
>>> >> the other hosts deciding the link was unavailable.  The result was that
>>> >> the host that I powered on alone did NOT see the storage!  Once I
>>> >> rebooted the target server the host in question could see the storage.
>>> >> If you can think of a way that the information about the target being
>>> >> unusable would be passed to the host that was only powered on later
>>> >> (other than vCenter, which was powered off) I would love to hear it.  I
>>> >> honestly may be missing something here, but I can't think of anything
>>> >> that would cause this other than the storage server failing.
>>> >>
>>> >> Thanks,
>>> >> Dan
>>> >
>>> > And then my phone decides to convert it to HTML, causing it to get
>>> > dropped... ugh
>>>
>>> What does it take to get a response to the messages I send about this?
>>>  I felt like I had a pretty concrete idea about getting to the bottom
>>> of the problem, and yet I'm being completely ignored...  Do you simply
>>> have no intention of resolving this issue?  Is this the same way your
>>> paying customers are treated?  I have literally been trying to get the
>>> Fiber Channel target working reliably for YEARS!  It was November 2013,
>>> when kernel 3.12 was released with an attempt at implementing VAAI,
>>> that everything went from reliable and great to horribly broken.
>>> Since then I have heard everything blamed for my problems - you said
>>> it was my hosts, my storage server, my RAID cards, my disk drives, my
>>> FC cards, my FC switches, ESXi... and so I replaced it all, even to
>>> the point of building an entirely new storage server and buying new
>>> switches and hosts, yeah - I've spent a LOT on this.  I then sat here
>>> and beta tested kernel after kernel, updated firmwares, drivers, etc.,
>>> spent countless hours trying different settings, and worked with
>>> friends to get them trying this to see if they had the same results
>>> (and they did).  There has been a ton of finger pointing, something that
>>> I, as a consultant in the field, do everything to avoid... It never
>>> resolves anything, it just makes people lose confidence in the person
>>> pointing the finger.
>>>
>>> My biggest question is - do the developers of this stuff even test it?
>>>  And I don't mean in simulations, I mean do you build out a test
>>> system with Fiber Channel cards and ESXi (the only thing that uses
>>> VAAI) and rigorously test this?  I know the answer: you couldn't
>>> possibly be doing this, otherwise you would see the problems first
>>> hand.  I've even tried offering access to my system to the developers
>>> so you could truly test this, since I know hardware isn't cheap
>>> (actually that's not true, Fiber Channel IS cheap and ESXi is free),
>>> but you never took me up on that offer.  I have also mentioned that I
>>> work for VMware, so I have access to internal-only documentation and
>>> the VMware engineering team; I could literally be your best resource
>>> to get this resolved, but instead I'm being snubbed.
>>>
>>> This leads me to believe one thing - you don't care about users of the
>>> open source LIO, you only care about the paying users of your Datera
>>> products.  This is exactly the situation everyone feared when it was
>>> decided to have a commercial company be responsible for an open source
>>> component of the Linux kernel.  If you don't like my conclusion, do
>>> something to change it; even something as simple as replying to my
>>> email to say that you're looking at it would at least ease the
>>> feeling that I'm being ignored.
>>>
>>> Dan
>>
>>

So after replying to this email earlier, I spoke with Dan and we
decided to approach testing this from different angles.

He pointed me to a KB explaining how to disable VAAI, so I disabled it
on all my hosts and began running benchmark tests on the shared storage
mounts from inside virtual machines (in my experience this is the most
consistent way to trigger these crashes).
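
For anyone who wants to reproduce this: disabling the block VAAI
primitives comes down to three advanced settings on each ESXi host,
roughly as below (I'm quoting the option names from memory, so please
double-check them against the KB before relying on this):

    # disable the three VAAI block primitives on each ESXi host
    esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove
    esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedInit
    esxcli system settings advanced set --int-value 0 --option /VMFS3/HardwareAcceleratedLocking

    # confirm the change took effect (Int Value should read 0)
    esxcli system settings advanced list --option /VMFS3/HardwareAcceleratedLocking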

Shortly after, the storage disappeared from the hosts. I restarted the
target service on the storage server, and it hung. I then rebooted the
machine, which also hung. I had to log in to the IPMI and do a hard
power cycle.
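
To be specific, by "the target service" I mean the systemd unit that
targetcli installs; the sequence was roughly the following (the unit
name may differ on other distros):

    # restart the LIO configuration service - this is the step that hung
    systemctl restart target.service
    # a clean reboot hung as well, hence the IPMI hard power cycle
    systemctl reboot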

After all of this, I pulled what I think are the most relevant log
entries, which are included below.

Please let me know if you would like a more detailed version of these
logs, and I'll be sure to send them over.
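
If a more verbose capture would help, I can also re-run with the
qla2xxx extended error logging module option enabled (regenerating the
initramfs if the driver loads from there) and send the full kernel log
for the affected boots, roughly like this (the option name is the
standard qla2xxx parameter; let me know if you want a different debug
level):

    # enable verbose qla2xxx logging for the next driver load / boot
    echo "options qla2xxx ql2xextended_error_logging=1" > /etc/modprobe.d/qla2xxx-debug.conf

    # pull kernel messages from the previous boot, filtered for the FC
    # driver and anything mentioning aborts or the target
    journalctl -k -b -1 | grep -iE 'qla2xxx|target|abort'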

Thanks!
David

-- Reboot --
Apr 29 22:25:24 storage.example.home kernel: ERST: Can not request
[mem 0xcff69000-0xcff69fff] for ERST.
-- Reboot --
Apr 30 01:31:33 storage.example.home kernel: qla2xxx
[0000:06:00.0]-f095:9: sess ffff8800cacea480 PRLI received, before
plogi ack.
Apr 30 02:01:12 storage.example.home kernel: qla2xxx
[0000:06:00.1]-f095:10: sess ffff880225669f00 PRLI received, before
plogi ack.
Apr 30 02:08:16 storage.example.home systemd[1]: Failed unmounting
Configuration File System.
Apr 30 02:08:16 storage.example.home systemd[1]: Failed unmounting /var.
Apr 30 02:08:16 storage.example.home kernel: watchdog: watchdog0:
watchdog did not stop!
-- Reboot --
Apr 29 22:09:40 storage.example.home kernel: ERST: Can not request
[mem 0xcff69000-0xcff69fff] for ERST.
Apr 30 02:10:28 storage.example.home kernel: qla2xxx
[0000:06:00.1]-0121:10: Failed to enable receiving of RSCN requests:
0x2.
Apr 30 02:10:41 storage.example.home kernel: qla2xxx
[0000:06:00.0]-0121:9: Failed to enable receiving of RSCN requests:
0x2.
-- Reboot --
Apr 30 02:26:05 storage.example.home kernel: qla2xxx
[0000:06:00.1]-0121:10: Failed to enable receiving of RSCN requests:
0x2.
Apr 30 02:26:17 storage.example.home kernel: qla2xxx
[0000:06:00.0]-0121:9: Failed to enable receiving of RSCN requests:
0x2.