On Tue, Apr 26, 2016 at 11:24 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>
> On Apr 24, 2016 11:50 PM, "Dan Lane" <dracodan@xxxxxxxxx> wrote:
>>
>> Nick,
>> I tried raising the timeouts (to double the original) and also
>> increasing/decreasing the queue depth; neither seemed to make a
>> difference, although at this point the infrequency of the crashes has
>> made troubleshooting quite challenging.
>>
>> I recently had an idea, though. I've always felt it was the target
>> giving up, not the ESXi hosts, but I couldn't think of a good way to
>> prove it before. It occurred to me that I could run a subset of hosts
>> until the target became unavailable, then power those hosts off
>> before powering on a different host that would be unaware of the
>> other hosts deciding the link was unavailable. The result was that
>> the host I powered on alone did NOT see the storage! Once I rebooted
>> the target server, the host in question could see the storage. If you
>> can think of a way that the information about the target being
>> unusable could be passed to a host that was only powered on later
>> (other than via vCenter, which was powered off), I would love to hear
>> it. I may honestly be missing something here, but I can't think of
>> anything that would cause this other than the storage server failing.
>>
>> Thanks,
>> Dan
>>
>> On Thu, Mar 31, 2016 at 2:48 AM, Nicholas A. Bellinger
>> <nab@xxxxxxxxxxxxxxx> wrote:
>> > On Wed, 2016-03-30 at 15:01 -0400, Dan Lane wrote:
>> >> Nicholas,
>> >> Can you please take another look at the Fibre Channel code and
>> >> see if you can find the cause of the problems with ESXi/VAAI?
>> >
>> > There are two options at this point for getting to the root cause
>> > of why your ESX hosts are continuously generating I/O timeouts and
>> > subsequent ABORT_TASKs, even when the hosts are idle.
>> >
>> > First, have you looked into changing the ESX host FC LLD side
>> > timeout and queue depth that I asked about earlier here?
>> >
>> > http://www.spinics.net/lists/target-devel/msg11844.html
>> >
>> > What effect does reducing the queue depth and increasing the
>> > timeout have on the rate at which ABORT_TASKs are being generated?
>> >
>> > Beyond that, you'll need to engage the QLogic folks (CC'ed)
>> > directly to generate a firmware dump on both the host and target
>> > sides for them to analyze and figure out what's going on.
>> >
>> > Are you using QLogic HBAs on the ESX host side as well?
>> >
>
> Resending this in case it was ignored because I got excited and forgot
> to correct Google's automatic top posting... Yep, totally blaming
> Google there!
>
> Nick,
> I tried raising the timeouts (to double the original) and also
> increasing/decreasing the queue depth; neither seemed to make a
> difference, although at this point the infrequency of the crashes has
> made troubleshooting quite challenging.
>
> I recently had an idea, though. I've always felt it was the target
> giving up, not the ESXi hosts, but I couldn't think of a good way to
> prove it before. It occurred to me that I could run a subset of hosts
> until the target became unavailable, then power those hosts off before
> powering on a different host that would be unaware of the other hosts
> deciding the link was unavailable. The result was that the host I
> powered on alone did NOT see the storage! Once I rebooted the target
> server, the host in question could see the storage. If you can think
> of a way that the information about the target being unusable could be
> passed to a host that was only powered on later (other than via
> vCenter, which was powered off), I would love to hear it. I may
> honestly be missing something here, but I can't think of anything that
> would cause this other than the storage server failing.
>
> Thanks,
> Dan

And then my phone decides to convert it to HTML, causing it to get
dropped... ugh
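[Archive note: for anyone following the thread, the ESX-side timeout and
queue-depth tunables Nick asks about are FC driver module parameters on
the host. A rough sketch of inspecting and changing them on an ESXi host
with a QLogic HBA follows. The module name (qlnativefc) and parameter
names (ql2xmaxqdepth, qlport_down_retry) are assumptions based on the
qla2xxx driver family; the values are placeholders, not recommendations.
Verify the actual module and parameter names on your host before setting
anything.]

```shell
# Show the QLogic FC driver's current module parameters.
# The module name varies by ESXi version/driver (e.g. qlnativefc, qla2xxx).
esxcli system module parameters list -m qlnativefc

# Set a lower per-LUN queue depth and a longer port-down retry count in
# one call. Note that "parameters set" replaces the module's whole
# parameter string, so all desired options go in a single -p argument.
esxcli system module parameters set -m qlnativefc \
    -p "ql2xmaxqdepth=32 qlport_down_retry=60"

# Module parameter changes take effect only after a host reboot.
```

Comparing the rate of ABORT_TASKs in the target's logs before and after
such a change is one way to answer Nick's question above.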