Re: ESX FC host connectivity issues

Dan Lane <dracodan@xxxxxxxxx> · Sun, 24 Apr 2016 23:50:35 -0400

Nick,
I tried raising the timeouts (to double the original values) and also
increasing and decreasing the queue depth. Neither seemed to make a
difference, although at this point the infrequency of the crashes has
made troubleshooting quite challenging.

I recently had an idea though - I've always felt like it was the
target giving up, not the ESXi hosts, but I couldn't think of a good
way to prove this before.  It occurred to me that I could run with a
subset of hosts powered on until the target became unavailable, then
power those hosts off and only afterwards power on a different host,
one that could not know the other hosts had decided the link was
unavailable.  The result was that the host I powered on alone did NOT
see the storage!  Once I rebooted the target server, the host in question could
see the storage.  If you can think of a way that the information about
the target being unusable would be passed to the host that was only
powered on later (other than vCenter, which was powered off) I would
love to hear it.  I honestly may be missing something here, but I
can't think of anything that would cause this other than the storage
server failing.
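
In case it helps with the next occurrence, here is a minimal sketch of
something I could run on the target before rebooting it, to record what
the target's own FC ports report.  It assumes the target is a Linux box
that exposes its HBAs under /sys/class/fc_host with the standard
port_name and port_state attributes; the function name is just
something made up for illustration, and I don't know yet whether
port_state will actually show anything useful when the failure happens.

```python
from pathlib import Path


def read_fc_port_states(sysfs_root="/sys/class/fc_host"):
    """Return {host: {"port_name": ..., "port_state": ...}} per FC host.

    sysfs_root is a parameter so the function can be pointed at a copy
    of the directory tree; the default is the usual Linux sysfs path.
    """
    states = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No FC hosts visible (or not a Linux system with FC transport).
        return states
    for host_dir in sorted(root.iterdir()):
        entry = {}
        for attr in ("port_name", "port_state"):
            try:
                entry[attr] = (host_dir / attr).read_text().strip()
            except OSError:
                # Attribute missing or unreadable; record that explicitly.
                entry[attr] = None
        states[host_dir.name] = entry
    return states


if __name__ == "__main__":
    for host, attrs in read_fc_port_states().items():
        print(host, attrs["port_name"], attrs["port_state"])
```

Running this once while everything is healthy and again after a lone
host fails to see the storage would at least show whether the target
side thinks its ports are still up.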

Thanks,
Dan

On Thu, Mar 31, 2016 at 2:48 AM, Nicholas A. Bellinger
<nab@xxxxxxxxxxxxxxx> wrote:
> On Wed, 2016-03-30 at 15:01 -0400, Dan Lane wrote:
>> Nicholas,
>>      Can you please take another look at the fiber channel code and
>> see if you can find the cause of the problems with ESXi/VAAI?
>
> There are two options at this point to get to root cause of why your ESX
> hosts are continuously generating I/O timeouts and subsequent
> ABORT_TASKs, even when the host is idle.
>
> First, have you looked into changing the ESX host FC LLD side timeout
> and queuedepth that I asked about earlier here..?
>
> http://www.spinics.net/lists/target-devel/msg11844.html
>
> What effect does reducing the queue depth and increasing the timeout
> have on the rate in which ABORT_TASKs are being generated..?
>
> Beyond that, you'll need to engage directly the QLogic folks (CC'ed) to
> generate a firmware dump on both host and target sides for them to
> analyze to figure out what's going on.
>
> Are you using QLogic HBAs on the ESX host side as well..?
>