Re: ESX FC host connectivity issues

On Wed, Mar 16, 2016 at 5:08 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
> On Wed, Mar 16, 2016 at 10:47 AM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>> On Tue, Mar 15, 2016 at 3:49 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>>> On Tue, Mar 15, 2016 at 3:45 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>>>> I went to pull the latest source and noticed that mainline kernel 4.5
>>>> was released yesterday.  Did all the recent patches that apply to
>>>> Fibre Channel make it into this release, or do I still need to patch?
>>>>
>>>> Thanks
>>>> Dan
>>>>
>>>> On Fri, Mar 11, 2016 at 11:07 PM, Nicholas A. Bellinger
>>>> <nab@xxxxxxxxxxxxxxx> wrote:
>>>>> On Fri, 2016-03-11 at 18:15 -0500, Dan Lane wrote:
>>>>>> I'm back in town now and ready to try this again.  Should I still try
>>>>>> this patch?
>>>>>
>>>>> Yes, you still need to apply the patch to drop the extra bogus
>>>>> target_put_sess_cmd() call, when !__target_check_io_state() for
>>>>> ABORT_TASK occurs:
>>>>>
>>>>> https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?id=7f54ab5ff52fb0b91569bc69c4a6bc5cac1b768d
>>>>>
>>>>>> I noticed you had submitted a patch a few days ago, so
>>>>>> can I just pull all the latest updates from your git repo?
>>>>>>
>>>>>
>>>>> The PULL request just went out to Linus, and will be included for
>>>>> v4.5 release.
>>>>>
>>>>> It's also CC'ed for stable, and will make its way down to v3.14.y
>>>>> stable over the next weeks.
>>>>>
>>>
>>> Whoops, sorry about the top post...  I know you said the request was
>>> sent to Linus, I just wanted to confirm that it made it since the time
>>> frame between that last email and when 4.5 was released was so short.
>>>
>>> Thanks again
>>
>> Latest update:
>> With the 4.5 (final) kernel my storage was more stable than ever, but
>> again went inaccessible after about 15 hours.  This is despite very
>> heavy usage last night, the likes of which caused failures in the past
>> (but I was amazed with the performance, I was able to get 650MB/s
>> writes and 750MB/s reads!!!).  The aborts seem to be coming in as
>> steady as they have in the past, which leads me to believe the patch
>> for the extra "bogus target_put_sess_cmd() call" didn't make it in
>> time for the 4.5 release.  If it did, this means there are more
>> problems.
>>
>> Here is a snippet from my messages log before ESXi gave up:
>> Mar 16 07:21:57 dracofiler kernel: ABORT_TASK: Sending
>> TMR_TASK_DOES_NOT_EXIST for ref_tag: 1169660
>> Mar 16 07:21:57 dracofiler kernel: ABORT_TASK: Found referenced
>> qla2xxx task_tag: 1169616
>> Mar 16 07:21:57 dracofiler kernel: ABORT_TASK: Sending
>> TMR_TASK_DOES_NOT_EXIST for ref_tag: 1169616
>> Mar 16 07:23:20 dracofiler kernel: ABORT_TASK: Found referenced
>> qla2xxx task_tag: 1147132
>> Mar 16 07:23:20 dracofiler kernel: ABORT_TASK: Sending
>> TMR_TASK_DOES_NOT_EXIST for ref_tag: 1147132
>> Mar 16 07:23:20 dracofiler kernel: ABORT_TASK: Found referenced
>> qla2xxx task_tag: 1147176
>> Mar 16 07:23:24 dracofiler kernel: ABORT_TASK: Sending
>> TMR_FUNCTION_COMPLETE for ref_tag: 1147176
>> Mar 16 07:23:24 dracofiler kernel: ABORT_TASK: Found referenced
>> qla2xxx task_tag: 1186556
>>
>> Also, I configured my hosts to send their logs to a syslog server.  I
>> have an appointment to go to, but I'll pull those logs and send them
>> to you this afternoon.
>>
>> Thanks
>> Dan
>
> I discovered the ATS heartbeat issue was still causing problems.  I have
> created a host profile and applied it to all of my hosts to ensure it
> doesn't come up again.  For now there's no reason to dig further into
> this; I will report back on whether I'm still having the issue in
> the next few days (or sooner if it still fails).
>
> Thanks,
> Dan


Okay, back on track.  I'm still seeing these aborts and eventually
losing access to the storage, despite running the final 4.5 kernel with
VMFS3.UseATSForHBonVMFS5=0 set on all hosts.  Based on the log and
what you have explained in the past, it still looks like the hosts are
using ATS heartbeat, but I may be wrong.  It takes a lot longer
to fail than it used to, but I can still trigger the failure by
running ATTO repeatedly from a VM.  When comparing the logs, note that
the target server logs in local time (UTC-4), while vmkernel.log is in
UTC (Zulu).
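
To line the two logs up, the target timestamps need to be shifted to
UTC.  A minimal Python sketch follows; it assumes a fixed UTC-4 offset
on the target and the year 2016 (syslog stamps carry no year), and the
helper name target_to_utc is made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the target server's clock sits at a fixed UTC-4 offset.
TARGET_TZ = timezone(timedelta(hours=-4))


def target_to_utc(stamp: str, year: int = 2016) -> datetime:
    """Convert a syslog-style 'Mar 16 23:34:28' stamp (target local time) to UTC."""
    # Syslog stamps omit the year, so it is supplied separately.
    local = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return local.replace(tzinfo=TARGET_TZ).astimezone(timezone.utc)


if __name__ == "__main__":
    # Time of the MISCOMPARE burst in the target log.
    print(target_to_utc("Mar 16 23:34:28").isoformat())
```

The resulting UTC time is the one to search for in vmkernel.log.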

Here is my target log from the period when it finally failed
(ATTO was running from a VM at the time):
Mar 16 23:30:49 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1165392
Mar 16 23:30:51 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1165744
Mar 16 23:30:51 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1165744
Mar 16 23:30:51 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1165832
Mar 16 23:30:51 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1165832
Mar 16 23:31:06 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1180616
Mar 16 23:31:06 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1180616
Mar 16 23:31:09 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1182244
Mar 16 23:31:09 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182244
Mar 16 23:31:09 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1182200
Mar 16 23:31:09 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182200
Mar 16 23:31:11 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1182552
Mar 16 23:31:11 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182552
Mar 16 23:31:11 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1182508
Mar 16 23:31:11 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182508
Mar 16 23:34:18 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1152236
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_FUNCTION_COMPLETE for ref_tag: 1152236
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1161124
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1161124
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1199888
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1199888
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1156680
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1166976
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1168164
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1168164
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1156680
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1156680
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1172784
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1134064
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1134064
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1174720
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1174720
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1134152
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1134152
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1215288
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1185368
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1185632
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1223648
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1192012
Mar 16 23:34:28 dracofiler kernel: Detected MISCOMPARE for addr:
ffff88062f84c000 buf: ffff88062d360c00
Mar 16 23:34:28 dracofiler kernel: Target/iblock: Send MISCOMPARE
check condition and sense
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_FUNCTION_COMPLETE for ref_tag: 1192012
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1201912
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1143128
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1176524
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1172476
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1135824
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1169044
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1173136
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1174060
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1198084
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1174192
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187084
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1146560
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1146604
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1196148
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1152280
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1144844
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1145768
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1146120
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1191748
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1159496
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1159540
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1170848
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1219336
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1219424
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1179692
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1208600
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1186028
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1186160
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1186204
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1186248
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1193508
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1200548
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1151796
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1155712
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1185852
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1200284
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1219028
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1139300
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1150168
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1150212
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1136000
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1136000
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1218896
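
A log this long is easier to read as a tally.  The Python sketch below
counts the TMR responses and lists task tags that were "Found" but have
no "Sending" response in the excerpt.  The regular expressions are
written against the message text shown above, and the function name
summarize_aborts is made up for illustration:

```python
import re
from collections import Counter

# Patterns match the ABORT_TASK messages as they appear in the excerpt above.
RESPONSE_RE = re.compile(r"ABORT_TASK: Sending\s+(TMR_\w+)\s+for ref_tag: (\d+)")
FOUND_RE = re.compile(r"ABORT_TASK: Found referenced\s+qla2xxx task_tag: (\d+)")


def summarize_aborts(log_text: str):
    """Return (response counts, tags found but never answered in this text)."""
    # Mail wrapping splits each kernel message across two lines,
    # so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    responses = Counter(m.group(1) for m in RESPONSE_RE.finditer(flat))
    found = {m.group(1) for m in FOUND_RE.finditer(flat)}
    answered = {m.group(2) for m in RESPONSE_RE.finditer(flat)}
    return responses, sorted(found - answered)


if __name__ == "__main__":
    # The last three messages of the excerpt, wrapped as in this mail.
    sample = """Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1136000
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1136000
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced
qla2xxx task_tag: 1218896"""
    counts, unanswered = summarize_aborts(sample)
    print(dict(counts))
    print(unanswered)
```

A tag listed as unanswered only means no response line appears in the
text that was fed in; the excerpt may simply end before it was logged.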


Dan

Attachment: vmkernel.log
Description: Binary data

