On Wed, Mar 16, 2016 at 5:08 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
> On Wed, Mar 16, 2016 at 10:47 AM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>> On Tue, Mar 15, 2016 at 3:49 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>>> On Tue, Mar 15, 2016 at 3:45 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>>>> I went to pull the latest source and I noticed mainline Kernel 4.5 was
>>>> released yesterday. Did all the recent patches that apply to fiber
>>>> channel make it into this release or do I still need to patch?
>>>>
>>>> Thanks
>>>> Dan
>>>>
>>>> On Fri, Mar 11, 2016 at 11:07 PM, Nicholas A. Bellinger
>>>> <nab@xxxxxxxxxxxxxxx> wrote:
>>>>> On Fri, 2016-03-11 at 18:15 -0500, Dan Lane wrote:
>>>>>> I'm back in town now and ready to try this again. Should I still try
>>>>>> this patch?
>>>>>
>>>>> Yes, you still need to apply the patch to drop the extra bogus
>>>>> target_put_sess_cmd() call, when !__target_check_io_state() for
>>>>> ABORT_TASK occurs:
>>>>>
>>>>> https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?id=7f54ab5ff52fb0b91569bc69c4a6bc5cac1b768d
>>>>>
>>>>>> I noticed you had submitted a patch a few days ago, so
>>>>>> can I just pull all the latest updates from your git repo?
>>>>>>
>>>>>
>>>>> The PULL request just went out to Linus, and will be included for
>>>>> v4.5 release.
>>>>>
>>>>> It's also CC'ed for stable, and will make it's way down to v3.14.y
>>>>> stable over the next weeks.
>>>>>
>>>
>>> Whoops, sorry about the top post... I know you said the request was
>>> sent to Linus, I just wanted to confirm that it made it since the time
>>> frame between that last email and when 4.5 was released was so short.
>>>
>>> Thanks again
>>
>> Latest update:
>> With the 4.5 (final) kernel my storage was more stable than ever, but
>> again went inaccessible after about 15 hours. This is despite very
>> heavy usage last night, the likes of which caused failures in the past
>> (but I was amazed with the performance, I was able to get 650MB/s
>> writes and 750MB/s reads!!!). The aborts seem to be coming in as
>> steady as they have in the past, which leads me to believe the patch
>> for the extra "bogus target_put_sess_cmd() call" didn't make it in
>> time for the 4.5 release. If it did, this means there are more
>> problems.
>>
>> Here is a snippet from my messages log before ESXi gave up:
>> Mar 16 07:21:57 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1169660
>> Mar 16 07:21:57 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1169616
>> Mar 16 07:21:57 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1169616
>> Mar 16 07:23:20 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1147132
>> Mar 16 07:23:20 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1147132
>> Mar 16 07:23:20 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1147176
>> Mar 16 07:23:24 dracofiler kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1147176
>> Mar 16 07:23:24 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1186556
>>
>> Also, I configured my hosts to send their logs to a syslog server, I
>> have an appointment to go to but I'll pull those and send them to you
>> this afternoon.
>>
>> Thanks
>> Dan
>
> I discovered the ATS heartbeat issue was still causing issues. I have
> created a host profile and applied it to all of my hosts to ensure it
> doesn't come up again. For now there's no reason to dig further with
> this, I will report back whether or not I'm still having the issue in
> the next few days (or sooner if it still fails).
>
> Thanks,
> Dan

Okay, back on track. I'm still seeing these aborts and eventually
losing access to the storage despite running the final 4.5 kernel with
VMFS3.UseATSForHBonVMFS5=0 set on all hosts. According to the log and
what you have explained in the past, I think it still looks like I'm
using ATS heartbeat, but I may be wrong.

Note, it takes a lot longer to fail than it used to, but I can still
trigger the failure by running ATTO repeatedly from a VM.

Also, for comparing between the logs, my timezone is GMT/Zulu -4
(target server is local time, vmkernel.log is Zulu).

Here is my target log from the time period when it finally failed
(ATTO was running from a VM at this time):

Mar 16 23:30:49 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1165392
Mar 16 23:30:51 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1165744
Mar 16 23:30:51 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1165744
Mar 16 23:30:51 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1165832
Mar 16 23:30:51 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1165832
Mar 16 23:31:06 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1180616
Mar 16 23:31:06 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1180616
Mar 16 23:31:09 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1182244
Mar 16 23:31:09 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182244
Mar 16 23:31:09 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1182200
Mar 16 23:31:09 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182200
Mar 16 23:31:11 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1182552
Mar 16 23:31:11 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182552
Mar 16 23:31:11 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1182508
Mar 16 23:31:11 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1182508
Mar 16 23:34:18 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1152236
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1152236
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1161124
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1161124
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1199888
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1199888
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1156680
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1166976
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1168164
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1168164
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1156680
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1156680
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1172784
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1134064
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1134064
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1174720
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1174720
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1134152
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1134152
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1215288
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1185368
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1185632
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1223648
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1192012
Mar 16 23:34:28 dracofiler kernel: Detected MISCOMPARE for addr: ffff88062f84c000 buf: ffff88062d360c00
Mar 16 23:34:28 dracofiler kernel: Target/iblock: Send MISCOMPARE check condition and sense
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1192012
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1201912
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1143128
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1176524
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1172476
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1135824
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1169044
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1173136
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1174060
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1198084
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1174192
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187084
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1146560
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1146604
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1196148
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1152280
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1144844
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1145768
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1146120
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1191748
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1159496
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1159540
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1170848
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1219336
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1219424
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1179692
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1208600
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1186028
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1186160
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1186204
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1186248
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1193508
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1200548
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1151796
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1155712
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1185852
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1200284
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1219028
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1139300
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1150168
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1150212
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1136000
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1136000
Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1218896

Dan
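
P.S. For lining up the target log with vmkernel.log, here is a minimal Python sketch. It assumes the target host logs in local time at a fixed UTC-4 offset (as stated above) and that the year is 2016 (syslog lines carry no year, so that is taken from the mail date). The names parse_abort and tally_responses are made up for illustration and are not part of any existing tool. It converts each "ABORT_TASK: Sending ..." line to a UTC timestamp and counts the TMR response codes.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: target server logs in local time, fixed offset UTC-4.
TARGET_TZ = timezone(timedelta(hours=-4))
# Assumption: syslog omits the year; 2016 comes from the mail date.
LOG_YEAR = 2016

# Matches only the "ABORT_TASK: Sending <response> for ref_tag: <n>" lines.
ABORT_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"ABORT_TASK: Sending (?P<resp>TMR_\w+) for ref_tag: (?P<tag>\d+)$"
)


def parse_abort(line):
    """Return (utc_time, response, ref_tag), or None for any other line."""
    m = ABORT_RE.match(line.strip())
    if m is None:
        return None
    local = datetime.strptime(f"{LOG_YEAR} {m['stamp']}", "%Y %b %d %H:%M:%S")
    utc = local.replace(tzinfo=TARGET_TZ).astimezone(timezone.utc)
    return utc, m["resp"], int(m["tag"])


def tally_responses(lines):
    """Count how often each TMR response code was sent."""
    counts = Counter()
    for line in lines:
        parsed = parse_abort(line)
        if parsed is not None:
            counts[parsed[1]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Mar 16 23:34:18 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1152236",
        "Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1152236",
        "Mar 16 23:34:28 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1161124",
    ]
    for entry in filter(None, map(parse_abort, sample)):
        print(entry[0].isoformat(), entry[1], entry[2])
    print(dict(tally_responses(sample)))
```

With the offset applied, the 23:34:28 local entries fall at 03:34:28 UTC on Mar 17, which is where to look in vmkernel.log. Lines that are not "Sending" responses (the "Found referenced" and MISCOMPARE lines) are skipped on purpose.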
Attachment: vmkernel.log
Description: Binary data