Re: ESX FC host connectivity issues (was target crashes with vSphere 6 hosts)

On Fri, Feb 12, 2016 at 4:31 AM, Nicholas A. Bellinger
<nab@xxxxxxxxxxxxxxx> wrote:
> On Thu, 2016-02-11 at 22:33 -0500, Dan Lane wrote:
>> On Thu, Feb 11, 2016 at 9:44 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
>> >
>> > On Feb 11, 2016 8:58 PM, "Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx>
>> > wrote:
>> >>
>> >> On Thu, 2016-02-11 at 20:42 -0500, Dan Lane wrote:
>> >> > On Thu, Feb 11, 2016 at 8:19 PM, Nicholas A. Bellinger
>> >> > <nab@xxxxxxxxxxxxxxx> wrote:
>> >> > > On Thu, 2016-02-11 at 19:54 -0500, Dan Lane wrote:
>> >> > >
>> >> > > Top posting..
>> >> > SORRY!  I BLAME GOOGLE!!!
>> >> > >
>> >> > >> Well, looks like it wasn't as stable as we thought...
>> >> > >
>> >> > > Like I've already said multiple times, you need to find out what
>> >> > > component of your FC network is dropping packets.
>> >> > >
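>> >> > > FWIW, a quick first check on the target side (assuming the standard
>> >> > > fc_host sysfs statistics are exposed; the counters are reported in
>> >> > > hex) is whether the link error counters are climbing:
>> >> > >
>> >> > >    # run twice, a few minutes apart, under load
>> >> > >    for h in /sys/class/fc_host/host*; do
>> >> > >        echo "== $h =="
>> >> > >        grep . $h/statistics/link_failure_count \
>> >> > >               $h/statistics/loss_of_signal_count \
>> >> > >               $h/statistics/invalid_crc_count
>> >> > >    done
>> >> > >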
>> >> > >> Here is a clip
>> >> > >> from the logs; this is the only thing other than the ABORT_TASK
>> >> > >> messages I could find in the system logs.  Unfortunately I have no
>> >> > >> idea when it stopped responding to my hosts.
>> >> > >
>> >> > > How do you know it's the target that stopped responding..?
>> >> > >
>> >> > > ESX will eventually take a device offline if it's not consistently
>> >> > > getting responses, resulting in constant generation of ABORT_TASKs.
>> >> > >
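>> >> > > On the ESX side you can usually watch that happening live from the
>> >> > > ESXi shell (log name and format assuming ESXi 5.x or later):
>> >> > >
>> >> > >    # aborts and non-zero SCSI host status codes as they occur
>> >> > >    tail -f /var/log/vmkernel.log | grep -iE 'abort|H:0x[1-9]'
>> >> > >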
>> >> > > Again, it's a clear sign that you're having some manner of FC
>> >> > > connectivity issues.
>> >> > >
>> >> > >>   My friend who was also testing this
>> >> > >> had virtually the same results (he also gets the frequent ABORT_TASK
>> >> > >> messages).
>> >> > >>
>> >> > >> Feb 10 20:33:48 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1167636
>> >> > >> Feb 10 20:33:48 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1167636
>> >> > >> Feb 10 20:34:07 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1183520
>> >> > >> Feb 10 20:34:07 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1183520
>> >> > >> Feb 10 20:44:31 dracofiler kernel: Unknown VPD Code: 0xc9
>> >> > >> Feb 10 20:44:33 dracofiler kernel: Unknown VPD Code: 0xc9
>> >> > >> Feb 10 20:44:47 dracofiler kernel: Unknown VPD Code: 0xc9
>> >> > >> Feb 10 20:46:35 dracofiler kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
>> >> > >> Feb 10 20:49:18 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1140928
>> >> > >> Feb 10 20:49:18 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1140928
>> >> > >> Feb 10 20:49:19 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1209480
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1209480
>> >> > >> Feb 10 20:49:29 dracofiler kernel: Detected MISCOMPARE for addr: ffff88062b253000 buf: ffff88062b6e7c00
>> >> > >> Feb 10 20:49:29 dracofiler kernel: Target/iblock: Send MISCOMPARE check condition and sense
>> >> > >> Feb 10 20:49:29 dracofiler kernel: Detected MISCOMPARE for addr: ffff880624bac000 buf: ffff88062b6e7c00
>> >> > >> Feb 10 20:49:29 dracofiler kernel: Target/iblock: Send MISCOMPARE check condition and sense
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1216828
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187260
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187348
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187392
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187436
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187480
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187524
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187304
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187568
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187656
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187744
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187788
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187832
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187920
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1188008
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1188052
>> >> > >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1188096
>> >> > >> Feb 10 20:51:18 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1202880
>> >> > >> Feb 10 20:51:18 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1202880
>> >> > >> Feb 10 20:51:18 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1202968
>> >> > >> Feb 10 20:51:18 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1202968
>> >> > >> Feb 10 20:51:37 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1204244
>> >> > >> Feb 10 20:51:37 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1204244
>> >> > >>
>> >> > >
>> >> > > Nothing out of the ordinary here on the target side.
>> >> > >
>> >> > > So let's start again with the basics.  Please verify the following
>> >> > > (a sketch for collecting these is below the list):
>> >> > >
>> >> > >   - qla2xxx firmware version you're using on the target side.
>> >> > >   - FC HBA vendor, model and firmware version on the ESX side.
>> >> > >   - The FC switch vendor, model and firmware version.
>> >> > >
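>> >> > > For the target side, something like:
>> >> > >
>> >> > >    # driver + running firmware version (note the fw= string)
>> >> > >    dmesg | grep -i 'qla2xxx.*fw='
>> >> > >    cat /sys/class/fc_host/host*/symbolic_name
>> >> > >
>> >> > > On the ESX side, 'esxcli storage core adapter list' should at least
>> >> > > identify the HBAs; the HBA firmware version usually comes from the
>> >> > > vendor's own tool.
>> >> > >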
>> >> >
>> >> > Okay, while I work on collecting that information, can you provide
>> >> > some insight on the following?
>> >> >
>> >>
>> >> This looks like some rtslib-fb + systemd related failure.
>> >>
>> >> (Adding Andy CC')
>> >>
>> >> > [root@dracofiler init.d]# systemctl stop  target.service
>> >> > [root@dracofiler init.d]# systemctl start  target.service
>> >> > Job for target.service failed because a timeout was exceeded. See
>> >> > "systemctl status target.service" and "journalctl -xe" for details.
>> >>
>> >> Putting a qla2xxx port into target mode (including the ISP reset) can
>> >> take on the order of ~10 seconds per port.
>> >>
>> >> It looks like you're hitting some manner of systemd start timeout.
>> >>
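>> >> If that's what it is, one workaround (untested here) would be to bump
>> >> the unit's start timeout with a systemd drop-in, e.g.:
>> >>
>> >>    mkdir -p /etc/systemd/system/target.service.d
>> >>    printf '[Service]\nTimeoutStartSec=5min\n' \
>> >>        > /etc/systemd/system/target.service.d/timeout.conf
>> >>    systemctl daemon-reload
>> >>    systemctl restart target.service
>> >>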
>> >> > [root@dracofiler init.d]# systemctl status target.service
>> >> > ● target.service - Restore LIO kernel target configuration
>> >> >    Loaded: loaded (/usr/lib/systemd/system/target.service; enabled; vendor preset: disabled)
>> >> >    Active: failed (Result: timeout) since Thu 2016-02-11 20:23:51 EST; 5min ago
>> >> >   Process: 5063 ExecStart=/usr/bin/targetctl restore (code=exited, status=1/FAILURE)
>> >> >  Main PID: 5063 (code=exited, status=1/FAILURE)
>> >> >    CGroup: /system.slice/target.service
>> >> >            └─control
>> >> >              └─5009 /usr/bin/python3 /usr/bin/targetctl clear
>> >> >
>> >> > Feb 11 20:20:51 dracofiler.home.lan target[5063]: self.enable = False
>> >> > Feb 11 20:20:51 dracofiler.home.lan target[5063]: File "/usr/lib/python3.4/site-packages/rtslib_fb/target.py", line 245, in _set_enable
>> >> > Feb 11 20:20:51 dracofiler.home.lan target[5063]: raise RTSLibError("Cannot change enable state: %s" % e)
>> >> > Feb 11 20:20:51 dracofiler.home.lan target[5063]: rtslib_fb.utils.RTSLibError: Cannot change enable state: [Errno 1] Operation not permitted
>> >>
>> >> This means:
>> >>
>> >>    'echo $VAL > /sys/kernel/config/target/qla2xxx/$WWPN/$TPGT/enable'
>> >>
>> >> is failing with -EPERM.
>> >>
>> >> This typically means userspace is trying to enable target mode on an
>> >> endpoint that's already been enabled, or to disable target mode on an
>> >> endpoint that's already been disabled.
>> >>
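>> >> You can tell which case you've hit by reading the state back before
>> >> toggling it:
>> >>
>> >>    # 1 == already enabled, 0 == already disabled
>> >>    cat /sys/kernel/config/target/qla2xxx/$WWPN/$TPGT/enable
>> >>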
>> >> So yeah, it looks like some type of rtslib-fb + systemd related problem.
>> >>
>> > Is this likely what's causing my problems (as opposed to just a side
>> > effect of the service being in this state), or should I continue
>> > collecting the information you asked for?
>> >
>
> It's not magic; rtslib-fb and systemd have nothing to do with dropped
> packets on your physical FC layer.  ;)
>
> So yeah, you'll want to collect everything about your setup in order to
> debug this further.  Otherwise, get your hands on an FC analyzer and
> figure out what's going on.
>
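> Short of an analyzer, the per-port error counters on the FC switch are
> the easiest place to look; e.g. on a Brocade switch:
>
>    # CRC / enc errors climbing on one port point at that cable or SFP
>    porterrshow
>
> (the rough Cisco MDS equivalent is 'show interface counters').
>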
>> > Also, as I understood it, the driver loads the latest firmware on
>> > initialization, regardless of what's burned on the card.  Is this
>> > incorrect?
>>
>
> Correct, the qla2xxx LLD loads ql2XXX_fw.bin from /lib/firmware/, as
> provided by your distro, at run-time.
>
> It's the 'fw=' string in dmesg:
>
> [    9.351902] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 8.07.00.26-k.
> [    9.352060] qla2xxx [0000:08:00.0]-001d: : Found an ISP2532 irq 31 iobase 0xffffc90000096000.
> [    9.601461] scsi host8: qla2xxx
> [    9.602059] qla2xxx [0000:08:00.0]-00fb:8: QLogic QLE2562 - PCI-Express Dual Channel 8Gb Fibre Channel HBA.
> [    9.602066] qla2xxx [0000:08:00.0]-00fc:8: ISP2532: PCIe (5.0GT/s x8) @ 0000:08:00.0 hdma+ host#=8 fw=7.03.00 (90d5).
> [    9.602248] qla2xxx [0000:08:00.1]-001d: : Found an ISP2532 irq 33 iobase 0xffffc9000008c000.
> [    9.849439] scsi host9: qla2xxx
> [    9.853751] qla2xxx [0000:08:00.1]-00fb:9: QLogic QLE2562 - PCI-Express Dual Channel 8Gb Fibre Channel HBA.
> [    9.864785] qla2xxx [0000:08:00.1]-00fc:9: ISP2532: PCIe (5.0GT/s x8) @ 0000:08:00.1 hdma+ host#=9 fw=7.03.00 (90d5).
> [ 1179.307251] qla2xxx [0000:08:00.0]-00af:8: Performing ISP error recovery - ha=ffff880818f00000.
>
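> You can check which image the LLD will request, and whether your distro
> actually ships it, with:
>
>    modinfo qla2xxx | grep '^firmware'
>    ls /lib/firmware/ql2*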
>

I'm at the point now where LIO/target fails to start correctly no
matter what I do.  This happens even with nothing connected to the FC
cards and the configuration file (and backups) completely deleted; all
I have is the new basic saveconfig.json that's created when I run
targetcli.  When I try running "service target start", it just sits
there... literally for days (I forgot I was working on this, lol).
There's nothing in any logs.

Can someone please recommend the best approach to resetting LIO/target
to the default state?  I figured I just need to uninstall and
reinstall, but I decided to inquire here first just in case...
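
The closest I've come to a clean reset short of reinstalling (assuming
targetcli-fb's default paths, so this is a sketch rather than a recipe)
is clearing the running config and wiping the saved one:

    targetctl clear                    # tear down the running kernel config
    rm -f /etc/target/saveconfig.json  # default targetcli-fb save location
    rm -rf /etc/target/backup          # and its saved backups
    systemctl restart target.service

but that still leaves me hanging on start.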

Thanks