Re: ESX FC host connectivty issues (was target crashes with vSphere 6 hosts)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 2016-02-11 at 20:42 -0500, Dan Lane wrote:
> On Thu, Feb 11, 2016 at 8:19 PM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
> > On Thu, 2016-02-11 at 19:54 -0500, Dan Lane wrote:
> >
> > Top posting..
> SORRY!  I BLAME GOOGLE!!!
> >
> >> Well, looks like it wasn't as stable as we thought...
> >
> > Like I've already said multiple times, you need to find out what
> > component of your FC network is dropping packets.
> >
> >> Here is a clip
> >> from the logs, this is the only thing other than the ABORT_TASK I
> >> could find in the system logs.  Unfortunately I have no idea when it
> >> stopped responding to my hosts.
> >
> > How do you know it's the target that stopped responding..?
> >
> > ESX will eventually take a device offline if it's not consistently
> > getting responses, resulting in constant generation of ABORT_TASKs.
> >
> > Again, it's a clear sign that you're having some manner of FC
> > connectivity issues.
> >
> >>   My friend who was also testing this
> >> had virtually the same results (he also gets the frequent ABORT_TASK
> >> messages).
> >>
> >> Feb 10 20:33:48 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1167636
> >> Feb 10 20:33:48 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1167636
> >> Feb 10 20:34:07 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1183520
> >> Feb 10 20:34:07 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1183520
> >> Feb 10 20:44:31 dracofiler kernel: Unknown VPD Code: 0xc9
> >> Feb 10 20:44:33 dracofiler kernel: Unknown VPD Code: 0xc9
> >> Feb 10 20:44:47 dracofiler kernel: Unknown VPD Code: 0xc9
> >> Feb 10 20:46:35 dracofiler kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
> >> Feb 10 20:49:18 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1140928
> >> Feb 10 20:49:18 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1140928
> >> Feb 10 20:49:19 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1209480
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1209480
> >> Feb 10 20:49:29 dracofiler kernel: Detected MISCOMPARE for addr: ffff88062b253000 buf: ffff88062b6e7c00
> >> Feb 10 20:49:29 dracofiler kernel: Target/iblock: Send MISCOMPARE check condition and sense
> >> Feb 10 20:49:29 dracofiler kernel: Detected MISCOMPARE for addr: ffff880624bac000 buf: ffff88062b6e7c00
> >> Feb 10 20:49:29 dracofiler kernel: Target/iblock: Send MISCOMPARE check condition and sense
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1216828
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187260
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187348
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187392
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187436
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187480
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187524
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187304
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187568
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187656
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187744
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187788
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187832
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1187920
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1188008
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1188052
> >> Feb 10 20:49:29 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1188096
> >> Feb 10 20:51:18 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1202880
> >> Feb 10 20:51:18 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1202880
> >> Feb 10 20:51:18 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1202968
> >> Feb 10 20:51:18 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1202968
> >> Feb 10 20:51:37 dracofiler kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1204244
> >> Feb 10 20:51:37 dracofiler kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1204244
> >>
> >
> > Nothing out of the ordinary here on the target side.
> >
> > So let's start again with the basics.  Please verify the:
> >
> >   - qla2xxx firmware version you're using on the target side.
> >   - FC HBA vendor, model and firmware version on the ESX side.
> >   - The FC switch vendor, model and firmware version.
> >
> 
> Okay, while I work on collecting that information, can you provide
> some insight on the following?
> 

This looks like some rtslib-fb + systemd related failure.

(Adding Andy CC')

> [root@dracofiler init.d]# systemctl stop  target.service
> [root@dracofiler init.d]# systemctl start  target.service
> Job for target.service failed because a timeout was exceeded. See
> "systemctl status target.service" and "journalctl -xe" for details.

Putting a qla2xxx port into target mode (including the ISP reset) can
take on the order of ~10 seconds per port

It looks like your hitting some manner of systemd start timeout.

> [root@dracofiler init.d]# systemctl status target.service
> ● target.service - Restore LIO kernel target configuration
>    Loaded: loaded (/usr/lib/systemd/system/target.service; enabled;
> vendor preset: disabled)
>    Active: failed (Result: timeout) since Thu 2016-02-11 20:23:51 EST; 5min ago
>   Process: 5063 ExecStart=/usr/bin/targetctl restore (code=exited,
> status=1/FAILURE)
>  Main PID: 5063 (code=exited, status=1/FAILURE)
>    CGroup: /system.slice/target.service
>            └─control
>              └─5009 /usr/bin/python3 /usr/bin/targetctl clear
> 
> Feb 11 20:20:51 dracofiler.home.lan target[5063]: self.enable = False
> Feb 11 20:20:51 dracofiler.home.lan target[5063]: File
> "/usr/lib/python3.4/site-packages/rtslib_fb/target.py", line 245, in
> _set_enable
> Feb 11 20:20:51 dracofiler.home.lan target[5063]: raise
> RTSLibError("Cannot change enable state: %s" % e)
> Feb 11 20:20:51 dracofiler.home.lan target[5063]:
> rtslib_fb.utils.RTSLibError: Cannot change enable state: [Errno 1]
> Operation not permitted

This means:

   'echo $VAL > /sys/kernel/config/target/qla2xxx/$WWPN/$TPGT/enable'

is failing with -EPERM.

This typically means userspace is trying to enable target mode on a
endpoint that's already been enabled, or disabling target mode on a
endpoint that's already been disabled.

So yeah, it looks like some type of rtslib-fb + systemd related problem.

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux