Re: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag

Hi Pascal,

Adding the Cavium qla2xxx folks (Himanshu + Quinn) to CC.

On Fri, 2017-06-02 at 14:10 +0200, Pascal de Bruijn wrote:
> Hi,
> 
> 
> We're setting up some storage servers where we're using lio/tcm_qla2xxx 
> to present some volumes via our FibreChannel fabrics to some VMware hosts.
> 
> We have two near identical servers, one connected to two single-switch 
> mini-fabrics, which has been operating fine. (This storage server has 
> two VMware hosts accessing the single LUN it presents).
> 
> The second storage server is connected to a larger multi-switch fabric 
> (with some zoning), which during testing has experienced a lockup, with 
> no clear cause visible on screen. We're still trying to reproduce. (This 
> storage server has nine VMware hosts accessing the single LUN it presents).
> 
> The lockup happened with 4.9.29, now after a minor update, from a new 
> boot, with a slightly updated kernel:
> 
> 
> Linux liohost01 4.9.30#4 SMP Fri Jun 2 10:16:13 CEST 2017 x86_64 GNU/Linux
> 
> 
> 81:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
> 81:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
> 82:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
> 82:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
> 
> 
> [    7.661080] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 8.07.00.38-k.
> [    7.661786] qla2xxx [0000:81:00.0]-001d: : Found an ISP2532 irq 33 iobase 0xffffb874c6365000.
> [    7.872535] scsi host2: qla2xxx
> [    7.876854] qla2xxx [0000:81:00.0]-00fb:2: QLogic QLE2562 -  PCI-Express Dual Channel 8Gb Fibre Channel HBA.
> [    7.876863] qla2xxx [0000:81:00.0]-00fc:2: ISP2532: PCIe (5.0GT/s x8) @ 0000:81:00.0 hdma+ host#=2 fw=8.03.00 (90d5).
> [    7.877122] qla2xxx [0000:81:00.1]-001d: : Found an ISP2532 irq 114 iobase 0xffffb874c6425000.
> [    8.083587] scsi host3: qla2xxx
> [    8.087721] qla2xxx [0000:81:00.1]-00fb:3: QLogic QLE2562 - PCI-Express Dual Channel 8Gb Fibre Channel HBA.
> [    8.087730] qla2xxx [0000:81:00.1]-00fc:3: ISP2532: PCIe (5.0GT/s x8) @ 0000:81:00.1 hdma+ host#=3 fw=8.03.00 (90d5).
> [    8.087953] qla2xxx [0000:82:00.0]-001d: : Found an ISP2532 irq 35 iobase 0xffffb874c6435000.
> [    8.299587] scsi host4: qla2xxx
> [    8.303724] qla2xxx [0000:82:00.0]-00fb:4: QLogic QLE2562 - PCI-Express Dual Channel 8Gb Fibre Channel HBA.
> [    8.303733] qla2xxx [0000:82:00.0]-00fc:4: ISP2532: PCIe (5.0GT/s x8) @ 0000:82:00.0 hdma+ host#=4 fw=8.03.00 (90d5).
> [    8.303948] qla2xxx [0000:82:00.1]-001d: : Found an ISP2532 irq 119 iobase 0xffffb874c6445000.
> [    8.516620] scsi host5: qla2xxx
> [    8.520658] qla2xxx [0000:82:00.1]-00fb:5: QLogic QLE2562 - PCI-Express Dual Channel 8Gb Fibre Channel HBA.
> [    8.520667] qla2xxx [0000:82:00.1]-00fc:5: ISP2532: PCIe (5.0GT/s x8) @ 0000:82:00.1 hdma+ host#=5 fw=8.03.00 (90d5).
> [   30.511280] qla2xxx [0000:82:00.1]-00af:5: Performing ISP error recovery - ha=ffff9a1355130000.
> [   31.716856] qla2xxx [0000:82:00.1]-500a:5: LOOP UP detected (4 Gbps).
> [   35.656645] qla2xxx [0000:81:00.1]-00af:3: Performing ISP error recovery - ha=ffff9a11e6210000.
> [   36.880156] qla2xxx [0000:81:00.1]-500a:3: LOOP UP detected (4 Gbps).
> [   40.776863] qla2xxx [0000:82:00.0]-00af:4: Performing ISP error recovery - ha=ffff9a11e4c90000.
> [   41.993433] qla2xxx [0000:82:00.0]-500a:4: LOOP UP detected (4 Gbps).
> [   46.920062] qla2xxx [0000:81:00.0]-00af:2: Performing ISP error recovery - ha=ffff9a11e6ed0000.
> [   48.146786] qla2xxx [0000:81:00.0]-500a:2: LOOP UP detected (4 Gbps).
> 
> 
> We see some kernel messages on both storage servers:
> 
> [  557.363627] qla2xxx/21:00:00:24:ff:54:a4:b6: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> 
> You've already pointed out elsewhere on the list that this is not a 
> real issue.
> 
> 
> However, on the storage server that experienced the lockup, we do see 
> some kernel messages, that aren't present on the storage server that 
> didn't lock up:
> 
> [  739.250099] ABORT_TASK: Found referenced qla2xxx task_tag: 1184452
> [  739.250101] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1184452
> 
> I've seen this about 80 times over the past three hours.
> 
> I'd appreciate any pointers you could give me as to the nature of the 
> above kernel messages, and whether they warrant further investigation.


While the issues you're observing on the larger multi-switch fabric are
probably not related to VAAI, I wanted to point out a few things.

First, there is a well-known ESX v5.5 u2+ host bug that shows up when
VAAI AtomicTestandSet (COMPARE_AND_WRITE) is used with the VMFS5
defaults, and AtomicTestandSet is enabled by default on all LIO exports.

The short of it is that every vendor recommends disabling
/VMFS3/UseATSForHBonVMFS5 on all ESX hosts as a work-around in order to
get a stable ESX setup with targets that have VAAI AtomicTestandSet
enabled.  More details are here:

https://www.spinics.net/lists/target-devel/msg15173.html

Second, there have been a few reports on the list of people still having
issues with VAAI AtomicTestandSet even with /VMFS3/UseATSForHBonVMFS5
disabled on the ESX host side.  One thing you can try is disabling ATS
on the target side using:

    echo 0 > /sys/kernel/config/target/core/$HBA/$DEV/attrib/emulate_caw

Note that this only stops the COMPARE_AND_WRITE feature bit from being
reported during LUN scan; the target doesn't reject COMPARE_AND_WRITE
commands once it's been disabled.  And since ESX doesn't allow VAAI ATS
enabled VMFS5 volumes to fall back to non VAAI ATS operation, it may not
be an option for you.  However, since some folks have reported better
stability with qla2xxx/FC by explicitly disabling COMPARE_AND_WRITE on
the LIO-Target side, it's worth considering for debugging.
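
If you have more than one backstore device, here is a minimal sketch of
applying the same setting to all of them in one go, instead of one
$HBA/$DEV at a time.  The function name disable_caw and its optional
directory argument are illustrative assumptions, not part of LIO; only
the attrib/emulate_caw path comes from the command above.  Run it as
root on the target.

```shell
#!/bin/sh
# Sketch only: set emulate_caw to 0 for every device found under a
# configfs "core" directory, printing the old and new value of each.
# disable_caw is a hypothetical helper name, not an LIO tool.
#
# Usage: disable_caw                 (default configfs path)
#        disable_caw /some/other/dir (e.g. a scratch copy for testing)
disable_caw() {
    core_dir="${1:-/sys/kernel/config/target/core}"
    for attr in "$core_dir"/*/*/attrib/emulate_caw; do
        # Skip the literal pattern when the glob matched nothing.
        [ -f "$attr" ] || continue
        old=$(cat "$attr")
        echo 0 > "$attr"
        echo "$attr: $old -> $(cat "$attr")"
    done
}
```

Passing a directory argument lets you try the loop against a scratch
tree first.  As with the single echo above, this only changes what is
reported at LUN scan time, so the initiators need to rescan before it
has any effect.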

Beyond that, Himanshu + Quinn can help with the next steps debugging
your issue.

Thanks for reporting.
