Hi,
We're setting up some storage servers that use lio/tcm_qla2xxx to present
volumes over our Fibre Channel fabrics to a number of VMware hosts.
We have two near-identical servers. The first is connected to two
single-switch mini-fabrics and has been operating fine. (This storage
server has two VMware hosts accessing the single LUN it presents.)
The second storage server is connected to a larger multi-switch fabric
(with some zoning). During testing this server experienced a lockup, with
no clear cause visible on the console; we're still trying to reproduce it.
(This storage server has nine VMware hosts accessing the single LUN it
presents.)
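For context, each LUN export boils down to a configfs layout roughly like
the sketch below. This is only an illustration of what the configured state
amounts to, not our actual configuration: the backstore name, device path
and WWPNs are made up, and the paths are from memory.

  from pathlib import Path

  cfg = Path("/sys/kernel/config/target")

  # iblock backstore backed by a block device (device path is made up)
  bs = cfg / "core/iblock_0/vmware_lun0"
  bs.mkdir(parents=True)
  (bs / "control").write_text("udev_path=/dev/vg0/vmware_lun0")
  (bs / "enable").write_text("1")

  # tcm_qla2xxx target port, keyed by the local HBA port WWPN (placeholder)
  tpg = cfg / "qla2xxx/naa.2100xxxxxxxxxxxx/tpgt_1"
  (tpg / "lun/lun_0").mkdir(parents=True)
  (tpg / "lun/lun_0/vmware_lun0").symlink_to(bs)

  # one ACL per VMware host initiator WWPN (placeholder), mapping LUN 0
  acl = tpg / "acls/naa.2101xxxxxxxxxxxx"
  (acl / "lun_0").mkdir(parents=True)
  (acl / "lun_0/vmware_lun0").symlink_to(tpg / "lun/lun_0")

  (tpg / "enable").write_text("1")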
The lockup happened with 4.9.29. After a minor update we are now running
from a fresh boot with a slightly newer kernel:
Linux liohost01 4.9.30 #4 SMP Fri Jun 2 10:16:13 CEST 2017 x86_64 GNU/Linux
81:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to
PCI Express HBA (rev 02)
81:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to
PCI Express HBA (rev 02)
82:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to
PCI Express HBA (rev 02)
82:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to
PCI Express HBA (rev 02)
[ 7.661080] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA
Driver: 8.07.00.38-k.
[ 7.661786] qla2xxx [0000:81:00.0]-001d: : Found an ISP2532 irq 33
iobase 0xffffb874c6365000.
[ 7.872535] scsi host2: qla2xxx
[ 7.876854] qla2xxx [0000:81:00.0]-00fb:2: QLogic QLE2562 -
PCI-Express Dual Channel 8Gb Fibre Channel HBA.
[ 7.876863] qla2xxx [0000:81:00.0]-00fc:2: ISP2532: PCIe (5.0GT/s x8)
@ 0000:81:00.0 hdma+ host#=2 fw=8.03.00 (90d5).
[ 7.877122] qla2xxx [0000:81:00.1]-001d: : Found an ISP2532 irq 114
iobase 0xffffb874c6425000.
[ 8.083587] scsi host3: qla2xxx
[ 8.087721] qla2xxx [0000:81:00.1]-00fb:3: QLogic QLE2562 -
PCI-Express Dual Channel 8Gb Fibre Channel HBA.
[ 8.087730] qla2xxx [0000:81:00.1]-00fc:3: ISP2532: PCIe (5.0GT/s x8)
@ 0000:81:00.1 hdma+ host#=3 fw=8.03.00 (90d5).
[ 8.087953] qla2xxx [0000:82:00.0]-001d: : Found an ISP2532 irq 35
iobase 0xffffb874c6435000.
[ 8.299587] scsi host4: qla2xxx
[ 8.303724] qla2xxx [0000:82:00.0]-00fb:4: QLogic QLE2562 -
PCI-Express Dual Channel 8Gb Fibre Channel HBA.
[ 8.303733] qla2xxx [0000:82:00.0]-00fc:4: ISP2532: PCIe (5.0GT/s x8)
@ 0000:82:00.0 hdma+ host#=4 fw=8.03.00 (90d5).
[ 8.303948] qla2xxx [0000:82:00.1]-001d: : Found an ISP2532 irq 119
iobase 0xffffb874c6445000.
[ 8.516620] scsi host5: qla2xxx
[ 8.520658] qla2xxx [0000:82:00.1]-00fb:5: QLogic QLE2562 -
PCI-Express Dual Channel 8Gb Fibre Channel HBA.
[ 8.520667] qla2xxx [0000:82:00.1]-00fc:5: ISP2532: PCIe (5.0GT/s x8)
@ 0000:82:00.1 hdma+ host#=5 fw=8.03.00 (90d5).
[ 30.511280] qla2xxx [0000:82:00.1]-00af:5: Performing ISP error
recovery - ha=ffff9a1355130000.
[ 31.716856] qla2xxx [0000:82:00.1]-500a:5: LOOP UP detected (4 Gbps).
[ 35.656645] qla2xxx [0000:81:00.1]-00af:3: Performing ISP error
recovery - ha=ffff9a11e6210000.
[ 36.880156] qla2xxx [0000:81:00.1]-500a:3: LOOP UP detected (4 Gbps).
[ 40.776863] qla2xxx [0000:82:00.0]-00af:4: Performing ISP error
recovery - ha=ffff9a11e4c90000.
[ 41.993433] qla2xxx [0000:82:00.0]-500a:4: LOOP UP detected (4 Gbps).
[ 46.920062] qla2xxx [0000:81:00.0]-00af:2: Performing ISP error
recovery - ha=ffff9a11e6ed0000.
[ 48.146786] qla2xxx [0000:81:00.0]-500a:2: LOOP UP detected (4 Gbps).
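(In case it's relevant, the per-port link state and negotiated speed can be
read back from the standard fc_host sysfs attributes; a minimal sketch,
assuming the usual /sys/class/fc_host layout:)

  from pathlib import Path

  # Print WWPN, link state and negotiated speed for each FC host port
  for host in sorted(Path("/sys/class/fc_host").iterdir()):
      port_name = (host / "port_name").read_text().strip()
      state = (host / "port_state").read_text().strip()
      speed = (host / "speed").read_text().strip()
      print("{}: {} {} {}".format(host.name, port_name, state, speed))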
We see some kernel messages on both storage servers:
[ 557.363627] qla2xxx/21:00:00:24:ff:54:a4:b6: Unsupported SCSI Opcode
0x85, sending CHECK_CONDITION.
You've already pointed out elsewhere on the list that this is not a
real issue.
However, on the storage server that experienced the lockup, we do see
some kernel messages that aren't present on the storage server that
didn't lock up:
[ 739.250099] ABORT_TASK: Found referenced qla2xxx task_tag: 1184452
[ 739.250101] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag:
1184452
I've seen this about 80 times over the past three hours.
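(For reference, a rough way to tally these is just to count matching lines
in the kernel ring buffer, as sketched below; since the buffer may have
wrapped, this only gives a lower bound.)

  import subprocess

  # Tally the ABORT_TASK TMR responses still present in the kernel ring buffer
  dmesg = subprocess.check_output(["dmesg"], universal_newlines=True)
  hits = [line for line in dmesg.splitlines()
          if "ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST" in line]
  print("%d ABORT_TASK responses in the ring buffer" % len(hits))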
I'd appreciate any pointers you could give me as to the nature of the
above kernel messages, and whether they warrant further investigation.
Regards,
Pascal de Bruijn