On 03-06-17 07:17, Nicholas A. Bellinger wrote:
Hi Pascal,
Adding the Cavium qla2xxx folks (Himanshu + Quinn) to CC'.
On Fri, 2017-06-02 at 14:10 +0200, Pascal de Bruijn wrote:
Hi,
We're setting up some storage servers where we're using lio/tcm_qla2xxx
to present some volumes via our FibreChannel fabrics to some VMware hosts.
We have two near identical servers, one connected to two single-switch
mini-fabrics, which has been operating fine. (This storage server has
two VMware hosts accessing the single LUN it presents).
The second storage server is connected to a larger multi-switch fabric
(with some zoning), which during testing has experienced a lockup, with
no clear cause visible on screen. We're still trying to reproduce. (This
storage server has nine VMware hosts accessing the single LUN it presents).
The lockup happened with 4.9.29, now after a minor update, from a new
boot, with a slightly updated kernel:
Linux liohost01 4.9.30 #4 SMP Fri Jun 2 10:16:13 CEST 2017 x86_64 GNU/Linux
81:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
81:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
82:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
82:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
[ 7.661080] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 8.07.00.38-k.
[ 7.661786] qla2xxx [0000:81:00.0]-001d: : Found an ISP2532 irq 33 iobase 0xffffb874c6365000.
[ 7.872535] scsi host2: qla2xxx
[ 7.876854] qla2xxx [0000:81:00.0]-00fb:2: QLogic QLE2562 - PCI-Express Dual Channel 8Gb Fibre Channel HBA.
[ 7.876863] qla2xxx [0000:81:00.0]-00fc:2: ISP2532: PCIe (5.0GT/s x8) @ 0000:81:00.0 hdma+ host#=2 fw=8.03.00 (90d5).
[ 7.877122] qla2xxx [0000:81:00.1]-001d: : Found an ISP2532 irq 114 iobase 0xffffb874c6425000.
[ 8.083587] scsi host3: qla2xxx
[ 8.087721] qla2xxx [0000:81:00.1]-00fb:3: QLogic QLE2562 - PCI-Express Dual Channel 8Gb Fibre Channel HBA.
[ 8.087730] qla2xxx [0000:81:00.1]-00fc:3: ISP2532: PCIe (5.0GT/s x8) @ 0000:81:00.1 hdma+ host#=3 fw=8.03.00 (90d5).
[ 8.087953] qla2xxx [0000:82:00.0]-001d: : Found an ISP2532 irq 35 iobase 0xffffb874c6435000.
[ 8.299587] scsi host4: qla2xxx
[ 8.303724] qla2xxx [0000:82:00.0]-00fb:4: QLogic QLE2562 - PCI-Express Dual Channel 8Gb Fibre Channel HBA.
[ 8.303733] qla2xxx [0000:82:00.0]-00fc:4: ISP2532: PCIe (5.0GT/s x8) @ 0000:82:00.0 hdma+ host#=4 fw=8.03.00 (90d5).
[ 8.303948] qla2xxx [0000:82:00.1]-001d: : Found an ISP2532 irq 119 iobase 0xffffb874c6445000.
[ 8.516620] scsi host5: qla2xxx
[ 8.520658] qla2xxx [0000:82:00.1]-00fb:5: QLogic QLE2562 - PCI-Express Dual Channel 8Gb Fibre Channel HBA.
[ 8.520667] qla2xxx [0000:82:00.1]-00fc:5: ISP2532: PCIe (5.0GT/s x8) @ 0000:82:00.1 hdma+ host#=5 fw=8.03.00 (90d5).
[ 30.511280] qla2xxx [0000:82:00.1]-00af:5: Performing ISP error recovery - ha=ffff9a1355130000.
[ 31.716856] qla2xxx [0000:82:00.1]-500a:5: LOOP UP detected (4 Gbps).
[ 35.656645] qla2xxx [0000:81:00.1]-00af:3: Performing ISP error recovery - ha=ffff9a11e6210000.
[ 36.880156] qla2xxx [0000:81:00.1]-500a:3: LOOP UP detected (4 Gbps).
[ 40.776863] qla2xxx [0000:82:00.0]-00af:4: Performing ISP error recovery - ha=ffff9a11e4c90000.
[ 41.993433] qla2xxx [0000:82:00.0]-500a:4: LOOP UP detected (4 Gbps).
[ 46.920062] qla2xxx [0000:81:00.0]-00af:2: Performing ISP error recovery - ha=ffff9a11e6ed0000.
[ 48.146786] qla2xxx [0000:81:00.0]-500a:2: LOOP UP detected (4 Gbps).
We see some kernel messages on both storage servers:
[ 557.363627] qla2xxx/21:00:00:24:ff:54:a4:b6: Unsupported SCSI Opcode
0x85, sending CHECK_CONDITION.
You've already pointed out elsewhere on the list that this is not a
real issue.
However, on the storage server that experienced the lockup, we do see
some kernel messages that aren't present on the storage server that
didn't lock up:
[ 739.250099] ABORT_TASK: Found referenced qla2xxx task_tag: 1184452
[ 739.250101] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag:
1184452
I've seen this about 80 times over the past three hours.
I'd appreciate any pointers you could give me as to the nature of the
above kernel messages, and whether they warrant further investigation.
While the issues you're observing on the larger multi-switch fabric are
probably not related to VAAI, I wanted to point out a few things.
First, there is a well-known ESX v5.5u2+ host bug when VAAI
AtomicTestandSet (COMPARE_AND_WRITE) is used with the VMFS5 defaults;
COMPARE_AND_WRITE is enabled by default on all LIO exports.
The short of it is that every vendor recommends disabling
/VMFS3/UseATSForHBonVMFS5 on all ESX hosts as a work-around in order to
get a stable ESX setup with targets that have VAAI AtomicTestandSet
enabled. More details are here:
https://www.spinics.net/lists/target-devel/msg15173.html
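For reference, on the ESX host side this setting can usually be queried
and changed with esxcli, along these lines (a sketch; exact syntax may
vary between ESX releases):

# show the current value of the advanced option
esxcli system settings advanced list -o /VMFS3/UseATSForHBonVMFS5
# disable ATS-based heartbeating on VMFS5
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBonVMFS5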
Second, there have been a few reports on the list of people still having
issues with VAAI AtomicTestandSet even with /VMFS3/UseATSForHBonVMFS5
disabled on the ESX host side. One thing you can try is disabling ATS at
the target side using:
echo 0 > /sys/kernel/config/target/core/$HBA/$DEV/attrib/emulate_caw
Note this only stops the COMPARE_AND_WRITE feature bit from being
reported during LUN scan; it doesn't reject COMPARE_AND_WRITE once it's
been disabled. And since ESX doesn't allow VAAI ATS-enabled VMFS5
volumes to fall back to non-ATS operation, it may not be an option
for you. However, since some folks have reported better stability with
qla2xxx/FC by explicitly disabling COMPARE_AND_WRITE on the LIO-Target
side, it's worth considering for debugging.
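If it helps, a quick way to check the current value across all
configured backstores could look something like this (a sketch; it
assumes the standard configfs layout under
/sys/kernel/config/target/core):

# print the emulate_caw value for every backstore device
for f in /sys/kernel/config/target/core/*/*/attrib/emulate_caw; do
    echo "$f: $(cat "$f")"
done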
Thanks for the heads up. Presumably the cleanest way to handle this is
to set that property via targetcli, so it ends up in saveconfig?:
/backstores/block/mydev set attribute emulate_caw=0
Though the lack of the above should cause issues on the VMware side of
things, right? Not so much on the LIO side of things...
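Presumably the full non-interactive sequence would be something along
these lines (an untested sketch, using the 'mydev' backstore from above):

targetcli /backstores/block/mydev set attribute emulate_caw=0
targetcli saveconfig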
Beyond that, Himanshu + Quinn can help with the next steps debugging
your issue.
We found the following in the remote syslog:
Jun 5 23:45:58 ourhost kernel: [37722.944891]
qla2xxx/21:00:00:24:ff:54:a1:32: Unsupported SCSI Opcode 0x85, sending
CHECK_CONDITION.
Jun 5 23:46:03 ourhost kernel: [37728.246882] ABORT_TASK: Found
referenced qla2xxx task_tag: 1205188
Jun 5 23:46:03 ourhost kernel: [37728.246884] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1205188
Jun 5 23:46:03 ourhost kernel: [37728.246889] ABORT_TASK: Found
referenced qla2xxx task_tag: 1205236
Jun 5 23:46:03 ourhost kernel: [37728.246890] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1205236
Jun 5 23:46:04 ourhost kernel: [37728.976873] BUG: unable to handle
kernel NULL pointer dereference at 00000000000000a8
Jun 5 23:46:04 ourhost kernel: [37728.976950] IP: [<ffffffffc09faad9>]
target_complete_ok_work+0x119/0x2a0 [target_core_mod]
Jun 5 23:46:04 ourhost kernel: [37728.977030] PGD 0
Jun 5 23:46:04 ourhost kernel: [37728.977040]
Jun 5 23:46:04 ourhost kernel: [37728.977074] Oops: 0002 [#1] SMP
Jun 5 23:46:04 ourhost kernel: [37728.977107] Modules linked in:
tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file
target_core_iblock iscsi_target_mod target_core_mod configfs raid456
async_raid6_recov async_memcpy async_pq async_xor xor as
Jun 5 23:46:04 ourhost kernel: [37728.977510] CPU: 5 PID: 4765 Comm:
kworker/5:1 Not tainted 4.9.30-ul4 #4
Jun 5 23:46:04 ourhost kernel: [37728.977551] Hardware name: Supermicro
SYS-1028U-E1CRTP+/X10DRU-i+, BIOS 2.0b 08/09/2016
Jun 5 23:46:04 ourhost kernel: [37728.977621] Workqueue:
target_completion target_complete_ok_work [target_core_mod]
Jun 5 23:46:04 ourhost kernel: [37728.977687] task: ffff897550465e00
task.stack: ffffac0264a5c000
Jun 5 23:46:04 ourhost kernel: [37728.977727] RIP:
0010:[<ffffffffc09faad9>] [<ffffffffc09faad9>]
target_complete_ok_work+0x119/0x2a0 [target_core_mod]
And for the same event we found this on the terminal (this was manually
typed from a screenshot taken with a phone, some content was replaced
with ...):
[<...>] ? __schedule+0x63/0x510
[<...>] ? schedule+0x31/0x80
[<...>] ? do_exit+0x82e/0xa00
[<...>] ? rewind_stack_do_exit+0x17/0x19
ABORT_TASK: Found referenced qla2xxx task_tag: 1218964
ABORT_TASK: Found referenced qla2xxx task_tag: 1241380
mpt3sas_cmd0: _base_event_notification: timeout
mf:
...
...
...
...
sd 1:0:2:0 attempting task abort! scmd(ffff896d51e68d58)
sd 1:0:2:0 [sdd] tag#2118 CDB: ATA command pass through(16) 85 08 2e 00
00 00 01 00 00 00 00 00 00 00 ec 00
scsi target1:0:2: handle(0x000c), sas_address(0x500304801f232d02), ph(2)
scsi target1:0:2: enclosure_logical_id(0x500304801f232d3f), slot(2)
scsi target1:0:2: enclosure_level(0x0000),connector name( )
On the same machine we came across this as well, though not directly
related to the lockup illustrated above:
TARGET_CORE[qla2xxx]: Detected NON_EXISTENT_LUN Access for 0x000000f5
Its sister machine (same hardware) also experienced a lockup this
weekend, though with a more unexpected stack trace:
vprintk_default
put_dec
number
vgacon_scroll
vgacon_set_cursor_size.isra.
atomic_notifier_call_chain
vt_console_print
up
console_unlock
vprintk_emit
vprintk_default
_raw_spin_lock
__schedule
schedule
do_exit
rewind_stack_do_exit
We're going to see if we can arrange for serial console output and
logging, so we have a more complete view of the traces.
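In case it's useful, the plan is to capture the console over serial with
something like the following kernel command line (a sketch; the serial
port and baud rate obviously depend on the hardware):

console=ttyS0,115200n8 console=tty0 ignore_loglevel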
Any guidance toward troubleshooting this more effectively would be much
appreciated in the meantime.
Regards,
Pascal de Bruijn