qla2xxx crashing kernel

Craig Watson <craig.watson@xxxxxxxxxxxxxxxxxxx> · Fri, 05 Sep 2014 19:53:57 -0400

Hi Nicholas,

One of our systems is seeing a kernel BUG with a fibre channel target
system.  The system is an Intel i7 with a qle2562 FC HBA running FC20
with kernel 3.14.4-200.  (We have seen this with several older kernel
versions too).

The LIO component versions are:
    targetcli    2.1.fb35-1
    rtslib       2.1.fb47-1
    configshell  1.1.fb13-1

The bug/crash is repeatable.  It happens every time the LIO target
system is attached to the initiator.  The target is connected to an
older VxWorks initiator system.  This system has been in service for
years and works with several models of FC disk drives.  The new
LIO-based system is an attempt to replace aging hardware with newer,
more reliable, and higher performance storage.

Here is the console output of the kernel BUG:

[ 2669.734186] ------------[ cut here ]------------
[ 2669.738893] kernel BUG at drivers/scsi/qla2xxx/qla_target.c:1343!
[ 2669.745080] invalid opcode: 0000 [#1] SMP
[ 2669.749275] Modules linked in: tcm_qla2xxx target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
cfg80211 rfkill snd_hda_codec_realtek snd_hda_codec_generic
snd_hda_intel snd_hda_codec serio_raw snd_hwdep snd_pcm snd_timer
x86_pkg_temp_thermal snd e1000e ptp iTCO_wdt mei_me iTCO_vendor_support
mei lpc_ich mfd_core raid0 i2c_i801 shpchp pps_core soundcore coretemp
kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel
ghash_clmulni_intel microcode usb_storage ata_generic pata_acpi i915
pata_jmicron qla2xxx i2c_algo_bit drm_kms_helper drm scsi_transport_fc
scsi_tgt i2c_core video
[ 2669.805324] CPU: 1 PID: 558 Comm: kworker/1:0 Not tainted
               3.14.4-200.fc20.x86_64 #1
[ 2669.813052] Hardware name: Intel Corporation ChiefRiver
               Platform/PCOM-B219VG, BIOS R1.00.W3 09/24/2012
[ 2669.822485] Workqueue: qla_tgt_wq qlt_do_work [qla2xxx]
[ 2669.827810] task: ffff8800d28f1c80 ti: ffff8800d3eea000
               task.ti: ffff8800d3eea000
[ 2669.835399] RIP: 0010:[<ffffffffa00edfe4>] [<ffffffffa00edfe4>]
               qlt_pci_map_calc_cnt+0x164/0x1a0 [qla2xxx]
[ 2669.845308] RSP: 0018:ffff8800d3eebba8  EFLAGS: 00010246
[ 2669.850712] RAX: ffff880117d3b400 RBX: ffff880036b9d6f0
               RCX: ffffffffa010a19a
[ 2669.857957] RDX: 0000000000000000 RSI: ffff880036b9d6f0 RDI: 
ffff8800d3eebc00
[ 2669.865190] RBP: ffff8800d3eebbe0 R08: 0000000000000000 R09: 
0000000000000020
[ 2669.872418] R10: 0000000000000002 R11: 0000000000000000 R12: 
ffff8800d3eebc00
[ 2669.879646] R13: ffff8800d3eebc00 R14: ffff8800d1e30000 R15: ffff880117d3
[ 2669.886897] FS:  0000000000000000(0000) GS:ffff88011e240000(0000)
               knlGS:0000000000000000
[ 2669.895102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2669.900937] CR2: 00007f5efb75b000 CR3: 0000000001c0c000 CR4: 
00000000001407e0
[ 2669.908144] Stack:
[ 2669.910187]  000000006acfe608 0000000100000000 ffff880036b9d6f0
                ffff880117d3b400
[ 2669.917797]  ffff8800d3eebc00 ffff8800d1e30000 ffff880117d3b72c
                ffff8800d3eebc78
[ 2669.925387]  ffffffffa00ef933 0000000000000000 0000000000000000
                ffff880117d3b400
[ 2669.932969] Call Trace:
[ 2669.935455]  [<ffffffffa00ef933>] qlt_rdy_to_xfer+0x93/0x210 [qla2xxx]
[ 2669.942070]  [<ffffffffa03c267c>] tcm_qla2xxx_write_pending+0x5c/0x60
                [tcm_qla2xxx]
[ 2669.949855]  [<ffffffffa04ff677>] transport_generic_new_cmd+0x127/0x1a0
                [target_core_mod]
[ 2669.958184]  [<ffffffffa04ff72b>] transport_handle_cdb_direct+0x3b/0x90
                [target_core_mod]
[ 2669.966510]  [<ffffffffa04ff8c2>] target_submit_cmd_map_sgls+0x142/0x280
                [target_core_mod]
[ 2669.974917]  [<ffffffffa04ffa59>] target_submit_cmd+0x59/0x60
                [target_core_mod]
[ 2669.982336]  [<ffffffffa03c25ea>] tcm_qla2xxx_handle_cmd+0x5a/0x90
                [tcm_qla2xxx]
[ 2669.989847]  [<ffffffffa00f0624>] qlt_do_work+0x2e4/0x460 [qla2xxx]
[ 2669.996231]  [<ffffffff810a6886>] process_one_work+0x176/0x430
[ 2670.002163]  [<ffffffff810a74db>] worker_thread+0x11b/0x3a0
[ 2670.007809]  [<ffffffff810a73c0>] ? rescuer_thread+0x370/0x370
[ 2670.013710]  [<ffffffff810ae211>] kthread+0xe1/0x100
[ 2670.018763]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.024943]  [<ffffffff816ff4bc>] ret_from_fork+0x7c/0xb0
[ 2670.031673]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.039056] Code: 0f 1f 80 00 00 00 00 45 8b 4c 24 24 eb be 66 0f 1f 
84 00
                     00 00 00 00 48 c7 45 c8 00 00 00 00 4c 8b 3d 01 a6 
b2 e1
                     e9 18 ff ff ff <0f> 0b 48 8b b0 f8 02 00 00 44 8b 
88 d0
                     02 00 00 45 31 c0 31 c0
[ 2670.062304] RIP  [<ffffffffa00edfe4>] qlt_pci_map_calc_cnt+0x164/0x1a0
                    [qla2xxx]
[ 2670.071041]  RSP <ffff8800d3eebba8>
[ 2670.201729] ---[ end trace ec1db6cbc3980106 ]---
[ 2670.207583] BUG: unable to handle kernel paging request at 
ffffffffffffffd8
[ 2670.215756] IP: [<ffffffff810ae800>] kthread_data+0x10/0x20
[ 2670.222563] PGD 1c0f067 PUD 1c11067 PMD 0
[ 2670.227968] Oops: 0000 [#2] SMP
[ 2670.232458] Modules linked in: tcm_qla2xxx target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod 
cfg80211
rfkill snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel 
snd_hda_codec
serio_raw snd_hwdep snd_pcm snd_timer x86_pkg_temp_thermal snd e1000e ptp
iTCO_wdt mei_me iTCO_vendor_support mei lpc_ich mfd_core raid0 i2c_i801 
shpchp
pps_core soundcore coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul
crc32c_intel ghash_clmulni_intel microcode usb_storage ata_generic pata_acpi
i915 pata_jmicron qla2xxx i2c_algo_bit drm_kms_helper drm scsi_transport_fc
scsi_tgt i2c_core video
[ 2670.293550] CPU: 1 PID: 558 Comm: kworker/1:0 Tainted: G D
               3.14.4-200.fc20.x86_64 #1
[ 2670.303354] Hardware name: Intel Corporation ChiefRiver 
Platform/PCOM-B219VG,
               BIOS R1.00.W3 09/24/2012
[ 2670.313751] task: ffff8800d28f1c80 ti: ffff8800d3eea000
               task.ti: ffff8800d3eea000
[ 2670.322257] RIP: 0010:[<ffffffff810ae800>] [<ffffffff810ae800>]
               kthread_data+0x10/0x20
[ 2670.331302] RSP: 0018:ffff8800d3eeb880  EFLAGS: 00010002
[ 2670.337864] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 
000000000000000f
[ 2670.346306] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 
ffff8800d28f1c80
[ 2670.354446] RBP: ffff8800d3eeb880 R08: ffff8800d28f1d10 R09: 
00000001800f000c
[ 2670.362579] R10: ffffffff8108760c R11: ffffea00034fc200 R12: 
ffff88011e254640
[ 2670.370972] R13: 0000000000000001 R14: ffff8800d28f1c70 R15: 
ffff8800d28f1c80
[ 2670.379385] FS:  0000000000000000(0000) GS:ffff88011e240000(0000)
               knlGS:0000000000000000
[ 2670.388826] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2670.395561] CR2: 0000000000000028 CR3: 0000000001c0c000 CR4: 
00000000001407e0
[ 2670.403681] Stack:
[ 2670.406688]  ffff8800d3eeb898 ffffffff810a7b91 ffff8800d28f1c80
                ffff8800d3eeb8f8
[ 2670.415307]  ffffffff816f2b89 ffff8800d28f1c80 ffff8800d3eebfd8
                0000000000014640
[ 2670.424064]  0000000000014640 ffff8800d28f1c80 ffff8800d28f2318
                ffff8800d3eeb5d0
[ 2670.432560] Call Trace:
[ 2670.436195]  [<ffffffff810a7b91>] wq_worker_sleeping+0x11/0x90
[ 2670.443065]  [<ffffffff816f2b89>] __schedule+0x4a9/0x740
[ 2670.449512]  [<ffffffff816f2e49>] schedule+0x29/0x70
[ 2670.455693]  [<ffffffff8108cb47>] do_exit+0x6b7/0xa30
[ 2670.461709]  [<ffffffff816f758c>] oops_end+0x9c/0xe0
[ 2670.467649]  [<ffffffff81017fcb>] die+0x4b/0x70
[ 2670.473359]  [<ffffffff816f6e20>] do_trap+0x60/0x170
[ 2670.479445]  [<ffffffff810150aa>] do_invalid_op+0xaa/0xe0
[ 2670.485816]  [<ffffffffa00edfe4>] ? qlt_pci_map_calc_cnt+0x164/0x1a0
                [qla2xxx]
[ 2670.494264]  [<ffffffff8117e148>] ? __alloc_pages_nodemask+0x188/0xb60
[ 2670.501765]  [<ffffffff81700c9e>] invalid_op+0x1e/0x30
[ 2670.508219]  [<ffffffffa00edfe4>] ? qlt_pci_map_calc_cnt+0x164/0x1a0
                [qla2xxx]
[ 2670.516726]  [<ffffffffa00ef933>] qlt_rdy_to_xfer+0x93/0x210 [qla2xxx]
[ 2670.524462]  [<ffffffffa03c267c>] tcm_qla2xxx_write_pending+0x5c/0x60
                [tcm_qla2xxx]
[ 2670.533387]  [<ffffffffa04ff677>] transport_generic_new_cmd+0x127/0x1a0
                [target_core_mod]
[ 2670.542841]  [<ffffffffa04ff72b>] transport_handle_cdb_direct+0x3b/0x90
                [target_core_mod]
[ 2670.552307]  [<ffffffffa04ff8c2>] target_submit_cmd_map_sgls+0x142/0x280
                [target_core_mod]
[ 2670.561778]  [<ffffffffa04ffa59>] target_submit_cmd+0x59/0x60
                [target_core_mod]
[ 2670.570364]  [<ffffffffa03c25ea>] tcm_qla2xxx_handle_cmd+0x5a/0x90
                [tcm_qla2xxx]
[ 2670.578795]  [<ffffffffa00f0624>] qlt_do_work+0x2e4/0x460 [qla2xxx]
[ 2670.586343]  [<ffffffff810a6886>] process_one_work+0x176/0x430
[ 2670.593454]  [<ffffffff810a74db>] worker_thread+0x11b/0x3a0
[ 2670.600058]  [<ffffffff810a73c0>] ? rescuer_thread+0x370/0x370
[ 2670.606910]  [<ffffffff810ae211>] kthread+0xe1/0x100
[ 2670.612885]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.620058]  [<ffffffff816ff4bc>] ret_from_fork+0x7c/0xb0
[ 2670.626578]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.633697] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 
66 2e
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 28 04 00 00 55 48 89 e5 
<48> 8b
40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[ 2670.656112] RIP  [<ffffffff810ae800>] kthread_data+0x10/0x20
[ 2670.662623]  RSP <ffff8800d3eeb880>
[ 2670.666936] CR2: ffffffffffffffd8
[ 2670.671170] ---[ end trace ec1db6cbc3980107 ]---
[ 2670.676804] Fixing recursive fault but reboot is needed!
[ 2691.628643] ------------[ cut here ]------------

Looking at protocol traces on a Fibre Channel Protocol Analyzer, a Report
LUNs command requesting 0x1008 bytes of data (FCP_DL=0x1008) appears to
trigger the crash.

It is a rare target that has 0x1008 (5004) bytes of LUN data to return.
What should happen is the actual data available should be returned and a
residual under flag be set in the response along with the residual under
field being set to the number of bytes not returned.  This system
actually is occasionally attached to a large storage array and
presumably that is why the large buffer is allocated for LUNs responses.

Since many of the Qla2xxx operations seem to pad (with zeros) the
responses out to whatever the requested data length was, maybe this is
overrunning a 4k buffer somewhere?  Note, the responses should not be
padding and returning more data than what is actually available but this
is what we are seeing on the analyzer.  The data length on the request
is specifying a maximum to return, not how much has to be returned.

So, this appears to be a combination of 2 problems.  1) Returns are
padding with zeros to the size of the request rather than truncating to
the size of actual data on hand. 2) Apparently a 4k buffer is being
overrun and crashing the system when a 0x1008 (4k + 8) Request for LUNs
command is received (this is speculation on my part).

Note the target system only has 1 LUN and should never come close to
returning 4k of data.  What we actually see are 2 response PDUs with 2k
(LEX=0x0800) of data.  This is mostly 0's, only the first 16-24 bytes
has information.  These are followed by a response PDU with 8 bytes (of
zeros).  The system then crashes hard.

We have traces of the initiator system connected to a Hitachi FC disk
drive where the same Report LUNs command (w/FCP_DL=0x1008) receives a
single response with 16 bytes of data (Len=0x0010).  The drive functions
fine after this.

Looking at another initiator system and our LIO target, we see Requests
for LUNs asking for approximately 2k of data and the LIO-based target
system responds without crashing but does return approximately 2k of
data (mostly zeros).

Please let me know if I can provide any more information to facilitate
getting this fixed.  Access to the systems is extremely limited though.

Thank you,

Craig Watson

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html