Re: qla2xxx crashing kernel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Nicholas,

I'm sorry it has taken me so long to respond and follow up on this. I recently had shoulder surgery and have been under the weather. Doing better now.

Unfortunately I don't have direct access to the crashing system. It has taken almost a year just to get the protocol traces and console dump I have received.

The problem I am faced with is something is causing the kernel to crash and there aren't any zero-length commands after the Report LUNs command. Comparing 3 or 4 systems, the only anomaly we were able to identify in the crashing system was the large (and incorrect) response to the Report LUNs command. Other systems receiving less data (smaller requesting FCP_DL) in the response worked without problems. This is the only discernible difference in the traffic between a system that works and one that crashes. Everyone is telling me the BUG is caused by a zero-length command but I don't see one in the protocol trace. Since my quick and uninformed perusal of the code turned up a 4096 byte buffer, I theorized that a buffer overrun was possible. Is it possible that a buffer overrun is corrupting the logic and looking like a zero-length command? Probably not very likely.

I am not insisting this is a buffer overrun, but since there aren't any zero-length commands, I'm just asking for someone who understands the code to look at what is going on. Actually, since the 5004 byte response is broken into 3 packets that do get returned successfully, I'm not real confident of the buffer overrun theory. I just can't see a more likely scenario.

Note: Roland Dreier's fix (commit 2426bd456a target: Report correct response length for some commands) of the padding responses is not in the kernel version(s) we have tested. Sometime in the future we'll be able to roll out an update and see if it works. This might actually "short-circuit" my theorized buffer overrun and fix my immediate problem.

Here's the details from the protocol analyzer of the crash with a few comments thrown in. Note the only exchange after the large Report LUNs command does not appear to be a zero-length command.


At time index = 00:41.005_695_148 Port(1:1:1)/Side A, the Initiator sends our device, the LIO target unit, "Report Luns" with FC_DL = 0x00001008 or 5004 Bytes.

00:41.005_695_148 315.212 FC Port(1:1:1) FC4Cmd Report LUNs; LUNs with specific address methods; LUN = 0x0000; FCP_DL = 0x00001008; 68 0000EF 0000E8 0000 0005 FFFF 0000 Originator; First_Sequence; End_Sequence; Transfer Sequence Initiative; 00000000 8B3E8E84 (Correct) EOFt(+)

Our device, the LIO target unit, responds by sending the available data for it's two LUNS (approx. 24 Bytes) and then fills the balance of the requested 5004 Bytes with zeros. The LIO target unit then terminates this exchange with "Good Status".

00:41.005_859_180 164.032 FC Port(1:1:2) FC4SData FC4SData; SCSI FCP; Offset = 0x00000000; Len = 0x0800; 2084 0000E8 0000EF 0005 0102 0000 Responder; RO; 00000000 58C3EBB3 (Correct) EOFn(-) 00:41.005_864_152 4.972 FC Port(1:1:2) FC4SData FC4SData; SCSI FCP; Offset = 0x00000800; Len = 0x0800; 2084 0000E8 0000EF 0005 0102 0001 Responder; RO; 00000800 557D64B9 (Correct) EOFn(-) 00:41.005_869_120 4.968 FC Port(1:1:2) FC4SData FC4SData; SCSI FCP; Offset = 0x00001000; Len = 0x0008; 44 0000E8 0000EF 0005 0102 0002 Responder; End_Sequence; RO; 00001000 767079BC (Correct) EOFt(+) 00:41.005_869_664 0.544 FC Port(1:1:2) FC4Status Good Status; 60 0000E8 0000EF 0005 0102 0000 Responder; Last_Sequence; End_Sequence; 00000000 021F63DE (Correct) EOFt(-) 00:41.006_107_404 237.740 FC Port(1:1:1) ExtLinkReq ExtLinkReq; EX_LNK_SRV; ADISC; 64 0000EF 0000E8 0006 FFFF 0000 Originator; First_Sequence; End_Sequence; Transfer Sequence Initiative; 00000000 1C806BCC (Correct) EOFt(-) Interesting Event Found 00:41.006_214_424 107.020 FC Port(1:1:2) ExtLinkRply ExtLinkRply; EX_LNK_SRV; Accept; 64 0000E8 0000EF 0006 0006 0000 Responder; Last_Sequence; End_Sequence; 00000000 C2C18DC8 (Correct) EOFt(+)


At time index = 00:41.006_323_460 Port(1:1:1)/Side A, the Initiator sends our LIO target unit "Inquiry" Lun = 0x000 with FCP_DL = 0x00000090.

00:41.006_323_460 109.036 FC Port(1:1:1) FC4Cmd Inquiry; LUN = 0x0000; FCP_DL = 0x00000090; 68 0000EF 0000E8 0000 0007 FFFF 0000 Originator; First_Sequence; End_Sequence; Transfer Sequence Initiative; 00000000 C10AA6D0 (Correct) EOFt(-)

Our device, the LIO target unit, responds with 0x0090 Bytes of Data. The LIO target unit then terminates this exchange with "Good Status".

00:41.006_450_404 126.944 FC Port(1:1:2) FC4SData FC4SData; SCSI FCP; Offset = 0x00000000; Len = 0x0090; 180 0000E8 0000EF 0007 0103 0000 Responder; End_Sequence; RO; 00000000 69B3CC1E (Correct) EOFt(+) 00:41.006_450_892 0.488 FC Port(1:1:2) FC4Status Good Status; 60 0000E8 0000EF 0007 0103 0000 Responder; Last_Sequence; End_Sequence; 00000000 8417ABA0 (Correct) EOFt(+)

After this exchange, the LIO target unit no longer responds to the initiator as shown by the trace lines below, because the Linux kernel has crashed.

Eventually the Initiator logs off the link (LOGO) at time index 00:52.254_510_372

00:41.006_592_808 141.916 FC Port(1:1:1) FC4Cmd Test Unit Ready; LUN = 0x0000; FCP_DL = 0x00000000; 68 0000EF 0000E8 0000 0008 FFFF 0000 Originator; First_Sequence; End_Sequence; Transfer Sequence Initiative; 00000000 1EDFA836 (Correct) EOFt(-) 00:44.191_823_484 3185230.676 FC Port(1:1:1) ABTS ABTS; Basic Link Service; Abort Exchange; 36 0000EF 0000E8 0008 FFFF 0000 Originator; End_Sequence; Transfer Sequence Initiative; EA3CE37F (Correct) EOFt(+) Interesting Event Found 00:44.191_897_724 74.240 FC Port(1:1:1) FC4Cmd Test Unit Ready; LUN = 0x0000; FCP_DL = 0x00000000; 68 0000EF 0000E8 0000 0009 FFFF 0000 Originator; First_Sequence; End_Sequence; Transfer Sequence Initiative; 00000000 D5EC0952 (Correct) EOFt(-) 00:47.416_231_268 3224333.544 FC Port(1:1:1) ABTS ABTS; Basic Link Service; Abort Exchange; 36 0000EF 0000E8 0009 FFFF 0000 Originator; End_Sequence; Transfer Sequence Initiative; 5E3794D9 (Correct) EOFt(-) Interesting Event Found 00:47.416_490_052 258.784 FC Port(1:1:1) FC4Cmd Inquiry; LUN = 0x0000; FCP_DL = 0x00000090; 68 0000EF 0000E8 0000 000A FFFF 0000 Originator; First_Sequence; End_Sequence; Transfer Sequence Initiative; 00000000 ECA729E4 (Correct) EOFt(-) 00:48.222_328_208 805838.156 FC Port(1:1:1) ABTS ABTS; Basic Link Service; Abort Exchange; 36 0000EF 0000E8 0008 FFFF 0000 Originator; End_Sequence; Transfer Sequence Initiative; EA3CE37F (Correct) EOFt(+) Interesting Event Found 00:50.640_639_416 2418311.208 FC Port(1:1:1) ABTS ABTS; Basic Link Service; Abort Exchange; 36 0000EF 0000E8 000A FFFF 0000 Originator; End_Sequence; Transfer Sequence Initiative; C32D7CE8 (Correct) EOFt(+) Interesting Event Found 00:50.650_054_092 9414.676 FC Port(1:1:1) FC4Cmd Inquiry; LUN = 0x0000; FCP_DL = 0x00000090; 68 0000EF 0000E8 0000 000B FFFF 0000 Originator; First_Sequence; End_Sequence; Transfer Sequence Initiative; 00000000 27948880 (Correct) EOFt(-) 00:51.446_734_260 796680.168 FC Port(1:1:1) ABTS ABTS; Basic Link Service; Abort Exchange; 36 0000EF 0000E8 0009 FFFF 0000 Originator; End_Sequence; Transfer Sequence Initiative; 5E3794D9 (Correct) EOFt(-) Interesting Event Found 00:52.254_510_372 807776.112 FC Port(1:1:1) ExtLinkReq ExtLinkReq; EX_LNK_SRV; LOGO; 52 0000EF 0000E8 000C FFFF 0000 Originator; First_Sequence; End_Sequence; Transfer Sequence Initiative; 00000000 755F5787 (Correct) EOFt(-) Interesting Event Found



On 10/03/2014 05:04 PM, Nicholas A. Bellinger wrote:
Hi Craig,

A quick update on this bug.  Comments below.

On Wed, 2014-09-17 at 17:34 -0700, Nicholas A. Bellinger wrote:
Hi Craig,

Thanks for the detailed bug report.  Comments are inline below.

Also, adding some QLogic folks on the CC'..

On Fri, 2014-09-05 at 19:53 -0400, Craig Watson wrote:
Hi Nicholas,

One of our systems is seeing a kernel BUG with a fibre channel target
system.  The system is an Intel i7 with a qle2562 FC HBA running FC20
with kernel 3.14.4-200.  (We have seen this with several older kernel
versions too).

The LIO component versions are:
      targetcli    2.1.fb35-1
      rtslib       2.1.fb47-1
      configshell  1.1.fb13-1

The bug/crash is repeatable.  It happens every time the LIO target
system is attached to the initiator.  The target is connected to an
older VxWorks initiator system.  This system has been in service for
years and works with several models of FC disk drives.  The new
LIO-based system is an attempt to replace aging hardware with newer,
more reliable, and higher performance storage.

<nod>, thanks for the extra background on your setup.

Here is the console output of the kernel BUG:

[ 2669.734186] ------------[ cut here ]------------
[ 2669.738893] kernel BUG at drivers/scsi/qla2xxx/qla_target.c:1343!
[ 2669.745080] invalid opcode: 0000 [#1] SMP
[ 2669.749275] Modules linked in: tcm_qla2xxx target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
cfg80211 rfkill snd_hda_codec_realtek snd_hda_codec_generic
snd_hda_intel snd_hda_codec serio_raw snd_hwdep snd_pcm snd_timer
x86_pkg_temp_thermal snd e1000e ptp iTCO_wdt mei_me iTCO_vendor_support
mei lpc_ich mfd_core raid0 i2c_i801 shpchp pps_core soundcore coretemp
kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel
ghash_clmulni_intel microcode usb_storage ata_generic pata_acpi i915
pata_jmicron qla2xxx i2c_algo_bit drm_kms_helper drm scsi_transport_fc
scsi_tgt i2c_core video
[ 2669.805324] CPU: 1 PID: 558 Comm: kworker/1:0 Not tainted 3.14.4-200.fc20.x86_64 #1
[ 2669.813052] Hardware name: Intel Corporation ChiefRiver Platform/PCOM-B219VG, BIOS R1.00.W3 09/24/2012
[ 2669.822485] Workqueue: qla_tgt_wq qlt_do_work [qla2xxx]
[ 2669.827810] task: ffff8800d28f1c80 ti: ffff8800d3eea000 task.ti: ffff8800d3eea000
[ 2669.835399] RIP: 0010:[<ffffffffa00edfe4>] [<ffffffffa00edfe4>] qlt_pci_map_calc_cnt+0x164/0x1a0 [qla2xxx]
[ 2669.845308] RSP: 0018:ffff8800d3eebba8  EFLAGS: 00010246
[ 2669.850712] RAX: ffff880117d3b400 RBX: ffff880036b9d6f0 RCX: ffffffffa010a19a
[ 2669.857957] RDX: 0000000000000000 RSI: ffff880036b9d6f0 RDI: ffff8800d3eebc00
[ 2669.865190] RBP: ffff8800d3eebbe0 R08: 0000000000000000 R09: 0000000000000020
[ 2669.872418] R10: 0000000000000002 R11: 0000000000000000 R12: ffff8800d3eebc00
[ 2669.879646] R13: ffff8800d3eebc00 R14: ffff8800d1e30000 R15: ffff880117d3
[ 2669.886897] FS:  0000000000000000(0000) GS:ffff88011e240000(0000) knlGS:0000000000000000
[ 2669.895102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2669.900937] CR2: 00007f5efb75b000 CR3: 0000000001c0c000 CR4: 00000000001407e0
[ 2669.908144] Stack:
[ 2669.910187]  000000006acfe608 0000000100000000 ffff880036b9d6f0 ffff880117d3b400
[ 2669.917797]  ffff8800d3eebc00 ffff8800d1e30000 ffff880117d3b72c ffff8800d3eebc78
[ 2669.925387]  ffffffffa00ef933 0000000000000000 0000000000000000 ffff880117d3b400
[ 2669.932969] Call Trace:
[ 2669.935455]  [<ffffffffa00ef933>] qlt_rdy_to_xfer+0x93/0x210 [qla2xxx]
[ 2669.942070]  [<ffffffffa03c267c>] tcm_qla2xxx_write_pending+0x5c/0x60 [tcm_qla2xxx]
[ 2669.949855]  [<ffffffffa04ff677>] transport_generic_new_cmd+0x127/0x1a0 [target_core_mod]
[ 2669.958184]  [<ffffffffa04ff72b>] transport_handle_cdb_direct+0x3b/0x90 [target_core_mod]
[ 2669.966510]  [<ffffffffa04ff8c2>] target_submit_cmd_map_sgls+0x142/0x280 [target_core_mod]
[ 2669.974917]  [<ffffffffa04ffa59>] target_submit_cmd+0x59/0x60 [target_core_mod]
[ 2669.982336]  [<ffffffffa03c25ea>] tcm_qla2xxx_handle_cmd+0x5a/0x90 [tcm_qla2xxx]
[ 2669.989847]  [<ffffffffa00f0624>] qlt_do_work+0x2e4/0x460 [qla2xxx]
[ 2669.996231]  [<ffffffff810a6886>] process_one_work+0x176/0x430
[ 2670.002163]  [<ffffffff810a74db>] worker_thread+0x11b/0x3a0
[ 2670.007809]  [<ffffffff810a73c0>] ? rescuer_thread+0x370/0x370
[ 2670.013710]  [<ffffffff810ae211>] kthread+0xe1/0x100
[ 2670.018763]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.024943]  [<ffffffff816ff4bc>] ret_from_fork+0x7c/0xb0
[ 2670.031673]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.039056] Code: 0f 1f 80 00 00 00 00 45 8b 4c 24 24 eb be 66 0f 1f 84 00
                       00 00 00 00 48 c7 45 c8 00 00 00 00 4c 8b 3d 01 a6 b2 e1
                       e9 18 ff ff ff <0f> 0b 48 8b b0 f8 02 00 00 44 8b 88 d0
                       02 00 00 45 31 c0 31 c0
[ 2670.062304] RIP  [<ffffffffa00edfe4>] qlt_pci_map_calc_cnt+0x164/0x1a0 [qla2xxx]
[ 2670.071041]  RSP <ffff8800d3eebba8>
[ 2670.201729] ---[ end trace ec1db6cbc3980106 ]---
[ 2670.207583] BUG: unable to handle kernel paging request at ffffffffffffffd8
[ 2670.215756] IP: [<ffffffff810ae800>] kthread_data+0x10/0x20
[ 2670.222563] PGD 1c0f067 PUD 1c11067 PMD 0
[ 2670.227968] Oops: 0000 [#2] SMP
[ 2670.232458] Modules linked in: tcm_qla2xxx target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
cfg80211
rfkill snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
snd_hda_codec
serio_raw snd_hwdep snd_pcm snd_timer x86_pkg_temp_thermal snd e1000e ptp
iTCO_wdt mei_me iTCO_vendor_support mei lpc_ich mfd_core raid0 i2c_i801
shpchp
pps_core soundcore coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul
crc32c_intel ghash_clmulni_intel microcode usb_storage ata_generic pata_acpi
i915 pata_jmicron qla2xxx i2c_algo_bit drm_kms_helper drm scsi_transport_fc
scsi_tgt i2c_core video
[ 2670.293550] CPU: 1 PID: 558 Comm: kworker/1:0 Tainted: G D 3.14.4-200.fc20.x86_64 #1
[ 2670.303354] Hardware name: Intel Corporation ChiefRiver Platform/PCOM-B219VG,BIOS R1.00.W3 09/24/2012
[ 2670.313751] task: ffff8800d28f1c80 ti: ffff8800d3eea000 task.ti: ffff8800d3eea000
[ 2670.322257] RIP: 0010:[<ffffffff810ae800>] [<ffffffff810ae800>] kthread_data+0x10/0x20
[ 2670.331302] RSP: 0018:ffff8800d3eeb880  EFLAGS: 00010002
[ 2670.337864] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00000000000000f
[ 2670.346306] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800d28f1c80
[ 2670.354446] RBP: ffff8800d3eeb880 R08: ffff8800d28f1d10 R09: 00000001800f000c
[ 2670.362579] R10: ffffffff8108760c R11: ffffea00034fc200 R12: ffff88011e254640
[ 2670.370972] R13: 0000000000000001 R14: ffff8800d28f1c70 R15: ffff8800d28f1c80
[ 2670.379385] FS:  0000000000000000(0000) GS:ffff88011e240000(0000) knlGS:0000000000000000
[ 2670.388826] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2670.395561] CR2: 0000000000000028 CR3: 0000000001c0c000 CR4: 00000000001407e0
[ 2670.403681] Stack:
[ 2670.406688]  ffff8800d3eeb898 ffffffff810a7b91 ffff8800d28f1c80 ffff8800d3eeb8f8
[ 2670.415307]  ffffffff816f2b89 ffff8800d28f1c80 ffff8800d3eebfd8 0000000000014640
[ 2670.424064]  0000000000014640 ffff8800d28f1c80 ffff8800d28f2318 ffff8800d3eeb5d0
[ 2670.432560] Call Trace:
[ 2670.436195]  [<ffffffff810a7b91>] wq_worker_sleeping+0x11/0x90
[ 2670.443065]  [<ffffffff816f2b89>] __schedule+0x4a9/0x740
[ 2670.449512]  [<ffffffff816f2e49>] schedule+0x29/0x70
[ 2670.455693]  [<ffffffff8108cb47>] do_exit+0x6b7/0xa30
[ 2670.461709]  [<ffffffff816f758c>] oops_end+0x9c/0xe0
[ 2670.467649]  [<ffffffff81017fcb>] die+0x4b/0x70
[ 2670.473359]  [<ffffffff816f6e20>] do_trap+0x60/0x170
[ 2670.479445]  [<ffffffff810150aa>] do_invalid_op+0xaa/0xe0
[ 2670.485816]  [<ffffffffa00edfe4>] ? qlt_pci_map_calc_cnt+0x164/0x1a0 [qla2xxx]
[ 2670.494264]  [<ffffffff8117e148>] ? __alloc_pages_nodemask+0x188/0xb60
[ 2670.501765]  [<ffffffff81700c9e>] invalid_op+0x1e/0x30
[ 2670.508219]  [<ffffffffa00edfe4>] ? qlt_pci_map_calc_cnt+0x164/0x1a0 [qla2xxx]
[ 2670.516726]  [<ffffffffa00ef933>] qlt_rdy_to_xfer+0x93/0x210 [qla2xxx]
[ 2670.524462]  [<ffffffffa03c267c>] tcm_qla2xxx_write_pending+0x5c/0x60 [tcm_qla2xxx]
[ 2670.533387]  [<ffffffffa04ff677>] transport_generic_new_cmd+0x127/0x1a0 [target_core_mod]
[ 2670.542841]  [<ffffffffa04ff72b>] transport_handle_cdb_direct+0x3b/0x90 [target_core_mod]
[ 2670.552307]  [<ffffffffa04ff8c2>] target_submit_cmd_map_sgls+0x142/0x280 [target_core_mod]
[ 2670.561778]  [<ffffffffa04ffa59>] target_submit_cmd+0x59/0x60 [target_core_mod]
[ 2670.570364]  [<ffffffffa03c25ea>] tcm_qla2xxx_handle_cmd+0x5a/0x90 [tcm_qla2xxx]
[ 2670.578795]  [<ffffffffa00f0624>] qlt_do_work+0x2e4/0x460 [qla2xxx]
[ 2670.586343]  [<ffffffff810a6886>] process_one_work+0x176/0x430
[ 2670.593454]  [<ffffffff810a74db>] worker_thread+0x11b/0x3a0
[ 2670.600058]  [<ffffffff810a73c0>] ? rescuer_thread+0x370/0x370
[ 2670.606910]  [<ffffffff810ae211>] kthread+0xe1/0x100
[ 2670.612885]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.620058]  [<ffffffff816ff4bc>] ret_from_fork+0x7c/0xb0
[ 2670.626578]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.633697] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 28 04 00 00 55 48 89 e5 <48> 8b
40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[ 2670.656112] RIP  [<ffffffff810ae800>] kthread_data+0x10/0x20
[ 2670.662623]  RSP <ffff8800d3eeb880>
[ 2670.666936] CR2: ffffffffffffffd8
[ 2670.671170] ---[ end trace ec1db6cbc3980107 ]---
[ 2670.676804] Fixing recursive fault but reboot is needed!
[ 2691.628643] ------------[ cut here ]------------


Looking at protocol traces on a Fibre Channel Protocol Analyzer, a Report
LUNs command requesting 0x1008 bytes of data (FCP_DL=0x1008) appears to
trigger the crash.

It is a rare target that has 0x1008 (5004) bytes of LUN data to return.
What should happen is the actual data available should be returned and a
residual under flag be set in the response along with the residual under
field being set to the number of bytes not returned.  This system
actually is occasionally attached to a large storage array and
presumably that is why the large buffer is allocated for LUNs responses.

AFAICT, the 'what should happen' is the current behavior in target core.

That is, the REPORT_LUNS response sets the underflow bit and associated
residual length in target_complete_cmd_with_length() when the outgoing
REPORT_LUNS payload is smaller than the received SCSI EDTL.

Since many of the Qla2xxx operations seem to pad (with zeros) the
responses out to whatever the requested data length was, maybe this is
overrunning a 4k buffer somewhere?  Note, the responses should not be
padding and returning more data than what is actually available but this
is what we are seeing on the analyzer.  The data length on the request
is specifying a maximum to return, not how much has to be returned.

Ok, based on the comments below, I think this is a separate issue that
is specific to qla2xxx fabric, but is not the cause of the above
BUG_ON().

So, this appears to be a combination of 2 problems.  1) Returns are
padding with zeros to the size of the request rather than truncating to
the size of actual data on hand. 2) Apparently a 4k buffer is being
overrun and crashing the system when a 0x1008 (4k + 8) Request for LUNs
command is received (this is speculation on my part).

So based upon the above trace, the BUG_ON() being triggered is in the
WRITE code-path and doesn't appear related to a specific REPORT_LUNS
payload size.

Namely, there is a check in qlt_pci_map_calc_cnt() that ensures that a
non zero scatterlist count exists when a given command's scatterlists
are mapped into PCI address space for receiving the incoming WRITE
payload.

I'm thinking this could very well be some form of zero-length WRITE that
is occurring immediately after the REPORT_LUNs.

Is anything else showing up on the trace after the REPORT_LUNs..?

Note the target system only has 1 LUN and should never come close to
returning 4k of data.  What we actually see are 2 response PDUs with 2k
(LEX=0x0800) of data.  This is mostly 0's, only the first 16-24 bytes
has information.  These are followed by a response PDU with 8 bytes (of
zeros).  The system then crashes hard.

We have traces of the initiator system connected to a Hitachi FC disk
drive where the same Report LUNs command (w/FCP_DL=0x1008) receives a
single response with 16 bytes of data (Len=0x0010).  The drive functions
fine after this.

Looking at another initiator system and our LIO target, we see Requests
for LUNs asking for approximately 2k of data and the LIO-based target
system responds without crashing but does return approximately 2k of
data (mostly zeros).

Please let me know if I can provide any more information to facilitate
getting this fixed.  Access to the systems is extremely limited though.

Thanks.  I need to verify zero-length WRITEs work as expected first, and
then will take a look at why REPORT_LUNS is returning the extra payload
with qla2xxx.

OK, I've been able to verify with v3.14.13 + qla2xxx export that
zero-length WRITEs and residual handling for REPORT_LUNS with a 5004
byte transfer length are working as expected.

# sg_raw /dev/sg9 2a 00 00 00 00 00 00 00 00 00
SCSI Status: Good

# sg_luns -m 5004 -v /dev/sdj
     report luns cdb: a0 00 00 00 00 00 00 00 13 8c 00 00
     report luns: pass-through requested 5004 bytes but got 72 bytes
Lun list length = 8 which imples 1 lun entry
Report luns [select_report=0x0]:
     0000000000000000

Looking at the v3.14.y change-log, Roland's patch below to fix the
residual handling went in for v3.14.9, and should address the zero
padding issue you observed with the FC analyzer.

target: Report correct response length for some commands
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.14.y&id=9a7f545e75cef29dbdc4c585f3ae27c5f7f4ea48

It's still unclear if this is what triggered the BUG_ON() in
qlt_pci_map_calc_cnt() you originally reported, but at this point it
would be a good idea to retest with >= v3.14.13, preferably with extra
debug code in place to dump se_cmd if the same BUG_ON() triggers again.

Thanks,

--nab


--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux