Re: qla2xxx crashing kernel

Craig Watson <craig.watson@xxxxxxxxxxxxxxxxxxx> · Mon, 13 Oct 2014 00:37:39 -0400

Hi Nicholas,

I'm sorry it has taken me so long to respond and follow up on this.  I 
recently had shoulder surgery and have been under the weather.  Doing 
better now.

Unfortunately I don't have direct access to the crashing system. It has 
taken almost a year just to get the protocol traces and console dump I 
have received.

The problem I am faced with is something is causing the kernel to crash 
and there aren't any zero-length commands after the Report LUNs 
command.  Comparing 3 or 4 systems, the only anomaly we were able to 
identify in the crashing system was the large (and incorrect) response 
to the Report LUNs command.  Other systems receiving less data (smaller 
requesting FCP_DL) in the response worked without problems.  This is the 
only discernible difference in the traffic between a system that works 
and one that crashes. Everyone is telling me the BUG is caused by a 
zero-length command but I don't see one in the protocol trace.  Since my 
quick and uninformed perusal of the code turned up a 4096 byte buffer, I 
theorized that a buffer overrun was possible.  Is it possible that a 
buffer overrun is corrupting the logic and looking like a zero-length 
command? Probably not very likely.

I am not insisting this is a buffer overrun, but since there aren't any 
zero-length commands, I'm just asking for someone who understands the 
code to look at what is going on.  Actually, since the 5004 byte 
response is broken into 3 packets that do get returned successfully, I'm 
not real confident of the buffer overrun theory.  I just can't see a 
more likely scenario.

Note: Roland Dreier's fix (commit 2426bd456a target: Report correct 
response length for some commands) of the padding responses is not in 
the kernel version(s) we have tested. Sometime in the future we'll be 
able to roll out an update and see if it works.  This might actually 
"short-circuit" my theorized buffer overrun and fix my immediate problem.

Here's the details from the protocol analyzer of the crash with a few 
comments thrown in.  Note the only exchange after the large Report LUNs 
command does not appear to be a zero-length command.

At time index = 00:41.005_695_148 Port(1:1:1)/Side A, the Initiator 
sends our device, the LIO target unit, "Report Luns" with FC_DL = 
0x00001008 or 5004 Bytes.

       00:41.005_695_148    315.212        FC Port(1:1:1) FC4Cmd        
        Report LUNs; LUNs with specific address methods; LUN = 0x0000; 
FCP_DL = 0x00001008;     68    0000EF 0000E8    0000    0005    FFFF    
0000    Originator; First_Sequence; End_Sequence; Transfer Sequence 
Initiative; 00000000            8B3E8E84    (Correct)    EOFt(+)

Our device, the LIO target unit, responds by sending the available data 
for it's two LUNS (approx. 24 Bytes) and then fills the balance of the 
requested 5004 Bytes with zeros. The LIO target unit then terminates 
this exchange with "Good Status".

       00:41.005_859_180    164.032        FC Port(1:1:2)     
FC4SData    FC4SData; SCSI FCP; Offset = 0x00000000; Len = 0x0800;    
     2084    0000E8    0000EF        0005    0102 0000    Responder; 
RO;     00000000            58C3EBB3 (Correct)    EOFn(-)
       00:41.005_864_152      4.972        FC Port(1:1:2)     
FC4SData    FC4SData; SCSI FCP; Offset = 0x00000800; Len = 0x0800;    
     2084    0000E8    0000EF        0005    0102 0001    Responder; 
RO;     00000800            557D64B9 (Correct)    EOFn(-)
       00:41.005_869_120      4.968        FC Port(1:1:2)     
FC4SData    FC4SData; SCSI FCP; Offset = 0x00001000; Len = 0x0008;    
       44    0000E8    0000EF        0005    0102 0002    Responder; 
End_Sequence; RO;     00001000 767079BC    (Correct)    EOFt(+)
       00:41.005_869_664      0.544        FC Port(1:1:2)     
FC4Status   Good Status;                 60    0000E8    0000EF        
0005    0102 0000    Responder; Last_Sequence; End_Sequence;     
00000000         021F63DE    (Correct)    EOFt(-)
       00:41.006_107_404    237.740        FC Port(1:1:1) ExtLinkReq    
        ExtLinkReq; EX_LNK_SRV; ADISC;                     64    
0000EF    0000E8        0006    FFFF 0000    Originator; First_Sequence; 
End_Sequence; Transfer Sequence Initiative;     00000000            
1C806BCC (Correct)    EOFt(-)    Interesting Event Found
       00:41.006_214_424    107.020        FC Port(1:1:2)     
ExtLinkRply    ExtLinkRply; EX_LNK_SRV; Accept;                 64    
0000E8    0000EF        0006    0006 0000    Responder; Last_Sequence; 
End_Sequence;     00000000         C2C18DC8    (Correct)    EOFt(+)

At time index = 00:41.006_323_460 Port(1:1:1)/Side A, the Initiator 
sends our LIO target unit "Inquiry" Lun = 0x000 with FCP_DL = 0x00000090.

       00:41.006_323_460    109.036        FC Port(1:1:1) FC4Cmd        
        Inquiry; LUN = 0x0000; FCP_DL = 0x00000090;                    
68    0000EF    0000E8    0000    0007 FFFF    0000    Originator; 
First_Sequence; End_Sequence; Transfer Sequence Initiative;     
00000000            C10AA6D0 (Correct)    EOFt(-)

Our device, the LIO target unit, responds with 0x0090 Bytes of Data. The 
LIO target unit then terminates this exchange with "Good Status".

       00:41.006_450_404    126.944        FC Port(1:1:2)     
FC4SData    FC4SData; SCSI FCP; Offset = 0x00000000; Len = 0x0090;    
      180    0000E8    0000EF        0007    0103 0000    Responder; 
End_Sequence; RO;     00000000 69B3CC1E    (Correct)    EOFt(+)
       00:41.006_450_892      0.488        FC Port(1:1:2)     
FC4Status    Good Status;                60    0000E8    0000EF        
0007    0103 0000    Responder; Last_Sequence; End_Sequence;     
00000000         8417ABA0    (Correct)    EOFt(+)

After this exchange, the LIO target unit no longer responds to the 
initiator as shown by the trace lines below, because the Linux kernel 
has crashed.

Eventually the Initiator logs off the link (LOGO) at time index 
00:52.254_510_372

       00:41.006_592_808    141.916        FC Port(1:1:1) FC4Cmd        
        Test Unit Ready; LUN = 0x0000; FCP_DL = 0x00000000;            
68    0000EF    0000E8    0000    0008 FFFF    0000    Originator; 
First_Sequence; End_Sequence; Transfer Sequence Initiative;     
00000000            1EDFA836 (Correct)    EOFt(-)
       00:44.191_823_484    3185230.676    FC Port(1:1:1) ABTS        
        ABTS; Basic Link Service; Abort Exchange;                    
36    0000EF    0000E8        0008    FFFF 0000    Originator; 
End_Sequence; Transfer Sequence Initiative;                 EA3CE37F    
(Correct)    EOFt(+)    Interesting Event Found
       00:44.191_897_724     74.240        FC Port(1:1:1) FC4Cmd        
        Test Unit Ready; LUN = 0x0000; FCP_DL = 0x00000000;            
68    0000EF    0000E8    0000    0009 FFFF    0000    Originator; 
First_Sequence; End_Sequence; Transfer Sequence Initiative;     
00000000            D5EC0952 (Correct)    EOFt(-)
       00:47.416_231_268    3224333.544    FC Port(1:1:1) ABTS        
        ABTS; Basic Link Service; Abort Exchange;                    
36    0000EF    0000E8        0009    FFFF 0000    Originator; 
End_Sequence; Transfer Sequence Initiative;                 5E3794D9    
(Correct)    EOFt(-)    Interesting Event Found
       00:47.416_490_052    258.784        FC Port(1:1:1) FC4Cmd        
        Inquiry; LUN = 0x0000; FCP_DL = 0x00000090;                    
68    0000EF    0000E8    0000    000A FFFF    0000    Originator; 
First_Sequence; End_Sequence; Transfer Sequence Initiative;     
00000000            ECA729E4 (Correct)    EOFt(-)
       00:48.222_328_208    805838.156     FC Port(1:1:1) ABTS        
        ABTS; Basic Link Service; Abort Exchange;                    
36    0000EF    0000E8        0008    FFFF 0000    Originator; 
End_Sequence; Transfer Sequence Initiative;                 EA3CE37F    
(Correct)    EOFt(+)    Interesting Event Found
       00:50.640_639_416    2418311.208    FC Port(1:1:1) ABTS        
        ABTS; Basic Link Service; Abort Exchange;                    
36    0000EF    0000E8        000A    FFFF 0000    Originator; 
End_Sequence; Transfer Sequence Initiative;                 C32D7CE8    
(Correct)    EOFt(+)    Interesting Event Found
       00:50.650_054_092    9414.676       FC Port(1:1:1) FC4Cmd        
        Inquiry; LUN = 0x0000; FCP_DL = 0x00000090;                    
68    0000EF    0000E8    0000    000B FFFF    0000    Originator; 
First_Sequence; End_Sequence; Transfer Sequence Initiative;     
00000000            27948880 (Correct)    EOFt(-)
       00:51.446_734_260    796680.168     FC Port(1:1:1) ABTS        
        ABTS; Basic Link Service; Abort Exchange;                    
36    0000EF    0000E8        0009    FFFF 0000    Originator; 
End_Sequence; Transfer Sequence Initiative;                 5E3794D9    
(Correct)    EOFt(-)    Interesting Event Found
       00:52.254_510_372    807776.112     FC Port(1:1:1) ExtLinkReq    
        ExtLinkReq; EX_LNK_SRV; LOGO;                      52    
0000EF    0000E8        000C    FFFF 0000    Originator; First_Sequence; 
End_Sequence; Transfer Sequence Initiative;     00000000            
755F5787 (Correct)    EOFt(-)    Interesting Event Found

On 10/03/2014 05:04 PM, Nicholas A. Bellinger wrote:
Hi Craig,

A quick update on this bug.  Comments below.

On Wed, 2014-09-17 at 17:34 -0700, Nicholas A. Bellinger wrote:
Hi Craig,

Thanks for the detailed bug report.  Comments are inline below.

Also, adding some QLogic folks on the CC'..

On Fri, 2014-09-05 at 19:53 -0400, Craig Watson wrote:
Hi Nicholas,

One of our systems is seeing a kernel BUG with a fibre channel target
system.  The system is an Intel i7 with a qle2562 FC HBA running FC20
with kernel 3.14.4-200.  (We have seen this with several older kernel
versions too).

The LIO component versions are:
      targetcli    2.1.fb35-1
      rtslib       2.1.fb47-1
      configshell  1.1.fb13-1

The bug/crash is repeatable.  It happens every time the LIO target
system is attached to the initiator.  The target is connected to an
older VxWorks initiator system.  This system has been in service for
years and works with several models of FC disk drives.  The new
LIO-based system is an attempt to replace aging hardware with newer,
more reliable, and higher performance storage.

<nod>, thanks for the extra background on your setup.

Here is the console output of the kernel BUG:

[ 2669.734186] ------------[ cut here ]------------
[ 2669.738893] kernel BUG at drivers/scsi/qla2xxx/qla_target.c:1343!
[ 2669.745080] invalid opcode: 0000 [#1] SMP
[ 2669.749275] Modules linked in: tcm_qla2xxx target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
cfg80211 rfkill snd_hda_codec_realtek snd_hda_codec_generic
snd_hda_intel snd_hda_codec serio_raw snd_hwdep snd_pcm snd_timer
x86_pkg_temp_thermal snd e1000e ptp iTCO_wdt mei_me iTCO_vendor_support
mei lpc_ich mfd_core raid0 i2c_i801 shpchp pps_core soundcore coretemp
kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel
ghash_clmulni_intel microcode usb_storage ata_generic pata_acpi i915
pata_jmicron qla2xxx i2c_algo_bit drm_kms_helper drm scsi_transport_fc
scsi_tgt i2c_core video
[ 2669.805324] CPU: 1 PID: 558 Comm: kworker/1:0 Not tainted 3.14.4-200.fc20.x86_64 #1
[ 2669.813052] Hardware name: Intel Corporation ChiefRiver Platform/PCOM-B219VG, BIOS R1.00.W3 09/24/2012
[ 2669.822485] Workqueue: qla_tgt_wq qlt_do_work [qla2xxx]
[ 2669.827810] task: ffff8800d28f1c80 ti: ffff8800d3eea000 task.ti: ffff8800d3eea000
[ 2669.835399] RIP: 0010:[<ffffffffa00edfe4>] [<ffffffffa00edfe4>] qlt_pci_map_calc_cnt+0x164/0x1a0 [qla2xxx]
[ 2669.845308] RSP: 0018:ffff8800d3eebba8  EFLAGS: 00010246
[ 2669.850712] RAX: ffff880117d3b400 RBX: ffff880036b9d6f0 RCX: ffffffffa010a19a
[ 2669.857957] RDX: 0000000000000000 RSI: ffff880036b9d6f0 RDI: ffff8800d3eebc00
[ 2669.865190] RBP: ffff8800d3eebbe0 R08: 0000000000000000 R09: 0000000000000020
[ 2669.872418] R10: 0000000000000002 R11: 0000000000000000 R12: ffff8800d3eebc00
[ 2669.879646] R13: ffff8800d3eebc00 R14: ffff8800d1e30000 R15: ffff880117d3
[ 2669.886897] FS:  0000000000000000(0000) GS:ffff88011e240000(0000) knlGS:0000000000000000
[ 2669.895102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2669.900937] CR2: 00007f5efb75b000 CR3: 0000000001c0c000 CR4: 00000000001407e0
[ 2669.908144] Stack:
[ 2669.910187]  000000006acfe608 0000000100000000 ffff880036b9d6f0 ffff880117d3b400
[ 2669.917797]  ffff8800d3eebc00 ffff8800d1e30000 ffff880117d3b72c ffff8800d3eebc78
[ 2669.925387]  ffffffffa00ef933 0000000000000000 0000000000000000 ffff880117d3b400
[ 2669.932969] Call Trace:
[ 2669.935455]  [<ffffffffa00ef933>] qlt_rdy_to_xfer+0x93/0x210 [qla2xxx]
[ 2669.942070]  [<ffffffffa03c267c>] tcm_qla2xxx_write_pending+0x5c/0x60 [tcm_qla2xxx]
[ 2669.949855]  [<ffffffffa04ff677>] transport_generic_new_cmd+0x127/0x1a0 [target_core_mod]
[ 2669.958184]  [<ffffffffa04ff72b>] transport_handle_cdb_direct+0x3b/0x90 [target_core_mod]
[ 2669.966510]  [<ffffffffa04ff8c2>] target_submit_cmd_map_sgls+0x142/0x280 [target_core_mod]
[ 2669.974917]  [<ffffffffa04ffa59>] target_submit_cmd+0x59/0x60 [target_core_mod]
[ 2669.982336]  [<ffffffffa03c25ea>] tcm_qla2xxx_handle_cmd+0x5a/0x90 [tcm_qla2xxx]
[ 2669.989847]  [<ffffffffa00f0624>] qlt_do_work+0x2e4/0x460 [qla2xxx]
[ 2669.996231]  [<ffffffff810a6886>] process_one_work+0x176/0x430
[ 2670.002163]  [<ffffffff810a74db>] worker_thread+0x11b/0x3a0
[ 2670.007809]  [<ffffffff810a73c0>] ? rescuer_thread+0x370/0x370
[ 2670.013710]  [<ffffffff810ae211>] kthread+0xe1/0x100
[ 2670.018763]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.024943]  [<ffffffff816ff4bc>] ret_from_fork+0x7c/0xb0
[ 2670.031673]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.039056] Code: 0f 1f 80 00 00 00 00 45 8b 4c 24 24 eb be 66 0f 1f 84 00
                       00 00 00 00 48 c7 45 c8 00 00 00 00 4c 8b 3d 01 a6 b2 e1
                       e9 18 ff ff ff <0f> 0b 48 8b b0 f8 02 00 00 44 8b 88 d0
                       02 00 00 45 31 c0 31 c0
[ 2670.062304] RIP  [<ffffffffa00edfe4>] qlt_pci_map_calc_cnt+0x164/0x1a0 [qla2xxx]
[ 2670.071041]  RSP <ffff8800d3eebba8>
[ 2670.201729] ---[ end trace ec1db6cbc3980106 ]---
[ 2670.207583] BUG: unable to handle kernel paging request at ffffffffffffffd8
[ 2670.215756] IP: [<ffffffff810ae800>] kthread_data+0x10/0x20
[ 2670.222563] PGD 1c0f067 PUD 1c11067 PMD 0
[ 2670.227968] Oops: 0000 [#2] SMP
[ 2670.232458] Modules linked in: tcm_qla2xxx target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
cfg80211
rfkill snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
snd_hda_codec
serio_raw snd_hwdep snd_pcm snd_timer x86_pkg_temp_thermal snd e1000e ptp
iTCO_wdt mei_me iTCO_vendor_support mei lpc_ich mfd_core raid0 i2c_i801
shpchp
pps_core soundcore coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul
crc32c_intel ghash_clmulni_intel microcode usb_storage ata_generic pata_acpi
i915 pata_jmicron qla2xxx i2c_algo_bit drm_kms_helper drm scsi_transport_fc
scsi_tgt i2c_core video
[ 2670.293550] CPU: 1 PID: 558 Comm: kworker/1:0 Tainted: G D 3.14.4-200.fc20.x86_64 #1
[ 2670.303354] Hardware name: Intel Corporation ChiefRiver Platform/PCOM-B219VG,BIOS R1.00.W3 09/24/2012
[ 2670.313751] task: ffff8800d28f1c80 ti: ffff8800d3eea000 task.ti: ffff8800d3eea000
[ 2670.322257] RIP: 0010:[<ffffffff810ae800>] [<ffffffff810ae800>] kthread_data+0x10/0x20
[ 2670.331302] RSP: 0018:ffff8800d3eeb880  EFLAGS: 00010002
[ 2670.337864] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00000000000000f
[ 2670.346306] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800d28f1c80
[ 2670.354446] RBP: ffff8800d3eeb880 R08: ffff8800d28f1d10 R09: 00000001800f000c
[ 2670.362579] R10: ffffffff8108760c R11: ffffea00034fc200 R12: ffff88011e254640
[ 2670.370972] R13: 0000000000000001 R14: ffff8800d28f1c70 R15: ffff8800d28f1c80
[ 2670.379385] FS:  0000000000000000(0000) GS:ffff88011e240000(0000) knlGS:0000000000000000
[ 2670.388826] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2670.395561] CR2: 0000000000000028 CR3: 0000000001c0c000 CR4: 00000000001407e0
[ 2670.403681] Stack:
[ 2670.406688]  ffff8800d3eeb898 ffffffff810a7b91 ffff8800d28f1c80 ffff8800d3eeb8f8
[ 2670.415307]  ffffffff816f2b89 ffff8800d28f1c80 ffff8800d3eebfd8 0000000000014640
[ 2670.424064]  0000000000014640 ffff8800d28f1c80 ffff8800d28f2318 ffff8800d3eeb5d0
[ 2670.432560] Call Trace:
[ 2670.436195]  [<ffffffff810a7b91>] wq_worker_sleeping+0x11/0x90
[ 2670.443065]  [<ffffffff816f2b89>] __schedule+0x4a9/0x740
[ 2670.449512]  [<ffffffff816f2e49>] schedule+0x29/0x70
[ 2670.455693]  [<ffffffff8108cb47>] do_exit+0x6b7/0xa30
[ 2670.461709]  [<ffffffff816f758c>] oops_end+0x9c/0xe0
[ 2670.467649]  [<ffffffff81017fcb>] die+0x4b/0x70
[ 2670.473359]  [<ffffffff816f6e20>] do_trap+0x60/0x170
[ 2670.479445]  [<ffffffff810150aa>] do_invalid_op+0xaa/0xe0
[ 2670.485816]  [<ffffffffa00edfe4>] ? qlt_pci_map_calc_cnt+0x164/0x1a0 [qla2xxx]
[ 2670.494264]  [<ffffffff8117e148>] ? __alloc_pages_nodemask+0x188/0xb60
[ 2670.501765]  [<ffffffff81700c9e>] invalid_op+0x1e/0x30
[ 2670.508219]  [<ffffffffa00edfe4>] ? qlt_pci_map_calc_cnt+0x164/0x1a0 [qla2xxx]
[ 2670.516726]  [<ffffffffa00ef933>] qlt_rdy_to_xfer+0x93/0x210 [qla2xxx]
[ 2670.524462]  [<ffffffffa03c267c>] tcm_qla2xxx_write_pending+0x5c/0x60 [tcm_qla2xxx]
[ 2670.533387]  [<ffffffffa04ff677>] transport_generic_new_cmd+0x127/0x1a0 [target_core_mod]
[ 2670.542841]  [<ffffffffa04ff72b>] transport_handle_cdb_direct+0x3b/0x90 [target_core_mod]
[ 2670.552307]  [<ffffffffa04ff8c2>] target_submit_cmd_map_sgls+0x142/0x280 [target_core_mod]
[ 2670.561778]  [<ffffffffa04ffa59>] target_submit_cmd+0x59/0x60 [target_core_mod]
[ 2670.570364]  [<ffffffffa03c25ea>] tcm_qla2xxx_handle_cmd+0x5a/0x90 [tcm_qla2xxx]
[ 2670.578795]  [<ffffffffa00f0624>] qlt_do_work+0x2e4/0x460 [qla2xxx]
[ 2670.586343]  [<ffffffff810a6886>] process_one_work+0x176/0x430
[ 2670.593454]  [<ffffffff810a74db>] worker_thread+0x11b/0x3a0
[ 2670.600058]  [<ffffffff810a73c0>] ? rescuer_thread+0x370/0x370
[ 2670.606910]  [<ffffffff810ae211>] kthread+0xe1/0x100
[ 2670.612885]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.620058]  [<ffffffff816ff4bc>] ret_from_fork+0x7c/0xb0
[ 2670.626578]  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 2670.633697] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 28 04 00 00 55 48 89 e5 <48> 8b
40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[ 2670.656112] RIP  [<ffffffff810ae800>] kthread_data+0x10/0x20
[ 2670.662623]  RSP <ffff8800d3eeb880>
[ 2670.666936] CR2: ffffffffffffffd8
[ 2670.671170] ---[ end trace ec1db6cbc3980107 ]---
[ 2670.676804] Fixing recursive fault but reboot is needed!
[ 2691.628643] ------------[ cut here ]------------

Looking at protocol traces on a Fibre Channel Protocol Analyzer, a Report
LUNs command requesting 0x1008 bytes of data (FCP_DL=0x1008) appears to
trigger the crash.

It is a rare target that has 0x1008 (5004) bytes of LUN data to return.
What should happen is the actual data available should be returned and a
residual under flag be set in the response along with the residual under
field being set to the number of bytes not returned.  This system
actually is occasionally attached to a large storage array and
presumably that is why the large buffer is allocated for LUNs responses.

AFAICT, the 'what should happen' is the current behavior in target core.

That is, the REPORT_LUNS response sets the underflow bit and associated
residual length in target_complete_cmd_with_length() when the outgoing
REPORT_LUNS payload is smaller than the received SCSI EDTL.

Since many of the Qla2xxx operations seem to pad (with zeros) the
responses out to whatever the requested data length was, maybe this is
overrunning a 4k buffer somewhere?  Note, the responses should not be
padding and returning more data than what is actually available but this
is what we are seeing on the analyzer.  The data length on the request
is specifying a maximum to return, not how much has to be returned.

Ok, based on the comments below, I think this is a separate issue that
is specific to qla2xxx fabric, but is not the cause of the above
BUG_ON().

So, this appears to be a combination of 2 problems.  1) Returns are
padding with zeros to the size of the request rather than truncating to
the size of actual data on hand. 2) Apparently a 4k buffer is being
overrun and crashing the system when a 0x1008 (4k + 8) Request for LUNs
command is received (this is speculation on my part).

So based upon the above trace, the BUG_ON() being triggered is in the
WRITE code-path and doesn't appear related to a specific REPORT_LUNS
payload size.

Namely, there is a check in qlt_pci_map_calc_cnt() that ensures that a
non zero scatterlist count exists when a given command's scatterlists
are mapped into PCI address space for receiving the incoming WRITE
payload.

I'm thinking this could very well be some form of zero-length WRITE that
is occurring immediately after the REPORT_LUNs.

Is anything else showing up on the trace after the REPORT_LUNs..?

Note the target system only has 1 LUN and should never come close to
returning 4k of data.  What we actually see are 2 response PDUs with 2k
(LEX=0x0800) of data.  This is mostly 0's, only the first 16-24 bytes
has information.  These are followed by a response PDU with 8 bytes (of
zeros).  The system then crashes hard.

We have traces of the initiator system connected to a Hitachi FC disk
drive where the same Report LUNs command (w/FCP_DL=0x1008) receives a
single response with 16 bytes of data (Len=0x0010).  The drive functions
fine after this.

Looking at another initiator system and our LIO target, we see Requests
for LUNs asking for approximately 2k of data and the LIO-based target
system responds without crashing but does return approximately 2k of
data (mostly zeros).

Please let me know if I can provide any more information to facilitate
getting this fixed.  Access to the systems is extremely limited though.

Thanks.  I need to verify zero-length WRITEs work as expected first, and
then will take a look at why REPORT_LUNS is returning the extra payload
with qla2xxx.

OK, I've been able to verify with v3.14.13 + qla2xxx export that
zero-length WRITEs and residual handling for REPORT_LUNS with a 5004
byte transfer length are working as expected.

# sg_raw /dev/sg9 2a 00 00 00 00 00 00 00 00 00
SCSI Status: Good

# sg_luns -m 5004 -v /dev/sdj
     report luns cdb: a0 00 00 00 00 00 00 00 13 8c 00 00
     report luns: pass-through requested 5004 bytes but got 72 bytes
Lun list length = 8 which imples 1 lun entry
Report luns [select_report=0x0]:
     0000000000000000

Looking at the v3.14.y change-log, Roland's patch below to fix the
residual handling went in for v3.14.9, and should address the zero
padding issue you observed with the FC analyzer.

target: Report correct response length for some commands
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.14.y&id=9a7f545e75cef29dbdc4c585f3ae27c5f7f4ea48

It's still unclear if this is what triggered the BUG_ON() in
qlt_pci_map_calc_cnt() you originally reported, but at this point it
would be a good idea to retest with >= v3.14.13, preferably with extra
debug code in place to dump se_cmd if the same BUG_ON() triggers again.

Thanks,

--nab

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html