Re: qla2xxx crashing kernel

Craig Watson <craig.watson@xxxxxxxxxxxxxxxxxxx> · Fri, 19 Dec 2014 08:41:26 -0500

Hi,

I wanted to send out a big thank you to all who worked on this. 
Yesterday we got access to the customer system again with an update.  
Roland Dreier's patches for correct response length and zero-length 
command handling in a 3.17.6 kernel work with no crashing of the 
kernel!  This is a problem that has plagued us connecting to an old, 
irreplaceable VxWorks-based system for over a year.  I wanted to let you 
all know this is resolved and again say thank you for all your hard work 
here.

Craig Watson

On 10/21/2014 07:04 PM, Nicholas A. Bellinger wrote:
On Tue, 2014-10-14 at 14:07 -0700, Roland Dreier wrote:
The problem I am faced with is something is causing the kernel to crash and
there aren't any zero-length commands after the Report LUNs command.
You say there are no zero-length commands, but...

Here's the details from the protocol analyzer of the crash with a few
comments thrown in.  Note the only exchange after the large Report LUNs
command does not appear to be a zero-length command.
...

After this exchange, the LIO target unit no longer responds to the initiator
as shown by the trace lines below, because the Linux kernel has crashed.

Eventually the Initiator logs off the link (LOGO) at time index
00:52.254_510_372

        00:41.006_592_808    141.916        FC Port(1:1:1) FC4Cmd
Test Unit Ready; LUN = 0x0000; FCP_DL = 0x00000000;            68    0000EF
0000E8    0000    0008 FFFF    0000    Originator; First_Sequence;
End_Sequence; Transfer Sequence Initiative;     00000000            1EDFA836
(Correct)    EOFt(-)
        00:44.191_823_484 3185230.676    FC Port(1:1:1) ABTS
ABTS; Basic Link Service; Abort Exchange;                    36    0000EF
0000E8        0008    FFFF 0000    Originator; End_Sequence; Transfer
Sequence Initiative;                 EA3CE37F    (Correct)    EOFt(+)
Here the initiator clearly *does* send a zero-length command
(TEST UNIT READY), and the target does not respond.  The logical
conclusion is that *this* is the command that crashes the target,
since the target handled the previous INQUIRY command just fine, and
we're not responding to any commands after a panic.

And indeed, I have a theory about what is happening: I suspect the
initiator is sending a TEST UNIT READY command with the "WRDATA" bit
set, so that the target core treats it as a command with a data-out
phase (transferring data from the initiator), but with a data length
of 0.  Unfortunately I don't see any indication of the value of WRDATA
in your trace.  Looking at the target code, I think it does not handle
this correctly, even in the latest upstream code.  I have a local
patch that I thought was upstream that I'll send shortly.
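
To make the suspected case concrete, here is a rough Python sketch of a
command that has the WRDATA flag set but FCP_DL = 0.  It is not the LIO or
qla2xxx code and not the patch mentioned above.  The helper names
(build_fcp_cmnd, parse_fcp_cmnd, data_direction) are made up for
illustration, and the 32-byte layout (flags in byte 11, WRDATA = 0x01,
RDDATA = 0x02, FCP_DL in the last four bytes) is assumed from the FCP_CMND
format with a 16-byte CDB.

```python
import struct

# Assumed FCP_CMND flag bits (byte 11 of the payload).
FCP_WRDATA = 0x01  # initiator intends to send data (data-out)
FCP_RDDATA = 0x02  # initiator expects to receive data (data-in)

# Assumed layout: LUN(8) CRN(1) attr(1) TM flags(1) flags(1) CDB(16) FCP_DL(4)
FCP_CMND_FMT = ">8sBBBB16sI"


def build_fcp_cmnd(cdb: bytes, flags: int, fcp_dl: int) -> bytes:
    """Hypothetical helper: pack a minimal FCP_CMND payload for LUN 0."""
    return struct.pack(FCP_CMND_FMT, bytes(8), 0, 0, 0, flags, cdb, fcp_dl)


def parse_fcp_cmnd(payload: bytes):
    """Hypothetical helper: return (flags, cdb, fcp_dl) from a payload."""
    if len(payload) < struct.calcsize(FCP_CMND_FMT):
        raise ValueError("FCP_CMND payload too short")
    _lun, _crn, _attr, _tm, flags, cdb, fcp_dl = struct.unpack(
        FCP_CMND_FMT, payload[:32])
    return flags, cdb, fcp_dl


def data_direction(flags: int, fcp_dl: int) -> str:
    """Classify the data phase, checking the length before the flags.

    A zero FCP_DL means there is nothing to transfer, so the command is
    treated as having no data phase even when WRDATA or RDDATA is set.
    """
    if fcp_dl == 0:
        return "none"
    if flags & FCP_WRDATA:
        return "from_initiator"
    if flags & FCP_RDDATA:
        return "to_initiator"
    return "none"


# TEST UNIT READY (opcode 0x00, 6-byte CDB) with WRDATA set and FCP_DL = 0.
tur = build_fcp_cmnd(bytes(6), FCP_WRDATA, 0)
flags, cdb, fcp_dl = parse_fcp_cmnd(tur)
print(hex(flags), fcp_dl, data_direction(flags, fcp_dl))
```

A classifier that looked only at the flags would call this command a write
and wait for data that never arrives.  Checking the length first is one
simple way to avoid that; whether the real fix takes the same shape is not
something this sketch can show.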

AFAICT, this theory makes perfect sense based upon the analyzer output.

--nab

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html