Re: CRASH 3.18-rc2, 3.17.1, isert_connect_request

Sagi Grimberg <sagig@xxxxxxxxxxxxxxxxxx> · Tue, 04 Nov 2014 18:44:57 +0200

On 11/4/2014 10:50 AM, Adam Mazur wrote:
On 03.11.2014 at 12:50, Adam Mazur wrote:
On 03.11.2014 at 12:27, Sagi Grimberg wrote:
On 11/3/2014 12:28 PM, Adam Mazur wrote:
Can someone help us with these crashes? We are not able to reproduce them
on demand; a crash takes anywhere from 30 minutes to a few hours to appear.
We've seen it on kernels 3.17.1 and 3.18-rc2.

Hi Adam,

CC'ing target-devel mailing list (where iser target is maintained).

So I hit this issue as well, and I actually have a fix for it
in the pipeline. I'm planning to test it with a few other fixes for a little
while longer before I submit the code.

In general, this crash occurs due to a race between tpg shutdown (or
np disable) and RDMA_CM connect requests happening in parallel. The iser
target tries to dereference a tpg attribute while np->tpg_np is
actually NULL.
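
A minimal userspace sketch of the kind of race described above, and of one
common way to guard against it, might look like the following. It is not the
iser target code and not the pending fix: the struct names, the pi_enabled
field, the lock, and both functions are hypothetical stand-ins, and a pthread
mutex is assumed here in place of whatever synchronization the kernel code
would actually use.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real kernel structures. */
struct sketch_tpg_np {
    int pi_enabled;               /* stands in for the tpg attribute being read */
};

struct sketch_np {
    pthread_mutex_t lock;         /* assumed lock serializing both paths */
    struct sketch_tpg_np *tpg_np; /* NULL once the tpg has been shut down */
};

/* Shutdown side (tpg shutdown / np disable): detach the tpg under the lock. */
static void sketch_np_disable(struct sketch_np *np)
{
    assert(np != NULL);
    pthread_mutex_lock(&np->lock);
    np->tpg_np = NULL;
    pthread_mutex_unlock(&np->lock);
}

/*
 * Connect side: read the attribute only if the tpg is still attached.
 * Returns 0 and stores the attribute in *pi_out, or -ENODEV if the tpg
 * is already gone, instead of dereferencing a NULL pointer.
 */
static int sketch_connect_request(struct sketch_np *np, int *pi_out)
{
    int ret = -ENODEV;

    assert(np != NULL && pi_out != NULL);
    pthread_mutex_lock(&np->lock);
    if (np->tpg_np != NULL) {
        *pi_out = np->tpg_np->pi_enabled;
        ret = 0;
    }
    pthread_mutex_unlock(&np->lock);
    return ret;
}
```

Without the NULL check and the shared lock, a connect request arriving just
after the shutdown path has cleared the pointer would dereference NULL, which
is the failure mode described above.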

How many targets/initiators/portals did you use? HCA?

Hi Sagi,

There are about 300 targets (lvm volumes), 4 initiators, two portals.

HCA by lspci:
05:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
HCA] (rev 20)
         Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
         Flags: bus master, fast devsel, latency 0, IRQ 46
         Memory at df500000 (64-bit, non-prefetchable) [size=1M]
         Memory at de800000 (64-bit, prefetchable) [size=8M]
         Capabilities: [40] Power Management version 2
         Capabilities: [48] Vital Product Data
         Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
         Capabilities: [84] MSI-X: Enable+ Count=32 Masked-
         Capabilities: [60] Express Endpoint, MSI 00
         Kernel driver in use: ib_mthca

root@portal-1:~# mstflint -d 05:00.0 q
Image type:      Failsafe
FW Version:      1.2.0
I.S. Version:    1
Device ID:       25204
Chip Revision:   A0
Description:     Node             Port1            Sys image
GUIDs:           0005ad00000c75c8 0005ad00000c75c9 0005ad00000c75cb
Board ID:         (MT_0260000002)
VSD:             
PSID:            MT_0260000002

root@portal-2:~# mstflint -d 05:00.0 q
Image type:      Failsafe
I.S. Version:    1
Chip Revision:   A0
Description:     Node             Port1            Sys image
GUIDs:           0005ad00000c7010 0005ad00000c7011 0005ad00000c7013
Board ID:         (MT_0260000002)
VSD:             
PSID:            MT_0260000002

Would it be possible to send you some patches to test as well?

Absolutely, we can immediately test any patch on any kernel version.

Thanks
Adam

The race is supposedly triggered by a login DDoS from initiators that are
not PI aware - our initiators were running kernels from 3.2 to 3.17.

This bug has nothing to do with the initiators or their awareness of PI.
The race itself is related to PI though.

Once we upgraded them all to kernels > 3.15, the new targets seem to be
stable. However, it shows that the race is still lurking somewhere, as you
have pointed out.

Yea, the race is still there.

I have some patches under testing; they need cleaning up before they go to
the mailing list...

Thank you for the feedback. Later we will try to prepare a
test case that might expose the crash.

I think a full target stack unload while lots of initiators are
connected should trigger this race...

Sagi.