Re: CRASH 3.18-rc2, 3.17.1, isert_connect_request

On 03.11.2014 at 12:50, Adam Mazur wrote:
On 03.11.2014 at 12:27, Sagi Grimberg wrote:
On 11/3/2014 12:28 PM, Adam Mazur wrote:
Can someone help us with these crashes? We are not able to reproduce them
on demand, but a crash appears within 30 minutes to a few hours.
We've seen it on kernels 3.17.1 and 3.18-rc2.


Hey Adam,

CC'ing target-devel mailing list (where iser target is maintained).

So I stepped on this issue as well, and I actually have a fix for it
in the pipe. I'm planning to test it with a few other fixes for a little
while longer before I submit the code.

In general, this crash occurs due to a race between tpg shutdown (or
np disable) and RDMA_CM connect requests happening in parallel. The iser
target tries to reference a tpg attribute while np->tpg_np is
actually NULL.
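
To illustrate the failure mode described above, here is a minimal
user-space sketch of a connect-request path that checks for a torn-down
tpg before reading an attribute from it. It is not the pending fix and not
the kernel code: the model_* structs, the pi_enabled field and
model_connect_request() are all hypothetical names made up for this
example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the target structures (not kernel types). */
struct model_tpg {
    int pi_enabled;                 /* stands in for "a tpg attribute" */
};

struct model_tpg_np {
    struct model_tpg *tpg;
};

struct model_np {
    struct model_tpg_np *tpg_np;    /* NULL once the tpg is shut down */
};

/*
 * Models a connect-request handler that needs a tpg attribute.
 * It reads np->tpg_np once into a local and rejects the request when
 * the portal is no longer attached, instead of dereferencing NULL.
 */
static int model_connect_request(const struct model_np *np, int *pi_out)
{
    const struct model_tpg_np *tpg_np = np->tpg_np;  /* single read */

    if (tpg_np == NULL || tpg_np->tpg == NULL)
        return -ENODEV;             /* tpg shutdown or np disable won */

    *pi_out = tpg_np->tpg->pi_enabled;
    return 0;
}
```

The sketch is single-threaded and only shows the check itself. In a real
concurrent path the pointer read would also have to be serialized against
tpg shutdown (for example with a lock or a reference count), otherwise the
check can still race with the teardown.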

How many targets/initiators/portals did you use? HCA?

Hi Sagi,

There are about 300 targets (lvm volumes), 4 initiators, two portals.

HCA by lspci:
05:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
         Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
         Flags: bus master, fast devsel, latency 0, IRQ 46
         Memory at df500000 (64-bit, non-prefetchable) [size=1M]
         Memory at de800000 (64-bit, prefetchable) [size=8M]
         Capabilities: [40] Power Management version 2
         Capabilities: [48] Vital Product Data
         Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
         Capabilities: [84] MSI-X: Enable+ Count=32 Masked-
         Capabilities: [60] Express Endpoint, MSI 00
         Kernel driver in use: ib_mthca


root@portal-1:~# mstflint -d 05:00.0 q
Image type:      Failsafe
FW Version:      1.2.0
I.S. Version:    1
Device ID:       25204
Chip Revision:   A0
Description:     Node             Port1            Sys image
GUIDs:           0005ad00000c75c8 0005ad00000c75c9 0005ad00000c75cb
Board ID:         (MT_0260000002)
VSD:             
PSID:            MT_0260000002


root@portal-2:~# mstflint -d 05:00.0 q
Image type:      Failsafe
I.S. Version:    1
Chip Revision:   A0
Description:     Node             Port1            Sys image
GUIDs:           0005ad00000c7010 0005ad00000c7011 0005ad00000c7013
Board ID:         (MT_0260000002)
VSD:             
PSID:            MT_0260000002


Would it be possible to send you some patches to test as well?

Absolutely, we can immediately test any patch on any kernel version.

Thanks
Adam


The race appears to be triggered by a flood of login attempts (effectively
a login DDoS) from initiators that are not PI aware. Our initiators were
running kernels from 3.2 to 3.17. Since we upgraded all of them to kernels
newer than 3.15, new targets seem to be stable. However, this shows that
the race is still lurking somewhere, as you pointed out. Thank you for the
feedback. Later we will try to prepare a test case that might expose the
crash.

Best,
Adam
