Re: Race condition between / wrong load order of ib_umad and ib_ipoib

On Tue, Jun 02, 2020 at 05:11:31PM +0200, Benjamin Drung wrote:
> Hi,
> 
> after a kernel upgrade to version 4.19 (in-house built with Mellanox
> OFED drivers), some of our systems fail to bring up their IPoIB devices
> on boot. Different HCAs are affected (e.g. MT4099 and MT26428). We are
> using rdma-core on Debian and have IPoIB devices (like `ib0.dddd`)
> configured in `/etc/network/interfaces`. Big clusters seem to be more
> affected than smaller ones. In case of the failure, we see this kernel
> message:
> 
> ```
> ib0.dddd: P_Key 0xdddd is not found
> ```

I think this means you are missing some IPoIB bug fixes?

This warning means ipoib was started before the subnet manager had
programmed the pkey table (i.e. it is a race).

The way it is supposed to work is for IPoIB to create the interface
anyhow in the down state and wait for the SM to program the pkey, then
move to the up state.
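
To check from userspace whether the SM has programmed the pkey yet, one
option is to poll the port's P_Key table in sysfs. This is only a rough
sketch, not what the ipoib driver does internally: the device name
`mlx4_0`, the port number `1`, and the helper names `pkey_present` and
`wait_for_pkey` are assumptions for illustration. It compares the low
15 bits, ignoring the membership bit.

```python
import time
from pathlib import Path


def pkey_present(pkey, device="mlx4_0", port=1,
                 sysfs_root="/sys/class/infiniband"):
    """Return True if pkey is in the port's P_Key table.

    device and port are assumed values; adjust them for the HCA in use.
    The top (membership) bit is ignored in the comparison.
    """
    pkey_dir = Path(sysfs_root) / device / "ports" / str(port) / "pkeys"
    if not pkey_dir.is_dir():
        return False
    wanted = pkey & 0x7FFF
    for entry in pkey_dir.iterdir():
        try:
            # Each file holds one table entry as hex text, e.g. "0xffff".
            value = int(entry.read_text().strip(), 16)
        except (OSError, ValueError):
            continue
        # Entries that read as 0x0000 are unused table slots.
        if value != 0 and (value & 0x7FFF) == wanted:
            return True
    return False


def wait_for_pkey(pkey, timeout=60.0, interval=1.0, **kwargs):
    """Poll until pkey shows up in the table or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if pkey_present(pkey, **kwargs):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_pkey(0xdddd)` before bringing up `ib0.dddd` would show
whether the table is still empty at that point in boot.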

> Pinging other hosts will fail then with:
> 
> ```
> ping: sendmsg: Network is unreachable
> ```

This suggests ipoib is stuck down, so it missed the pkey change
event.
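
A quick way to confirm that state is to read the interface's operstate
from sysfs and compare it with whether the pkey is in the table by now.
This is a heuristic sketch only; `read_operstate` and `looks_stuck` are
hypothetical helper names, and the caller has to supply the pkey table
result itself.

```python
from pathlib import Path


def read_operstate(iface, sysfs_root="/sys/class/net"):
    """Return the interface operstate (e.g. "up", "down"), or None if absent."""
    try:
        return (Path(sysfs_root) / iface / "operstate").read_text().strip()
    except OSError:
        return None


def looks_stuck(iface, pkey_in_table, sysfs_root="/sys/class/net"):
    """Heuristic: the pkey is programmed but the interface is still not up."""
    state = read_operstate(iface, sysfs_root=sysfs_root)
    return bool(pkey_in_table) and state is not None and state != "up"
```

If `looks_stuck("ib0.dddd", True)` returns True well after boot, that is
consistent with the missed-event theory, though it does not prove it.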

> changing the order in this configuration file to load `ib_umad` before
> `ib_ipoib`, the servers come up correctly.

This is probably just adding enough delay that the SM has set up the
pkey table before starting ipoib...

Jason 


