On Tue, Jun 02, 2020 at 05:11:31PM +0200, Benjamin Drung wrote:
> Hi,
>
> after a kernel upgrade to version 4.19 (in-house built with Mellanox
> OFED drivers), some of our systems fail to bring up their IPoIB devices
> on boot. Different HCAs are affected (e.g. MT4099 and MT26428). We are
> using rdma-core on Debian and have IPoIB devices (like `ib0.dddd`)
> configured in `/etc/network/interfaces`. Big clusters seem to be more
> affected than smaller ones. In case of the failure, we see this kernel
> message:
>
> ```
> ib0.dddd: P_Key 0xdddd is not found
> ```

I think this means you are missing some IPoIB bug fixes?

This warning means ipoib was started before the subnet manager had
programmed the pkey table (i.e. it is a race).

The way it is supposed to work is for IPoIB to create the interface
anyhow in the down state, wait for the SM to program the pkey, and
then move to the up state.

> Pinging other hosts will fail then with:
>
> ```
> ping: sendmsg: Network is unreachable
> ```

This suggests ipoib is stuck down, so it missed the pkey change event.

> changing the order in this configuration file to load `ib_umad` before
> `ib_ipoib`, the servers come up correctly.

This is probably just adding enough delay that the SM has set up the
pkey table before ipoib starts...

Jason
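
To make the race described above concrete, here is a minimal sketch of how a boot script could wait until the P_Key shows up in the port's pkey table before bringing up the child interface. It assumes the usual sysfs layout, `/sys/class/infiniband/<device>/ports/<port>/pkeys/<index>`, with one hex value per file. The function names, the timeout values and the device name `mlx4_0` in the usage note are hypothetical and not from this thread. It also ignores the membership bit when comparing, which is a simplification. This only papers over the ordering problem from userspace; it does not replace the IPoIB fixes.

```python
import time
from pathlib import Path


def pkey_present(pkeys_dir, pkey):
    """Return True if `pkey` appears in the pkey table directory.

    Assumption: each file in `pkeys_dir` holds one table entry as a
    hex string such as "0xffff", and unused entries read as 0x0000.
    """
    want = pkey & 0x7FFF  # compare without the membership bit (simplification)
    try:
        entries = list(Path(pkeys_dir).iterdir())
    except OSError:
        return False  # directory missing, e.g. the HCA driver is not loaded yet
    for entry in entries:
        try:
            value = int(entry.read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # skip unreadable or malformed entries
        if value != 0 and (value & 0x7FFF) == want:
            return True
    return False


def wait_for_pkey(pkeys_dir, pkey, timeout=60.0, interval=1.0):
    """Poll until `pkey` is in the table or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if pkey_present(pkeys_dir, pkey):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller might run something like `wait_for_pkey("/sys/class/infiniband/mlx4_0/ports/1/pkeys", 0xdddd)` before `ifup ib0.dddd`, substituting the real device and port.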