Fail to establish RoCE connectivity after restarting network service

Hello,

We found a RoCE connectivity problem in the following environment:
* two servers with ConnectX-4 NICs (model: MCX414A-BCAT)
* both servers are connected to one SX6036 switch with flow control enabled

Linux distributions, drivers and kernels tested:
* CentOS 7.4, CentOS 7.9 and Ubuntu 22.04
* OFED 4.9, 5.4 and 5.6
* distribution default kernels: 3.10.0-693 for CentOS 7.4, 3.10.0-1160 for CentOS 7.9, 5.15 for Ubuntu 22.04
* vanilla kernels: 5.10 and 5.18 from the Linux kernel archive

We use rping to test RoCE connectivity between the servers. Initially the servers can establish RoCE connections (rping works in both directions). However, after we ifdown/ifup the interfaces, restart the network service, or reboot the two servers, connectivity between them may become abnormal. Sometimes the active side gets stuck in "rdma_connect" and no CM event is ever generated. At other times the connection is established, but the sender fails to send messages to the receiver (error 12: IBV_WC_RETRY_EXC_ERR).

If we repeat ifdown/ifup on the affected interface, or restart the network service, for several rounds, connectivity between the two servers eventually returns to normal. We repeated this test on all the Linux distributions, OFED drivers and kernel versions listed above, and the problem is reproducible on every setup. TCP/IP connections always work as expected.

We are not sure whether this is a bug or a configuration problem. Is there a way to troubleshoot it? Any suggestions are appreciated.
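
For anyone who wants to reproduce this, the test procedure can be sketched roughly as the loop below. This is only an illustration, not the exact script we ran: the interface name and peer address passed in are placeholders, the helper names are made up for this sketch, and it assumes an rping server (rping -s) is already listening on the peer for each round. It also assumes that rping exits with a non-zero status when a send fails, and it treats a timeout as the "stuck in rdma_connect" case.

```python
import subprocess
import time


def bounce_cmds(iface):
    """Commands that take the interface down and bring it back up."""
    return [["ifdown", iface], ["ifup", iface]]


def rping_client_cmd(server_ip, count=5):
    """rping client invocation: -c client, -a address, -C ping count, -v verbose."""
    return ["rping", "-c", "-a", server_ip, "-C", str(count), "-v"]


def classify(returncode, timed_out):
    """Map one rping run to 'ok', 'error' (e.g. a failed send) or 'hang'."""
    if timed_out:
        return "hang"
    return "ok" if returncode == 0 else "error"


def run_round(iface, server_ip, timeout_s=30, settle_s=5):
    """Bounce the interface once, then try a single rping run against the peer."""
    for cmd in bounce_cmds(iface):
        subprocess.run(cmd, check=True)
    # Give the link some time to come back up (the value is a guess).
    time.sleep(settle_s)
    try:
        proc = subprocess.run(
            rping_client_cmd(server_ip),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # No result within the timeout: treated as the hang case.
        return classify(None, True)
    return classify(proc.returncode, False)
```

Calling run_round() in a loop with the real interface name and peer address, and counting the "ok", "error" and "hang" results, gives a rough idea of how often each failure mode shows up after an interface restart.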

Thanks,
Meng Wang


