Hi cephers,

I have run into some problems while configuring Ceph with RDMA. The environment is three physical servers, each with one IB card, connected through one IB switch, running Ceph Luminous 12.2.5.

My configuration steps are as follows:

1. Set the IB port mode to "ib".
2. Set up /etc/sysconfig/network-scripts/ifcfg-ib0 and run "systemctl restart network".
3. Modify /etc/ceph/ceph.conf as follows:

     ms_type = async+rdma
     ms_public_type = async+rdma
     ms_cluster_type = async+rdma
     ms_async_rdma_polling_us = 0
     ms_async_rdma_device_name = mlx4_0
     ms_async_rdma_send_buffers = 1024
     ms_async_rdma_receive_buffers = 1024
     ms_async_rdma_local_gid = fe80:0000:0000:0000:e41d:2d03:000f:9281

4. Start ceph-mon.

The problem is that when I run "ceph -s" there is no reply at all. So I turned on debugging (debug_mon = 50, debug_ms = 50) and got this error:

RDMAStack handle_tx_event QP: 588 len: 0 , addr:0x7ff02c01bb30 RETRY_EXC_ERR
RDMAStack handle_tx_event connection between server and client not working. Disconnect this now

It seems that the client's send gets an error completion instead of the expected reply, and the connection is then torn down. Can anyone give some advice? Much appreciated.
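One thing that may be worth double-checking: ms_async_rdma_local_gid has to be the GID of the local adapter, so the correct value is different on every host and has to match what the card actually reports. Below is a minimal libibverbs sketch (a standalone helper written only for illustration, not part of Ceph; it assumes a single-port HCA, i.e. port 1) that prints the GID table of each device in the colon-separated form used in ceph.conf, so the values can be compared on each node:

/*
 * gid_dump.c - print the GID table of every local RDMA device so the
 * values can be compared with ms_async_rdma_local_gid on each node.
 * Standalone illustration, not part of Ceph.
 *
 * Build: gcc -std=gnu99 -o gid_dump gid_dump.c -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);

    if (devs == NULL || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (ctx == NULL)
            continue;

        struct ibv_port_attr port_attr;
        /* Assumes a single-port HCA; verbs port numbering starts at 1. */
        if (ibv_query_port(ctx, 1, &port_attr) == 0) {
            printf("%s port 1, %d GID entries:\n",
                   ibv_get_device_name(devs[i]), port_attr.gid_tbl_len);

            for (int idx = 0; idx < port_attr.gid_tbl_len; idx++) {
                union ibv_gid gid;

                if (ibv_query_gid(ctx, 1, idx, &gid) != 0)
                    continue;

                /* Skip empty (all-zero) table entries. */
                int nonzero = 0;
                for (int b = 0; b < 16; b++)
                    nonzero |= gid.raw[b];
                if (!nonzero)
                    continue;

                /* Print in the colon-separated form used in ceph.conf. */
                printf("  gid[%d] = ", idx);
                for (int b = 0; b < 16; b += 2)
                    printf("%02x%02x%c", gid.raw[b], gid.raw[b + 1],
                           b < 14 ? ':' : '\n');
            }
        }
        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}

On a plain InfiniBand link the entry at index 0 is normally the fe80:...-style GID derived from the port GUID, which is the kind of value shown above; the same information can also be read from /sys/class/infiniband/mlx4_0/ports/1/gids/0.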
The full fault message is as follows:

2018-08-09 15:37:13.973739 7ff040fb2700 20 RDMAConnectedSocketImpl send QP: 588
2018-08-09 15:37:13.973741 7ff040fb2700 20 RDMAConnectedSocketImpl submit we need 9 bytes. iov size: 1
2018-08-09 15:37:13.973744 7ff040fb2700 30 RDMAStack get_reged_mem need 9 bytes, reserve 131072 registered bytes, inflight 0
2018-08-09 15:37:13.973747 7ff040fb2700 20 RDMAConnectedSocketImpl submit left bytes: 0 in buffers 0 tx chunks 1
2018-08-09 15:37:13.973749 7ff040fb2700 20 RDMAConnectedSocketImpl post_work_request QP: 588 0x7ff02c01bb30
2018-08-09 15:37:13.973751 7ff040fb2700 25 RDMAConnectedSocketImpl post_work_request sending buffer: 0x7ff02c01bb30 length: 9
2018-08-09 15:37:13.973759 7ff040fb2700 20 RDMAConnectedSocketImpl post_work_request qp state is IBV_QPS_RTS
2018-08-09 15:37:13.973813 7ff040fb2700 20 RDMAConnectedSocketImpl submit finished sending 9 bytes.
2018-08-09 15:37:13.973815 7ff040fb2700 10 -- - >> 192.168.4.62:6789/0 conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=1)._try_send sent bytes 9 remaining bytes 0
2018-08-09 15:37:13.973820 7ff040fb2700 20 Event(0x7ff03c0e2ec0 nevent=5000 time_id=1).create_file_event create event started fd=17 mask=2 original mask is 1
2018-08-09 15:37:13.973821 7ff040fb2700 20 EpollDriver.add_event add event fd=17 cur_mask=1 add_mask=2 to 6
2018-08-09 15:37:13.973828 7ff040fb2700 20 Event(0x7ff03c0e2ec0 nevent=5000 time_id=1).create_file_event create event end fd=17 mask=2 original mask is 3
2018-08-09 15:37:13.973832 7ff040fb2700 10 -- - >> 192.168.4.62:6789/0 conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect write banner done: 192.168.4.62:6789/0
2018-08-09 15:37:13.973839 7ff040fb2700 20 -- - >> 192.168.4.62:6789/0 conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).process prev state is STATE_CONNECTING_RE
2018-08-09 15:37:13.973843 7ff040fb2700 25 -- - >> 192.168.4.62:6789/0 conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).read_until len is 281 state_offset is 0
2018-08-09 15:37:13.973848 7ff040fb2700 20 RDMAConnectedSocketImpl read notify_fd : 1 in 588 r = 8
2018-08-09 15:37:13.973850 7ff040fb2700 25 -- - >> 192.168.4.62:6789/0 conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).read_until read_bulk recv_end is 0 left is 281 got 0
2018-08-09 15:37:13.973854 7ff040fb2700 25 -- - >> 192.168.4.62:6789/0 conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).read_until need len 281 remaining 281 bytes
2018-08-09 15:37:13.973858 7ff040fb2700 30 Event(0x7ff03c0e2ec0 nevent=5000 time_id=1).process_events event_wq process is 17 mask is 1
2018-08-09 15:37:13.973860 7ff040fb2700 30 stack operator() calling event process
2018-08-09 15:37:13.973861 7ff040fb2700 30 Event(0x7ff03c0e2ec0 nevent=5000 time_id=1).process_events wait second 30 usec 0
2018-08-09 15:37:13.973863 7ff040fb2700 10 -- - >> 192.168.4.62:6789/0 conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).handle_write
2018-08-09 15:37:13.973866 7ff040fb2700 10 -- - >> 192.168.4.62:6789/0 conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._try_send sent bytes 0 remaining bytes 0
2018-08-09 15:37:13.973869 7ff040fb2700 30 Event(0x7ff03c0e2ec0 nevent=5000 time_id=1).process_events event_wq process is 17 mask is 2
2018-08-09 15:37:13.973871 7ff040fb2700 30 stack operator() calling event process
2018-08-09 15:37:13.973872 7ff040fb2700 30 Event(0x7ff03c0e2ec0 nevent=5000 time_id=1).process_events wait second 30 usec 0
2018-08-09 15:37:18.107161 7ff038dd4700 20 RDMAStack polling got tx cq event.
2018-08-09 15:37:18.107177 7ff038dd4700 20 RDMAStack polling tx completion queue got 1 responses.
2018-08-09 15:37:18.107189 7ff038dd4700 25 RDMAStack handle_tx_event QP: 588 len: 0 , addr:0x7ff02c01bb30 RETRY_EXC_ERR
2018-08-09 15:37:18.107194 7ff038dd4700 1 RDMAStack handle_tx_event connection between server and client not working. Disconnect this now
2018-08-09 15:37:18.107198 7ff038dd4700 25 RDMAStack handle_tx_event qp state is : IBV_QPS_ERR
2018-08-09 15:37:18.107402 7ff038dd4700 1 RDMAConnectedSocketImpl fault tcp fd 18
2018-08-09 15:37:18.107398 7ff040fb2700 20 RDMAConnectedSocketImpl handle_connection QP: 588 tcp_fd: 18 notify_fd: 17
2018-08-09 15:37:18.107410 7ff038dd4700 30 RDMAStack post_tx_buffer release 1 chunks, inflight 0
2018-08-09 15:37:18.107414 7ff038dd4700 30 RDMAStack handle_async_event
2018-08-09 15:37:18.107417 7ff040fb2700 10 Infiniband recv_msg got disconnect message
2018-08-09 15:37:18.107418 7ff038dd4700 10 RDMAStack handle_async_event event associated qp=0x7ff02c0091a0 evt: last WQE reached
2018-08-09 15:37:18.107421 7ff038dd4700 1 RDMAStack handle_async_event it's not forwardly stopped by us, reenable=0x7ff02c008ea0
2018-08-09 15:37:18.107420 7ff040fb2700 20 RDMAConnectedSocketImpl handle_connection peer msg : < 589, 14207609, 56, 0>
2018-08-09 15:37:18.107424 7ff038dd4700 1 RDMAConnectedSocketImpl fault tcp fd 18
2018-08-09 15:37:18.107429 7ff038dd4700 20 Infiniband rearm_notify started.
2018-08-09 15:37:18.107431 7ff038dd4700 20 Infiniband rearm_notify started.

Thank you for your help.