async+rdma configure problems

Hi cephers:

I have run into some problems configuring Ceph with RDMA.
The environment is:
3 physical servers, each with one IB card
one IB switch
Ceph version: luminous 12.2.5

My configuration steps are as follows:
1. Set the IB port mode to ib.
2. Set up /etc/sysconfig/network-scripts/ifcfg-ib0 and run
"systemctl restart network".
3. Modify /etc/ceph/ceph.conf as follows:
                      ms_type = async+rdma
                      ms_public_type = async+rdma
                      ms_cluster_type = async+rdma
                      ms_async_rdma_polling_us = 0
                      ms_async_rdma_device_name = mlx4_0
                      ms_async_rdma_send_buffers = 1024
                      ms_async_rdma_receive_buffers = 1024
                      ms_async_rdma_local_gid = fe80:0000:0000:0000:e41d:2d03:000f:9281
4. Start ceph-mon.
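
One way to rule out a mistyped GID in step 3 is to compare the configured value with the GIDs the kernel exposes under /sys/class/infiniband. The following is only a minimal Python sketch and not part of Ceph: the helper names normalize_gid, local_gids and gid_is_local are made up for illustration, and port 1 is an assumption about this setup.

```python
import glob
import ipaddress


def normalize_gid(gid):
    """Return a GID in full eight-group, lowercase form."""
    return ipaddress.IPv6Address(gid.strip()).exploded


def local_gids(device="mlx4_0", port=1):
    """List the non-zero GIDs sysfs reports for one port (port 1 is assumed)."""
    pattern = f"/sys/class/infiniband/{device}/ports/{port}/gids/*"
    gids = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                raw = f.read().strip()
        except OSError:
            # Some GID table entries cannot be read; skip them.
            continue
        # Skip empty entries and all-zero placeholders.
        if raw and set(raw) - set("0:"):
            gids.append(normalize_gid(raw))
    return gids


def gid_is_local(configured, device="mlx4_0", port=1):
    """True if the configured GID appears in the port's GID table."""
    return normalize_gid(configured) in local_gids(device, port)


if __name__ == "__main__":
    # Value copied from the ceph.conf fragment above.
    print(gid_is_local("fe80:0000:0000:0000:e41d:2d03:000f:9281"))
```

Run on each of the three servers with that server's own ms_async_rdma_local_gid value; a result of False would suggest the configured GID does not belong to the local port.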

The problem is that "ceph -s" hangs and returns no output. So I
enabled the debug options:
                 debug_mon = 50, debug_ms = 50
and got this error:
RDMAStack handle_tx_event QP: 588 len: 0 , addr:0x7ff02c01bb30 RETRY_EXC_ERR
RDMAStack handle_tx_event connection between server and client not working. Disconnect this now

It seems that the client got a wrong reply message and returned an error.
Can anyone give some advice?

Much appreciated.

The full fault message is as follows:
2018-08-09 15:37:13.973739 7ff040fb2700 20  RDMAConnectedSocketImpl send QP: 588
2018-08-09 15:37:13.973741 7ff040fb2700 20  RDMAConnectedSocketImpl
submit we need 9 bytes. iov size: 1
2018-08-09 15:37:13.973744 7ff040fb2700 30 RDMAStack get_reged_mem
need 9 bytes, reserve 131072 registered  bytes, inflight 0
2018-08-09 15:37:13.973747 7ff040fb2700 20  RDMAConnectedSocketImpl
submit left bytes: 0 in buffers 0 tx chunks 1
2018-08-09 15:37:13.973749 7ff040fb2700 20  RDMAConnectedSocketImpl
post_work_request QP: 588 0x7ff02c01bb30
2018-08-09 15:37:13.973751 7ff040fb2700 25  RDMAConnectedSocketImpl
post_work_request sending buffer: 0x7ff02c01bb30 length: 9
2018-08-09 15:37:13.973759 7ff040fb2700 20  RDMAConnectedSocketImpl
post_work_request qp state is IBV_QPS_RTS
2018-08-09 15:37:13.973813 7ff040fb2700 20  RDMAConnectedSocketImpl
submit finished sending 9 bytes.
2018-08-09 15:37:13.973815 7ff040fb2700 10 -- - >> 192.168.4.62:6789/0
conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0
l=1)._try_send sent bytes 9 remaining bytes 0
2018-08-09 15:37:13.973820 7ff040fb2700 20 Event(0x7ff03c0e2ec0
nevent=5000 time_id=1).create_file_event create event started fd=17
mask=2 original mask is 1
2018-08-09 15:37:13.973821 7ff040fb2700 20 EpollDriver.add_event add
event fd=17 cur_mask=1 add_mask=2 to 6
2018-08-09 15:37:13.973828 7ff040fb2700 20 Event(0x7ff03c0e2ec0
nevent=5000 time_id=1).create_file_event create event end fd=17 mask=2
original mask is 3
2018-08-09 15:37:13.973832 7ff040fb2700 10 -- - >> 192.168.4.62:6789/0
conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY
pgs=0 cs=0 l=1)._process_connection connect write banner done:
192.168.4.62:6789/0
2018-08-09 15:37:13.973839 7ff040fb2700 20 -- - >> 192.168.4.62:6789/0
conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY
pgs=0 cs=0 l=1).process prev state is STATE_CONNECTING_RE
2018-08-09 15:37:13.973843 7ff040fb2700 25 -- - >> 192.168.4.62:6789/0
conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY
pgs=0 cs=0 l=1).read_until len is 281 state_offset is 0
2018-08-09 15:37:13.973848 7ff040fb2700 20  RDMAConnectedSocketImpl
read notify_fd : 1 in 588 r = 8
2018-08-09 15:37:13.973850 7ff040fb2700 25 -- - >> 192.168.4.62:6789/0
conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY
pgs=0 cs=0 l=1).read_until read_bulk recv_end is 0 left is 281 got 0
2018-08-09 15:37:13.973854 7ff040fb2700 25 -- - >> 192.168.4.62:6789/0
conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY
pgs=0 cs=0 l=1).read_until need len 281 remaining 281 bytes
2018-08-09 15:37:13.973858 7ff040fb2700 30 Event(0x7ff03c0e2ec0
nevent=5000 time_id=1).process_events event_wq process is 17 mask is 1
2018-08-09 15:37:13.973860 7ff040fb2700 30 stack operator() calling
event process
2018-08-09 15:37:13.973861 7ff040fb2700 30 Event(0x7ff03c0e2ec0
nevent=5000 time_id=1).process_events wait second 30 usec 0
2018-08-09 15:37:13.973863 7ff040fb2700 10 -- - >> 192.168.4.62:6789/0
conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY
pgs=0 cs=0 l=1).handle_write
2018-08-09 15:37:13.973866 7ff040fb2700 10 -- - >> 192.168.4.62:6789/0
conn(0x7ff03c150490 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY
pgs=0 cs=0 l=1)._try_send sent bytes 0 remaining bytes 0
2018-08-09 15:37:13.973869 7ff040fb2700 30 Event(0x7ff03c0e2ec0
nevent=5000 time_id=1).process_events event_wq process is 17 mask is 2
2018-08-09 15:37:13.973871 7ff040fb2700 30 stack operator() calling
event process
2018-08-09 15:37:13.973872 7ff040fb2700 30 Event(0x7ff03c0e2ec0
nevent=5000 time_id=1).process_events wait second 30 usec 0
2018-08-09 15:37:18.107161 7ff038dd4700 20 RDMAStack polling got tx cq event.
2018-08-09 15:37:18.107177 7ff038dd4700 20 RDMAStack polling tx
completion queue got 1 responses.
2018-08-09 15:37:18.107189 7ff038dd4700 25 RDMAStack handle_tx_event
QP: 588 len: 0 , addr:0x7ff02c01bb30 RETRY_EXC_ERR
2018-08-09 15:37:18.107194 7ff038dd4700  1 RDMAStack handle_tx_event
connection between server and client not working. Disconnect this now
2018-08-09 15:37:18.107198 7ff038dd4700 25 RDMAStack handle_tx_event
qp state is : IBV_QPS_ERR
2018-08-09 15:37:18.107402 7ff038dd4700  1  RDMAConnectedSocketImpl
fault tcp fd 18
2018-08-09 15:37:18.107398 7ff040fb2700 20  RDMAConnectedSocketImpl
handle_connection QP: 588 tcp_fd: 18 notify_fd: 17
2018-08-09 15:37:18.107410 7ff038dd4700 30 RDMAStack post_tx_buffer
release 1 chunks, inflight 0
2018-08-09 15:37:18.107414 7ff038dd4700 30 RDMAStack handle_async_event
2018-08-09 15:37:18.107417 7ff040fb2700 10 Infiniband recv_msg got
disconnect message
2018-08-09 15:37:18.107418 7ff038dd4700 10 RDMAStack
handle_async_event event associated qp=0x7ff02c0091a0 evt: last WQE
reached
2018-08-09 15:37:18.107421 7ff038dd4700  1 RDMAStack
handle_async_event it's not forwardly stopped by us,
reenable=0x7ff02c008ea0
2018-08-09 15:37:18.107420 7ff040fb2700 20  RDMAConnectedSocketImpl
handle_connection peer msg :  < 589, 14207609, 56, 0>
2018-08-09 15:37:18.107424 7ff038dd4700  1  RDMAConnectedSocketImpl
fault tcp fd 18
2018-08-09 15:37:18.107429 7ff038dd4700 20 Infiniband rearm_notify started.
2018-08-09 15:37:18.107431 7ff038dd4700 20 Infiniband rearm_notify started.
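
To see how long the 9-byte send stayed outstanding before it failed, the log above can be processed mechanically. The following is a rough Python sketch, assuming the log format shown here (including the mail client's line wrapping); join_records and send_to_error_delay are hypothetical helper names, not Ceph tools.

```python
import re
from datetime import datetime

# A record starts with a timestamp such as "2018-08-09 15:37:13.973751 ".
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) ")
SENT = re.compile(r"post_work_request sending buffer: (0x[0-9a-f]+)")


def join_records(lines):
    """Re-join wrapped lines so each timestamped record is one string."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if STAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line.strip()
    return records


def send_to_error_delay(lines, error="RETRY_EXC_ERR"):
    """Return (qp, seconds) pairs between posting a buffer and its failure."""
    failed_re = re.compile(
        r"handle_tx_event QP: (\d+) .*addr:(0x[0-9a-f]+) " + re.escape(error)
    )
    posted = {}  # buffer address -> time the work request was posted
    delays = []
    for rec in join_records(lines):
        m = STAMP.match(rec)
        if not m:
            continue
        when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        sent = SENT.search(rec)
        if sent:
            posted[sent.group(1)] = when
        failed = failed_re.search(rec)
        if failed and failed.group(2) in posted:
            delta = when - posted[failed.group(2)]
            delays.append((int(failed.group(1)), delta.total_seconds()))
    return delays
```

Applied to the log above, this pairs the post_work_request at 15:37:13.973751 with the RETRY_EXC_ERR completion at 15:37:18.107189 for buffer 0x7ff02c01bb30 on QP 588, a gap of roughly 4.13 seconds.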

Thank you for your help.