I would be interested in learning what performance increase it gives compared to 10Gbit. I have a ConnectX-3 Pro, but I am not using RDMA because support is not available by default.

sockperf ping-pong -i 192.168.2.13 -p 5001 -m 16384 -t 10 --pps=max
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.100 sec; SentMessages=81205; ReceivedMessages=81204
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=10.000 sec; SentMessages=80411; ReceivedMessages=80411
sockperf: ====> avg-lat= 61.638 (std-dev=7.525)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 61.638 usec
sockperf: Total 80411 observations; each percentile contains 804.11 observations
sockperf: ---> <MAX> observation = 1207.678
sockperf: ---> percentile 99.99 = 119.027
sockperf: ---> percentile 99.90 = 82.075
sockperf: ---> percentile 99.50 = 76.133
sockperf: ---> percentile 99.00 = 75.013
sockperf: ---> percentile 95.00 = 70.831
sockperf: ---> percentile 90.00 = 68.471
sockperf: ---> percentile 75.00 = 65.594
sockperf: ---> percentile 50.00 = 61.626
sockperf: ---> percentile 25.00 = 59.406
sockperf: ---> <MIN> observation = 40.527

[@c01 sbin]# sockperf ping-pong -i 192.168.10.112 -p 5001 -t 10
sockperf: == version #2.6 ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
[ 0] IP = 192.168.10.112 PORT = 5001 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.100 sec; SentMessages=431009; ReceivedMessages=431008
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=10.000 sec; SentMessages=426779; ReceivedMessages=426779
sockperf: ====> avg-lat= 11.660 (std-dev=1.102)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 11.660 usec
sockperf: Total 426779 observations; each percentile contains 4267.79 observations
sockperf: ---> <MAX> observation = 272.374
sockperf: ---> percentile 99.99 = 37.709
sockperf: ---> percentile 99.90 = 20.410
sockperf: ---> percentile 99.50 = 17.167
sockperf: ---> percentile 99.00 = 15.751
sockperf: ---> percentile 95.00 = 12.853
sockperf: ---> percentile 90.00 = 12.317
sockperf: ---> percentile 75.00 = 11.884
sockperf: ---> percentile 50.00 = 11.452
sockperf: ---> percentile 25.00 = 11.188
sockperf: ---> <MIN> observation = 8.995
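For anyone wanting to reproduce these ping-pong numbers: the runs above need a plain sockperf server listening on the other end. A minimal sketch (the IP and port simply mirror the client commands above; sockperf defaults to UDP, matching the "# UDP" in the output):

  # on the receiving host, e.g. 192.168.2.13 for the first run above
  sockperf server -i 192.168.2.13 -p 5001

On the client side, -m sets the message size in bytes, -t the test duration in seconds, and --pps=max sends at the maximum rate.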
-----Original Message-----
From: Michael Green [mailto:green@xxxxxxxxxxxxx]
Sent: 19 December 2018 21:00
To: Roman Penyaev; Mohamad Gebai
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: RDMA/RoCE enablement failed with (113) No route to host

Thanks for the insights, Mohamad and Roman. Interesting read.

My interest in RDMA is purely from a testing perspective. Still, I would be interested if somebody who has RDMA enabled and running could share their ceph.conf. My RDMA-related entries are taken from the Mellanox blog here: https://community.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide. They used Luminous and built it from source; I'm running a binary distribution of Mimic here.

ms_type = async+rdma
ms_cluster = async+rdma
ms_async_rdma_device_name = mlx5_0
ms_async_rdma_polling_us = 0
ms_async_rdma_local_gid = <node's_gid>

Or, if somebody with knowledge of the code could tell me when this "RDMAConnectedSocketImpl" error is printed, that might also be helpful.

2018-12-19 21:45:32.757 7f52b8548140 0 mon.rio@-1(probing).osd e25981 crush map has features 288514051259236352, adjusting msgr requires
2018-12-19 21:45:32.757 7f52b8548140 0 mon.rio@-1(probing).osd e25981 crush map has features 288514051259236352, adjusting msgr requires
2018-12-19 21:45:32.757 7f52b8548140 0 mon.rio@-1(probing).osd e25981 crush map has features 1009089991638532096, adjusting msgr requires
2018-12-19 21:45:32.757 7f52b8548140 0 mon.rio@-1(probing).osd e25981 crush map has features 288514051259236352, adjusting msgr requires
2018-12-19 21:45:33.138 7f52b8548140 0 mon.rio@-1(probing) e5 my rank is now 0 (was -1)
2018-12-19 21:45:33.141 7f529f3fe700 -1 RDMAConnectedSocketImpl activate failed to transition to RTR state: (113) No route to host
2018-12-19 21:45:33.142 7f529f3fe700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.2/rpm/el7/BUILD/ceph-13.2.2/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: In function 'void RDMAConnectedSocketImpl::handle_connection()' thread 7f529f3fe700 time 2018-12-19 21:45:33.141972
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.2/rpm/el7/BUILD/ceph-13.2.2/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: 224: FAILED assert(!r)

--
Michael Green

On Dec 19, 2018, at 5:21 AM, Roman Penyaev <rpenyaev@xxxxxxx> wrote:

Well, I have been playing with the Ceph RDMA implementation for quite a while and it has unsolved problems, thus I would say the status is "not completely broken", but "you can run it at your own risk and smile":

1. On disconnect of a previously active (high write load) connection there is a race that can lead to an osd (or any receiver) crash: https://github.com/ceph/ceph/pull/25447

2. Recent QLogic hardware (qedr drivers) does not support IBV_EVENT_QP_LAST_WQE_REACHED, which is used in the Ceph RDMA implementation; the pull request from point 1 also targets this incompatibility.

3. Under high write load and many connections there is a chance that an osd can run out of receive WRs, in which case the RDMA connection (QP) on the sender side gets IBV_WC_RETRY_EXC_ERR and is disconnected. This is a fundamental design problem, which has to be fixed at the protocol level (e.g. by propagating backpressure to senders).

4. Unfortunately, neither RDMA nor any other zero-latency network can bring significant value, because the bottleneck is not the network. Please consider this for further reading regarding transport performance in Ceph: https://www.spinics.net/lists/ceph-devel/msg43555.html

The problems described above have quite a big impact on overall transport performance.

--
Roman

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
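A note on the ms_async_rdma_device_name and ms_async_rdma_local_gid entries discussed in the thread above: these values can usually be looked up with the standard RDMA userspace tools. A minimal sketch, assuming a Mellanox mlx5 device (mlx5_0) on port 1 and GID index 0; adjust device, port, and GID index to your setup, since which index maps to RoCE v1 vs v2 varies between configurations:

  # list RDMA devices and ports visible to libibverbs
  # (device name goes into ms_async_rdma_device_name)
  ibv_devinfo

  # read a GID for a given device/port/index straight from sysfs
  # (candidate value for ms_async_rdma_local_gid)
  cat /sys/class/infiniband/mlx5_0/ports/1/gids/0

  # basic verbs-level RC ping-pong between two hosts over RoCE,
  # using GID index 0 on both sides
  ibv_rc_pingpong -d mlx5_0 -g 0              # on the server
  ibv_rc_pingpong -d mlx5_0 -g 0 <server_ip>  # on the client

The show_gids script shipped with Mellanox OFED, if installed, prints each GID together with its RoCE version; ruling out a basic RoCE connectivity problem with ibv_rc_pingpong can be a useful first step before debugging errors such as the "No route to host" seen above on the Ceph side.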