We use Mellanox InfiniBand cards to build an IB cluster. There are several storage nodes and more than 20 clients. The GlusterFS version is 3.11; the storage servers run CentOS 6.5 and the clients run CentOS 7.3. Previously we used IP over IB and everything worked fine. After switching to RDMA we get higher bandwidth, but we often see brick-disconnect messages in the client logs, while nothing abnormal appears in the brick logs at the same time. Although all bricks eventually reconnect, this leads to serious problems; for example, a simple "ls" or "df" can take several minutes to complete.
Here is an example of the brick-disconnect messages in one client's log:
[2017-12-21 10:45:47.476597] C [rpc-clnt-ping.c:186:rpc_clnt_ping_timer_expired] 0-data-client-129: server 10.0.0.35:49204 has not responded in the last 60 seconds, disconnecting.(trans1:0,trans2:0)
[2017-12-21 10:45:47.478820] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-data-client-129: disconnected from data-client-129. Client process will keep trying to connect to glusterd until brick's port is available
[2017-12-21 10:45:47.479267] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f565546230b] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f56552279fe] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f5655227b0e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f5655229280] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f5655229d30] ))))) 0-data-client-129: forced unwinding frame type(GlusterFS 3.3) op(ENTRYLK(31)) called at 2017-12-23 10:43:52.887616 (xid=0x9da3f5)
[2017-12-21 10:45:47.479317] E [MSGID: 114031] [client-rpc-fops.c:1646:client3_3_entrylk_cbk] 0-data-client-129: remote operation failed [Transport endpoint is not connected]
[2017-12-21 10:45:47.479352] E [MSGID: 108007] [afr-lk-common.c:825:afr_unlock_entrylk_cbk] 0-data-replicate-64: /data/a3581.data: unlock failed on data-client-129 [Transport endpoint is not connected]
[2017-12-21 10:45:47.479718] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f565546230b] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f56552279fe] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f5655227b0e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f5655229280] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f5655229d30] ))))) 0-data-client-129: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2017-12-23 10:43:53.249305 (xid=0x9da3f6)
[2017-12-21 10:45:47.479771] W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data-client-129: remote operation failed. Path: /data/b07869.data (fe89d36e-16b8-4b06-bd36-69023217db9f) [Transport endpoint is not connected]
[2017-12-21 10:45:47.480644] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f565546230b] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f56552279fe] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f5655227b0e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f5655229280] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f5655229d30] ))))) 0-data-client-129: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2017-12-23 10:44:47.468222 (xid=0x9da3f7)
[2017-12-21 10:45:47.480682] W [rpc-clnt-ping.c:243:rpc_clnt_ping_cbk] 0-data-client-129: socket disconnected
[2017-12-21 10:45:47.481046] W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data-client-129: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
[2017-12-21 10:45:58.497609] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-data-client-129: changing port to 49204 (from 0)
[2017-12-21 10:45:58.512289] I [MSGID: 114057] [client-handshake.c:1451:select_server_supported_programs] 0-data-client-129: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-12-21 10:45:58.517383] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-data-client-129: Connected to data-client-129, attached to remote volume '/disks/xnuyUF3N/brick'..
We find that the rq_num_rnr hardware counter of the IB card on some clients is very large:
# cat /sys/class/infiniband/mlx4_0/ports/1/hw_counters/rq_num_rnr
943004905
The corresponding value on the storage node is also large:
# cat /sys/class/infiniband/mlx4_0/ports/1/hw_counters/rq_num_rnr
23193068
When we use IP over IB instead, this counter stays at 0. On the clients where we do not see the brick-disconnect problem, rq_num_rnr is also zero.
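A small watcher like the following makes it easy to correlate counter growth with the disconnect timestamps in the client log. This is only a rough sketch: the device and port are hard-coded to the mlx4_0 path shown above and would need adjusting for other HCAs.

/* Poll rq_num_rnr once per second and print the per-second delta. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define RNR_PATH "/sys/class/infiniband/mlx4_0/ports/1/hw_counters/rq_num_rnr"

static long long read_counter(void)
{
    long long val = -1;
    FILE *fp = fopen(RNR_PATH, "r");
    if (!fp)
        return -1;
    if (fscanf(fp, "%lld", &val) != 1)
        val = -1;
    fclose(fp);
    return val;
}

int main(void)
{
    long long prev = read_counter();
    if (prev < 0) {
        perror(RNR_PATH);
        return 1;
    }
    for (;;) {
        sleep(1);
        long long cur = read_counter();
        if (cur < 0)
            break;
        printf("%ld rq_num_rnr=%lld delta=%lld\n",
               (long) time(NULL), cur, cur - prev);
        prev = cur;
    }
    return 0;
}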
We suspect this is an RDMA flow-control problem: one side sends data faster than the other side can post receive buffers, the receiver answers with RNR (receiver-not-ready) NAKs, and rq_num_rnr increases.
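For context, at the verbs level the RNR behaviour of a reliably-connected QP is governed by two attributes: the responder's min_rnr_timer (set during the INIT-to-RTR transition) and the requester's rnr_retry count (set during the RTR-to-RTS transition). The sketch below is not GlusterFS code, only an illustration of where those knobs live in libibverbs; it assumes 'attr' has already been filled with the usual connection parameters (ah_attr, dest_qp_num, PSNs, timeout, and so on).

#include <infiniband/verbs.h>

static int move_qp_with_rnr_settings(struct ibv_qp *qp,
                                     struct ibv_qp_attr *attr)
{
    /* Responder side (INIT -> RTR): min_rnr_timer is the delay the
     * responder advertises in its RNR NAK; the value 12 encodes 0.64 ms. */
    attr->qp_state      = IBV_QPS_RTR;
    attr->min_rnr_timer = 12;
    if (ibv_modify_qp(qp, attr,
                      IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* Requester side (RTR -> RTS): rnr_retry is how many times a send is
     * retried after an RNR NAK before the QP goes into the error state;
     * the value 7 means "retry indefinitely". */
    attr->qp_state  = IBV_QPS_RTS;
    attr->rnr_retry = 7;
    return ibv_modify_qp(qp, attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}

If rnr_retry is already 7 in the GlusterFS transport, the RNR NAKs themselves would not tear down the connection, but the repeated retries could stall ping traffic long enough for the 60-second ping timer in the client log above to expire.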
Does the RDMA transport in GlusterFS currently implement any flow control?
And can we avoid this problem by adjusting the macros defined in rdma.h?