We use Mellanox InfiniBand cards to build an IB cluster. There are several storage nodes and more than 20 clients. The GlusterFS version is 3.11; the storage servers run CentOS 6.5 and the clients run CentOS 7.3. Previously we used IP over IB and everything worked fine. After switching to RDMA we get higher bandwidth, but we often see brick-disconnect messages in the client logs, while nothing abnormal appears in the brick logs at the same time. Although all bricks eventually reconnect, this leads to serious problems; for example, a simple "ls" or "df" can take several minutes to complete.
Here is an example of the brick-disconnect messages in one client's log:
[2017-12-21 10:45:47.476597] C [rpc-clnt-ping.c:186:rpc_clnt_ping_timer_expired] 0-data-client-129: server 10.0.0.35:49204 has not responded in the last 60 seconds, disconnecting.(trans1:0,trans2:0)
[2017-12-21 10:45:47.478820] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-data-client-129: disconnected from data-client-129. Client process will keep trying to connect to glusterd until brick's port is available
[2017-12-21 10:45:47.479267] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f565546230b] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f56552279fe] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f5655227b0e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f5655229280] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f5655229d30] ))))) 0-data-client-129: forced unwinding frame type(GlusterFS 3.3) op(ENTRYLK(31)) called at 2017-12-23 10:43:52.887616 (xid=0x9da3f5)
[2017-12-21 10:45:47.479317] E [MSGID: 114031] [client-rpc-fops.c:1646:client3_3_entrylk_cbk] 0-data-client-129: remote operation failed [Transport endpoint is not connected]
[2017-12-21 10:45:47.479352] E [MSGID: 108007] [afr-lk-common.c:825:afr_unlock_entrylk_cbk] 0-data-replicate-64: /data/a3581.data: unlock failed on data-client-129 [Transport endpoint is not connected]
[2017-12-21 10:45:47.479718] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f565546230b] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f56552279fe] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f5655227b0e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f5655229280] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f5655229d30] ))))) 0-data-client-129: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2017-12-23 10:43:53.249305 (xid=0x9da3f6)
[2017-12-21 10:45:47.479771] W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data-client-129: remote operation failed. Path: /data/b07869.data (fe89d36e-16b8-4b06-bd36-69023217db9f) [Transport endpoint is not connected]
[2017-12-21 10:45:47.480644] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f565546230b] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f56552279fe] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f5655227b0e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f5655229280] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f5655229d30] ))))) 0-data-client-129: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2017-12-23 10:44:47.468222 (xid=0x9da3f7)
[2017-12-21 10:45:47.480682] W [rpc-clnt-ping.c:243:rpc_clnt_ping_cbk] 0-data-client-129: socket disconnected
[2017-12-21 10:45:47.481046] W [MSGID: 114031] [client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data-client-129: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]
[2017-12-21 10:45:58.497609] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-data-client-129: changing port to 49204 (from 0)
[2017-12-21 10:45:58.512289] I [MSGID: 114057] [client-handshake.c:1451:select_server_supported_programs] 0-data-client-129: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-12-21 10:45:58.517383] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-data-client-129: Connected to data-client-129, attached to remote volume '/disks/xnuyUF3N/brick'..
We find that the rq_num_rnr hardware counter of the IB card on some clients is very large:
# cat /sys/class/infiniband/mlx4_0/ports/1/hw_counters/rq_num_rnr
943004905
The corresponding value on the storage node is also large:
# cat /sys/class/infiniband/mlx4_0/ports/1/hw_counters/rq_num_rnr
23193068
When we use IP over IB instead, this counter stays at 0. On the clients where we do not see the brick-disconnect problem, rq_num_rnr is also zero.
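A small watcher like the following makes it easy to correlate counter growth with the disconnect timestamps in the client log. This is only a rough sketch: the device and port are hard-coded to the mlx4_0 path shown above and would need adjusting for other HCAs.

/* Poll rq_num_rnr once per second and print the per-second delta. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define RNR_PATH "/sys/class/infiniband/mlx4_0/ports/1/hw_counters/rq_num_rnr"

static long long read_counter(void)
{
    long long val = -1;
    FILE *fp = fopen(RNR_PATH, "r");
    if (!fp)
        return -1;
    if (fscanf(fp, "%lld", &val) != 1)
        val = -1;
    fclose(fp);
    return val;
}

int main(void)
{
    long long prev = read_counter();
    if (prev < 0) {
        perror(RNR_PATH);
        return 1;
    }
    for (;;) {
        sleep(1);
        long long cur = read_counter();
        if (cur < 0)
            break;
        printf("%ld rq_num_rnr=%lld delta=%lld\n",
               (long) time(NULL), cur, cur - prev);
        prev = cur;
    }
    return 0;
}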
We suspect this is an RDMA flow-control problem: one side sends data faster than the other side can post receive buffers, the receiver answers with RNR (receiver-not-ready) NAKs, and rq_num_rnr increases.
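For context, at the verbs level the RNR behaviour of a reliably-connected QP is governed by two attributes: the responder's min_rnr_timer (set during the INIT-to-RTR transition) and the requester's rnr_retry count (set during the RTR-to-RTS transition). The sketch below is not GlusterFS code, only an illustration of where those knobs live in libibverbs; it assumes 'attr' has already been filled with the usual connection parameters (ah_attr, dest_qp_num, PSNs, timeout, and so on).

#include <infiniband/verbs.h>

static int move_qp_with_rnr_settings(struct ibv_qp *qp,
                                     struct ibv_qp_attr *attr)
{
    /* Responder side (INIT -> RTR): min_rnr_timer is the delay the
     * responder advertises in its RNR NAK; the value 12 encodes 0.64 ms. */
    attr->qp_state      = IBV_QPS_RTR;
    attr->min_rnr_timer = 12;
    if (ibv_modify_qp(qp, attr,
                      IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* Requester side (RTR -> RTS): rnr_retry is how many times a send is
     * retried after an RNR NAK before the QP goes into the error state;
     * the value 7 means "retry indefinitely". */
    attr->qp_state  = IBV_QPS_RTS;
    attr->rnr_retry = 7;
    return ibv_modify_qp(qp, attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}

If rnr_retry is already 7 in the GlusterFS transport, the RNR NAKs themselves would not tear down the connection, but the repeated retries could stall ping traffic long enough for the 60-second ping timer in the client log above to expire.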
Does the RDMA transport in GlusterFS currently implement any flow control?
And can we avoid this problem by adjusting the macros defined in rdma.h?