3.1.1 crashing under moderate load

lana.deere at gmail.com (Lana Deere) · Fri, 3 Dec 2010 15:13:30 -0500

I'm running GlusterFS 3.1.1, CentOS5.5 servers, CentOS5.4 clients, RDMA
transport, native/fuse access.

I have a directory which is shared on the Gluster volume.  In fact, it is a
clone of /lib from one of the clients, shared so all clients can see it.

I have a script which does
    find lib -type f -print0 | xargs -0 sum | md5sum
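
As a possible refinement (a sketch only, not what I actually ran): the
one-liner above depends on the order in which find returns file names.
Sorting the names first would rule out ordering as a cause of differing
checksums. The function name tree_sum is made up for illustration, and
sort -z assumes GNU coreutils.

```shell
#!/bin/bash
# Checksum every regular file under a directory in a stable order.
# tree_sum is a hypothetical helper name, not part of the original script.
tree_sum() {
    local dir=$1
    # sort -z (GNU) orders the NUL-separated names, so a different
    # directory-listing order on another host cannot change the result.
    find "$dir" -type f -print0 | LC_ALL=C sort -z | xargs -0 sum | md5sum
}
```

Usage would be "tree_sum lib", run from the same directory as the original
one-liner.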

If I run this on my clients one at a time, they all yield the same md5sum:
    for h in <<hosts>>; do ssh $h script; done

If I run this on my clients concurrently, they still all yield the same
md5sum with up to roughly 25 clients at a time:
    for h in <<hosts>>; do ssh $h script& done

Beyond roughly 25 concurrent clients, access to the Gluster share often, but
not always, fails.  The errors vary:
    - sometimes I get "sum: xxx.so not found"
    - sometimes I get the wrong checksum without any error message
    - sometimes the job simply hangs until I kill it
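
For anyone trying to reproduce this, a concurrent run that keeps each
host's result separate makes it easier to see which clients returned a bad
checksum. The following is only a sketch under assumptions: RUNNER and
run_concurrent are made-up names, and the host list is whatever you pass in.

```shell
#!/bin/bash
# Run one command on many hosts at once and print "host result" lines.
# RUNNER and run_concurrent are hypothetical names for illustration.
RUNNER=${RUNNER:-ssh}

run_concurrent() {
    local cmd=$1; shift
    local out h
    out=$(mktemp -d)
    for h in "$@"; do
        # One output file per host; stdin comes from /dev/null so that
        # backgrounded ssh processes do not compete for the terminal.
        $RUNNER "$h" "$cmd" > "$out/$h" 2> "$out/$h.err" < /dev/null &
    done
    wait    # returns once every backgrounded job has finished
    for h in "$@"; do
        printf '%s %s\n' "$h" "$(cat "$out/$h")"
    done
    rm -rf "$out"
}
```

Something like "run_concurrent script host1 host2 ... | awk '{print $2}' |
sort | uniq -c" then counts how many hosts produced each checksum. A hung
job will still block at the wait, as in the original loop.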

Some of the server logs show messages like these from the time of the
failures (other servers show nothing from around that time):

[2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
socket (peer: 10.54.255.240:1022) after handshake is complete
[2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x55e82, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
(rdma.RaidData-server)
[2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] :
Reply submission failed
[2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x55e83, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
(rdma.RaidData-server)
[2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] :
Reply submission failed

On a client which had a failure I see messages like:

[2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler]
rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket
(peer: 10.54.50.101:24009) after handshake is complete
[2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
[0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
op(READ(12)) called at 2010-12-03 10:03:06.20492
[2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
[0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
op(READ(12)) called at 2010-12-03 10:03:06.20529
[2010-12-03 10:03:06.26827] I
[client-handshake.c:993:select_server_supported_programs]
RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437),
Version (310)
[2010-12-03 10:03:06.27029] I
[client-handshake.c:829:client_setvolume_cbk] RaidData-client-1:
Connected to 10.54.50.101:24009, attached to remote volume '/data'.
[2010-12-03 10:03:06.27067] I
[client-handshake.c:698:client_post_handshake] RaidData-client-1: 2
fds open - Delaying child_up until they are re-opened

Has anyone else seen anything like this, or have suggestions about options I
can set to work around it?

.. Lana (lana.deere at gmail.com)