I'm running GlusterFS 3.1.1 with CentOS 5.5 servers, CentOS 5.4 clients, RDMA transport, and native/FUSE access. I have a directory shared on the gluster volume; it is a clone of /lib from one of the clients, shared so all of them can see it.

I have a script which does:

    find lib -type f -print0 | xargs -0 sum | md5sum

If I run this on my clients one at a time, they all yield the same md5sum:

    for h in <<hosts>>; do ssh $h script; done

If I run this on my clients concurrently, up to roughly 25 at a time, they still yield the same md5sum:

    for h in <<hosts>>; do ssh $h script & done

Beyond that, the gluster share often, but not always, fails. The errors vary:

- sometimes I get "sum: xxx.so not found"
- sometimes I get the wrong checksum, without any error message
- sometimes the job simply hangs until I kill it

Some of the server logs show messages like these from the time of the failures (other servers show nothing from around that time):

[2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler] rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp socket (peer: 10.54.255.240:1022) after handshake is complete
[2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x55e82, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport (rdma.RaidData-server)
[2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] : Reply submission failed
[2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x55e83, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport (rdma.RaidData-server)
[2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] : Reply submission failed

On a client which had a failure I see messages like:

[2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler] rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket (peer: 10.54.50.101:24009) after handshake is complete
[2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(READ(12)) called at 2010-12-03 10:03:06.20492
[2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(READ(12)) called at 2010-12-03 10:03:06.20529
[2010-12-03 10:03:06.26827] I [client-handshake.c:993:select_server_supported_programs] RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2010-12-03 10:03:06.27029] I [client-handshake.c:829:client_setvolume_cbk] RaidData-client-1: Connected to 10.54.50.101:24009, attached to remote volume '/data'.
[2010-12-03 10:03:06.27067] I [client-handshake.c:698:client_post_handshake] RaidData-client-1: 2 fds open - Delaying child_up until they are re-opened

Has anyone else seen anything like this, and/or does anyone have suggestions for options I can set to work around it?

.. Lana (lana.deere at gmail.com)
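P.S. In case it helps anyone reproduce this, here is the concurrent test loop above expanded into a self-contained sketch. The host names and the mount path are placeholders, not my actual setup:

    #!/bin/sh
    # Checksum the shared lib tree from every client at once and tag
    # each result with the client's hostname.  HOSTS and /mnt/gluster
    # are placeholders -- substitute the real client list and the
    # gluster mount point.
    HOSTS="client01 client02 client03"

    for h in $HOSTS; do
        ssh "$h" 'cd /mnt/gluster && find lib -type f -print0 | xargs -0 sum | md5sum' \
            | sed "s/^/$h: /" &
    done
    wait    # every line printed should carry the identical md5sum

A missing line (hung client), a "not found" error from sum, or a divergent md5sum corresponds to the failures described above.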