Also, what configuration are you using? Is it replicate, distribute, stripe, or a single client and server? Can you attach the entire client and server log files?

----- Original Message -----
From: "Raghavendra G" <raghavendra at gluster.com>
To: "Lana Deere" <lana.deere at gmail.com>
Cc: gluster-users at gluster.org
Sent: Monday, December 6, 2010 9:32:36 AM
Subject: Re: 3.1.1 crashing under moderate load

correction: "clients are mounted on how many physical nodes are there in your test setup?" should have been: "clients are mounted on how many physical nodes in your test setup?"

----- Original Message -----
From: "Raghavendra G" <raghavendra at gluster.com>
To: "Lana Deere" <lana.deere at gmail.com>
Cc: gluster-users at gluster.org
Sent: Monday, December 6, 2010 9:30:57 AM
Subject: Re: 3.1.1 crashing under moderate load

Hi Lana,

I need some clarifications about the test setup:

* Are you seeing the problem when there are more than 25 clients? If so, are these clients on different physical nodes, or do several clients share the same node? In other words, the clients are mounted on how many physical nodes in your test setup? Also, are you running the command on each of these clients simultaneously?

* Or is it that there are more than 25 concurrent invocations of the script? If so, how many clients are present in your test setup, and on how many physical nodes are these clients mounted?

regards,

----- Original Message -----
From: "Lana Deere" <lana.deere at gmail.com>
To: gluster-users at gluster.org
Sent: Saturday, December 4, 2010 12:13:30 AM
Subject: 3.1.1 crashing under moderate load

I'm running GlusterFS 3.1.1 with CentOS 5.5 servers, CentOS 5.4 clients, RDMA transport, and native/FUSE access.

I have a directory which is shared on the gluster volume. In fact, it is a clone of /lib from one of the clients, shared so all the clients can see it.
I have a script which does

    find lib -type f -print0 | xargs -0 sum | md5sum

If I run this on my clients one at a time, they all yield the same md5sum:

    for h in <<hosts>>; do ssh $h script; done

If I run this on my clients concurrently, up to roughly 25 at a time, they still yield the same md5sum:

    for h in <<hosts>>; do ssh $h script & done

Beyond that, the gluster share often, but not always, fails. The errors vary:

- sometimes I get "sum: xxx.so not found"
- sometimes I get the wrong checksum without any error message
- sometimes the job simply hangs until I kill it

Some of the server logs show messages like these from the time of the failures (other servers show nothing from around that time):

[2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler] rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp socket (peer: 10.54.255.240:1022) after handshake is complete
[2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x55e82, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport (rdma.RaidData-server)
[2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] : Reply submission failed
[2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x55e83, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport (rdma.RaidData-server)
[2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] : Reply submission failed

On a client which had a failure I see messages like:

[2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler] rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket (peer: 10.54.50.101:24009) after handshake is complete
[2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x3814a0ef1e]
(-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(READ(12)) called at 2010-12-03 10:03:06.20492
[2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(READ(12)) called at 2010-12-03 10:03:06.20529
[2010-12-03 10:03:06.26827] I [client-handshake.c:993:select_server_supported_programs] RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2010-12-03 10:03:06.27029] I [client-handshake.c:829:client_setvolume_cbk] RaidData-client-1: Connected to 10.54.50.101:24009, attached to remote volume '/data'.
[2010-12-03 10:03:06.27067] I [client-handshake.c:698:client_post_handshake] RaidData-client-1: 2 fds open - Delaying child_up until they are re-opened

Anyone else seen anything like this and/or have suggestions about options I can set to work around this?

.. Lana (lana.deere at gmail.com)

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
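[Editor's note] Lana's checksum pipeline can be exercised locally as a self-contained sketch. Everything below is illustrative: the temporary directory and its files merely stand in for the gluster-mounted clone of /lib, and a `sort -z` is added (not present in the original one-liner) so repeated runs compare file lists in a deterministic order.

```shell
#!/bin/sh
# Sketch of the per-host checksum test from the thread.
# DIR stands in for the gluster-mounted copy of /lib; here it is a
# throwaway local directory so the sketch runs anywhere.
DIR=$(mktemp -d)
echo "alpha" > "$DIR/a.so"
echo "beta"  > "$DIR/b.so"

# Same pipeline as in the report: checksum every regular file, then hash
# the combined listing so each run reduces to a single comparable value.
# sort -z makes the file order deterministic between runs.
sum1=$(find "$DIR" -type f -print0 | sort -z | xargs -0 sum | md5sum | awk '{print $1}')
sum2=$(find "$DIR" -type f -print0 | sort -z | xargs -0 sum | md5sum | awk '{print $1}')

# Two runs over an unchanged tree must agree; in the reported failure,
# concurrent runs across >25 clients sometimes disagreed or errored out.
if [ "$sum1" = "$sum2" ]; then
    echo "checksums agree"
else
    echo "MISMATCH"
fi
rm -rf "$DIR"
```

On the real setup this script would be pushed to each client and launched via the `for h in <<hosts>>; do ssh $h script & done` loop from the report, with the resulting md5sums compared across hosts rather than across two local runs.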