One other observation is that it seems to be genuinely related to the
number of nodes involved. If I run, say, 50 instances of my script
using 50 separate nodes, then they almost always generate some
failures. If I run the same number of instances, or even a much
greater number, but using only 10 separate nodes, then they always
seem to work OK. Maybe this is due to some kind of caching
behaviour?

.. Lana (lana.deere at gmail.com)


On Mon, Dec 6, 2010 at 11:05 AM, Lana Deere <lana.deere at gmail.com> wrote:
> The gluster configuration is distribute, there are 4 server nodes.
>
> There are 53 physical client nodes in my setup, each with 8 cores; we
> want to sometimes run more than 400 client processes simultaneously.
> In practice we aren't yet trying that many.
>
> When I run the commands which break, I am running them on separate
> clients simultaneously.
>    for host in <hosts>; do ssh $host script& done  # Note the &
> When I run on 25 clients simultaneously so far I have not seen it
> fail. But if I run on 40 or 50 simultaneously it often fails.
>
> Sometimes I have run more than one command on each client
> simultaneously by listing all the hosts multiple times in the
> for-loop,
>    for host in <hosts> <hosts> <hosts>; do ssh $host script& done
> In the example of 3 at a time I have noticed that when a host works,
> all three on that client will work; but when it fails, all three will
> fail in exactly the same fashion.
>
> I've attached a tarfile containing two sets of logs. In both cases I
> had rotated all the log files and rebooted everything then run my
> test. In the first set of logs, I went directly to approx. 50
> simultaneous sessions, and pretty much all of them just hung. (When
> the find hangs, even a kill -9 will not unhang it.) So I rotated the
> logs again and rebooted everything, but this time I gradually worked
> my way up to higher loads.
> This time the failures were mostly cases
> with the wrong checksum but no error message, though some of them did
> give me errors like
>    find: lib/kbd/unimaps/cp865.uni: Invalid argument
>
> Thanks. I may try downgrading to 3.1.0 just to see if I have the same
> problem there.
>
>
> .. Lana (lana.deere at gmail.com)
>
>
> On Mon, Dec 6, 2010 at 12:30 AM, Raghavendra G <raghavendra at gluster.com> wrote:
>> Hi Lana,
>>
>> I need some clarifications about the test setup:
>>
>> * Are you seeing the problem when there are more than 25 clients? If
>> this is the case, are these clients on different physical nodes, or
>> does more than one client share the same node? In other words, on how
>> many physical nodes in your test setup are clients mounted? Also, are
>> you running the command on each of these clients simultaneously?
>>
>> * Or is it that there are more than 25 concurrent invocations of the
>> script? If this is the case, how many clients are present in your
>> test setup, and on how many physical nodes are these clients mounted?
>>
>> regards,
>> ----- Original Message -----
>> From: "Lana Deere" <lana.deere at gmail.com>
>> To: gluster-users at gluster.org
>> Sent: Saturday, December 4, 2010 12:13:30 AM
>> Subject: 3.1.1 crashing under moderate load
>>
>> I'm running GlusterFS 3.1.1, CentOS5.5 servers, CentOS5.4 clients, RDMA
>> transport, native/fuse access.
>>
>> I have a directory which is shared on the gluster. In fact, it is a clone
>> of /lib from one of the clients, shared so all can see it.
>>
>> I have a script which does
>>    find lib -type f -print0 | xargs -0 sum | md5sum
>>
>> If I run this on my clients one at a time, they all yield the same md5sum:
>>    for host in <<hosts>>; do ssh $host script; done
>>
>> If I run this on my clients concurrently, up to roughly 25 at a time they
>> still yield the same md5sum.
>>    for host in <<hosts>>; do ssh $host script& done
>>
>> Beyond that the gluster share often, but not always, fails.
>> The errors vary.
>>    - sometimes I get "sum: xxx.so not found"
>>    - sometimes I get the wrong checksum without any error message
>>    - sometimes the job simply hangs until I kill it
>>
>>
>> Some of the server logs show messages like these from the time of the
>> failures (other servers show nothing from around that time):
>>
>> [2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
>> rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
>> socket (peer: 10.54.255.240:1022) after handshake is complete
>> [2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>> rpc-service: failed to submit message (XID: 0x55e82, Program:
>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>> (rdma.RaidData-server)
>> [2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] :
>> Reply submission failed
>> [2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>> rpc-service: failed to submit message (XID: 0x55e83, Program:
>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>> (rdma.RaidData-server)
>> [2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] :
>> Reply submission failed
>>
>>
>> On a client which had a failure I see messages like:
>>
>> [2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler]
>> rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket
>> (peer: 10.54.50.101:24009) after handshake is complete
>> [2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>> op(READ(12)) called at 2010-12-03 10:03:06.20492
>> [2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>> op(READ(12)) called at 2010-12-03 10:03:06.20529
>> [2010-12-03 10:03:06.26827] I
>> [client-handshake.c:993:select_server_supported_programs]
>> RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437),
>> Version (310)
>> [2010-12-03 10:03:06.27029] I
>> [client-handshake.c:829:client_setvolume_cbk] RaidData-client-1:
>> Connected to 10.54.50.101:24009, attached to remote volume '/data'.
>> [2010-12-03 10:03:06.27067] I
>> [client-handshake.c:698:client_post_handshake] RaidData-client-1: 2
>> fds open - Delaying child_up until they are re-opened
>>
>>
>> Anyone else seen anything like this and/or have suggestions about options I can
>> set to work around this?
>>
>>
>> .. Lana (lana.deere at gmail.com)
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>>
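
For anyone trying to reproduce this, the concurrent test described in the thread could be wrapped in a small shell helper that keeps each host's output separate, so that wrong checksums and empty results are easy to spot afterwards. What follows is only a rough sketch under stated assumptions, not the script used in the thread: the function names run_concurrently and report_mismatches, the RUNNER and SCRIPT variables, and the per-host output files are all hypothetical. It also assumes the most common checksum is the correct one, which may not hold if most clients fail.

```shell
#!/bin/sh
# Sketch only: every name below is an illustrative assumption.

# Launch the test script on every listed host at once (note the &),
# saving each host's stdout and stderr to its own file in $outdir.
# RUNNER defaults to ssh; SCRIPT defaults to "script" as in the thread.
run_concurrently() {
    outdir=$1
    shift
    for host in "$@"; do
        ${RUNNER:-ssh} "$host" "${SCRIPT:-script}" \
            > "$outdir/$host.out" 2> "$outdir/$host.err" &
    done
    # Blocks until all background jobs exit; a hung job stalls here.
    wait
}

# Print "host checksum" for every host whose checksum differs from the
# most common one. Hosts that produced no output are shown as EMPTY.
report_mismatches() {
    outdir=$1
    # Assumption: the checksum reported by the most hosts is the reference.
    ref=$(awk '{print $1}' "$outdir"/*.out | sort | uniq -c |
          sort -rn | head -n 1 | awk '{print $2}')
    for f in "$outdir"/*.out; do
        host=$(basename "$f" .out)
        sum=$(awk '{print $1; exit}' "$f")
        if [ "$sum" != "$ref" ]; then
            echo "$host ${sum:-EMPTY}"
        fi
    done
}
```

A possible invocation, with node01 to node03 as placeholder host names:

    outdir=$(mktemp -d)
    run_concurrently "$outdir" node01 node02 node03
    report_mismatches "$outdir"

Because wait blocks until every background job has exited, a find that hangs in the way described above would stall the whole run; the .err files are kept so that messages such as "Invalid argument" can be matched to the host that produced them.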