I have been testing 3.1.2 over the last few days. My overall impression is that it resolved several bugs from 3.1.1, but the latest version is still prone to crashing under moderate to heavy loads.

I was running some stress tests on a two-server replicated setup today with ~150 clients connected over RDMA. The glusterfsd process crashed on one server. I waited about 30 minutes to see if the automatic fail-over would work, but I continued to receive "Transport: endpoint not connected" error messages on all the clients.

I saw the following error messages in the server log (I removed several hundred error messages from the following snippet):

[2011-01-21 15:10:13.804308] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x66540x, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport (rdma.supportdir-server)
[2011-01-21 15:10:13.804314] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x64658x, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport (rdma.supportdir-server)
[2011-01-21 15:10:13.804342] E [server.c:137:server_submit_reply] : Reply submission failed
[2011-01-21 15:10:13.804365] E [server.c:137:server_submit_reply] : Reply submission failed
[2011-01-21 15:10:13.804636] I [server.c:428:server_rpc_notify] supportdir-server: disconnected connection from 192.168.50.7:1020
[2011-01-21 15:10:13.804702] I [server-helpers.c:670:server_connection_destroy] supportdir-server: destroyed connection of n7-12719-2011/01/19-17:36:59:497983-supportdir-client-0
[2011-01-21 15:10:13.805028] I [server.c:428:server_rpc_notify] supportdir-server: disconnected connection from 192.168.50.127:1020
[2011-01-21 15:10:13.805071] I [server-helpers.c:670:server_connection_destroy] supportdir-server: destroyed connection of n127-12567-2011/01/19-17:43:17:468018-supportdir-client-0
pending frames:

patchset: v3.1.1-64-gf2a067c
signal received: 11
time of crash: 2011-01-21 15:10:13
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.2
/lib64/libc.so.6(+0x32a60)[0x7fc2a7f64a60]
/usr/local/glusterfs/3.1.2/lib/glusterfs/3.1.2/xlator/protocol/server.so(server_release+0x54)[0x7fc2a4f05454]
/usr/local/glusterfs/3.1.2/lib/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x26f)[0x7fc2a88d25ef]
/usr/local/glusterfs/3.1.2/lib/libgfrpc.so.0(rpcsvc_notify+0x123)[0x7fc2a88d2c23]
/usr/local/glusterfs/3.1.2/lib/libgfrpc.so.0(rpc_transport_notify+0x2d)[0x7fc2a88d6a9d]
/usr/local/glusterfs/3.1.2/lib/glusterfs/3.1.2/rpc-transport/rdma.so(rdma_pollin_notify+0xd1)[0x7fc2a4ae68b1]
/usr/local/glusterfs/3.1.2/lib/glusterfs/3.1.2/rpc-transport/rdma.so(rdma_process_recv+0x14b)[0x7fc2a4ae6e8b]
/usr/local/glusterfs/3.1.2/lib/glusterfs/3.1.2/rpc-transport/rdma.so(+0xb226)[0x7fc2a4ae7226]
/lib64/libpthread.so.0(+0x6a4f)[0x7fc2a8298a4f]
/lib64/libc.so.6(clone+0x6d)[0x7fc2a800282d]

I think the crash is related to this bug:
http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2197

I also ran some smaller tests on a single-server setup. There were ~50 clients connected via RDMA. While the jobs were running, several of them crashed with "File descriptor in bad state" or "Stale File Descriptor" errors.
Here are the error messages from the server log for that test (I removed dozens of similar error messages):

[2011-01-21 10:15:52.442908] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x16660x, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport (rdma.maindir-server)
[2011-01-21 10:15:52.443012] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x20251x, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport (rdma.maindir-server)
[2011-01-21 10:15:52.442949] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x77360x, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport (rdma.maindir-server)
[2011-01-21 10:15:52.443351] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x26495832x, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 40) to rpc-transport (rdma.maindir-server)
[2011-01-21 10:15:52.445247] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x25199x, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport (rdma.maindir-server)
[2011-01-21 10:15:52.445291] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x60907x, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 27) to rpc-transport (rdma.maindir-server)
[2011-01-21 10:15:52.447572] I [server.c:428:server_rpc_notify] maindir-server: disconnected connection from 192.168.50.116:1018
[2011-01-21 10:15:52.455116] E [server.c:137:server_submit_reply] : Reply submission failed
[2011-01-21 10:15:52.455227] E [server.c:137:server_submit_reply] : Reply submission failed
[2011-01-21 10:15:52.455325] E [server.c:137:server_submit_reply] : Reply submission failed
[2011-01-21 10:15:52.455436] E [server.c:137:server_submit_reply] : Reply submission failed
[2011-01-21 10:15:52.455896] I [server-helpers.c:670:server_connection_destroy] maindir-server: destroyed connection of n116-14977-2011/01/20-12:43:18:128066-maindir-client-0
[2011-01-21 10:15:52.455610] E [server.c:137:server_submit_reply] : Reply submission failed
[2011-01-21 10:15:52.455659] E [server.c:137:server_submit_reply] : Reply submission failed
[2011-01-21 10:15:52.455564] E [server.c:137:server_submit_reply] : Reply submission failed
[2011-01-21 10:15:52.458581] I [server.c:428:server_rpc_notify] maindir-server: disconnected connection from 192.168.50.19:1018
[2011-01-21 10:15:52.458677] I [server-helpers.c:670:server_connection_destroy] maindir-server: destroyed connection of n19-15053-2011/01/20-12:38:13:243408-maindir-client-0

The glusterfsd process did not crash in that instance.

Jeremy Stout

On Fri, Jan 21, 2011 at 6:49 AM, David Lloyd <david.lloyd at v-consultants.co.uk> wrote:
> Hello,
>
> Haven't heard much feedback about installing glusterfs 3.1.2.
>
> Should I infer that it's all gone extremely very smoothly for everyone, or
> is everyone being as cowardly as me and waiting for others to do it first?
>
> Cheers
> David
>
> --
> David Lloyd
> V Consultants
> www.v-consultants.co.uk
> tel: +44 7983 816501
> skype: davidlloyd1243
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users