Hello,

we are seeing the gluster mount break down on a single client that runs wget downloads in parallel (70 threads at ~20 MB/s in total on a 1 Gb network). The other 23 clients did not drop the gluster mount. In the client logs we see "destroyed connection" messages. After force-unmounting and mounting again, everything works fine.

We had a gluster rebalance running before the data download and stopped it manually so that it would not interfere with the download.

I am asking for your experience: could this be related to network outages or overload? Or is something broken with the stopped rebalance operation in 3.2.5? Would an upgrade to 3.3.1 improve the situation, i.e. reconnect instead of "destroyed connection" and dangling mount points on the client?

Heiko

Attached please find the client log and the logs of two affected bricks:

gluster 3.2.5

rd28 ~ # gluster volume info all

Volume Name: data
Type: Distribute
Status: Started
Number of Bricks: 16
Transport-type: tcp
Bricks:
Brick1: rd29:/data
Brick2: rd34:/data
Brick3: rd28:/data
Brick4: rd24:/data
Brick5: rd26:/data
Brick6: rd27:/data
Brick7: rd21:/data
Brick8: rd20:/data
Brick9: rd22:/data
Brick10: rd23:/data
Brick11: rd30:/data
Brick12: rd31:/data
Brick13: rd32:/data
Brick14: rd33:/data
Brick15: rd25:/data
Brick16: rd35:/data
Options Reconfigured:
nfs.port: 2049
cluster.min-free-disk: 5%
network.ping-timeout: 24
nfs.export-volumes: on
nfs.export-dir: /data
nfs.disable: off
performance.stat-prefetch: off

###### CLIENT (hc10):

[2012-11-26 22:00:10.489854] C [client-handshake.c:121:rpc_client_ping_timer_expired] 0-data-client-2: server 192.168.16.138:24009 has not responded in the last 24 seconds, disconnecting.
[2012-11-26 22:00:10.656750] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x6d) [0x7f0a06a8e50d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7f0a06a8e1dd] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f0a06a8e13e]))) 0-data-client-2: forced unwinding frame type(GlusterFS 3.1) op(RELEASE(41)) called at 2012-11-26 21:58:47.398947
[2012-11-26 22:00:10.656828] E [rpc-clnt.c:341:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x6d) [0x7f0a06a8e50d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) [0x7f0a06a8e1dd] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f0a06a8e13e]))) 0-data-client-2: forced unwinding frame type(GlusterFS 3.1) op(WRITE(13)) called at 2012-11-26 21:58:47.399011
[2012-11-26 22:00:10.656844] I [client3_1-fops.c:683:client3_1_writev_cbk] 0-data-client-2: remote operation failed: Transport endpoint is not connected
[2012-11-26 22:00:10.657005] W [fuse-bridge.c:1828:fuse_writev_cbk] 0-glusterfs-fuse: 1634988442: WRITE => -1 (Transport endpoint is not connected)
[2012-11-26 22:00:10.710760] I [socket.c:2275:socket_submit_request] 0-data-client-2: not connected (priv->connected = 0)
[2012-11-26 22:00:10.710800] W [rpc-clnt.c:1417:rpc_clnt_submit] 0-data-client-2: failed to submit rpc-request (XID: 0x21514768x Program: GlusterFS 3.1, ProgVers: 310, Proc: 13) to rpc-transport (data-client-2)
[2012-11-26 22:00:10.727969] I [client3_1-fops.c:683:client3_1_writev_cbk] 0-data-client-2: remote operation failed: Transport endpoint is not connected
[2012-11-26 22:00:10.727991] W [client3_1-fops.c:3622:client3_1_writev] 0-data-client-2: failed to send the fop: Stale NFS file handle

pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
<snip><snap>
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-11-26 22:00:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.2.5
/lib64/libc.so.6(+0x35b80)[0x7f0a060f0b80]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_sync_cbk+0x30)[0x7f0a02fa3550]
/usr/lib/glusterfs/3.2.5/xlator/cluster/distribute.so(dht_writev_cbk+0xd3)[0x7f0a031bbd63]
/usr/lib/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_writev+0x12a)[0x7f0a034054ca]
/usr/lib/glusterfs/3.2.5/xlator/protocol/client.so(client_writev+0xa1)[0x7f0a033e9921]
/usr/lib/glusterfs/3.2.5/xlator/cluster/distribute.so(dht_writev+0x162)[0x7f0a031c0c42]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_sync+0x569)[0x7f0a02f9c929]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_do_ops+0x53)[0x7f0a02fa0953]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_process_queue+0xf2)[0x7f0a02f9dcc2]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_sync_cbk+0xf7)[0x7f0a02fa3617]
/usr/lib/glusterfs/3.2.5/xlator/cluster/distribute.so(dht_writev_cbk+0xd3)[0x7f0a031bbd63]
/usr/lib/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_writev_cbk+0x507)[0x7f0a03401797]
/usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1ca)[0x7f0a06a8e0ba]
/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f0a06a8e13e]
/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d)[0x7f0a06a8e1dd]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x6d)[0x7f0a06a8e50d]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f0a06a8aaa8]
/usr/lib/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_poll_err+0x54)[0x7f0a04447004]
/usr/lib/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_handler+0x138)[0x7f0a0444ca38]
/usr/lib64/libglusterfs.so.0(+0x3ed4e)[0x7f0a06cd4d4e]
/usr/sbin/glusterfs(main+0x2a9)[0x406689]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0a060dd09d]
/usr/sbin/glusterfs[0x403b89]

##### BRICK (data-client-2):

[2012-11-26 22:00:17.19582] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.data-server: reading from socket failed. Error (Transport endpoint is not connected), peer (192.168.16.167:1022)
[2012-11-26 22:00:17.60962] I [server.c:438:server_rpc_notify] 0-data-server: disconnected connection from 192.168.16.167:1022
[2012-11-26 22:00:32.950134] W [socket.c:204:__socket_rwv] 0-tcp.data-server: readv failed (Connection reset by peer)
[2012-11-26 22:00:32.950178] W [socket.c:775:__socket_read_simple_msg] 0-tcp.data-server: reading from socket failed. Error (Connection reset by peer), peer (192.168.16.167:999)
[2012-11-26 22:00:32.962622] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/20120910/TREW/TREW-Band-15-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120910_00002.tar
[2012-11-26 22:00:32.962642] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/download_logs/wget_Date_20120911_Band_13.log
[2012-11-26 22:00:32.962792] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/20120910/TREW/TREW-Band-05-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120910_00002.tar
[2012-11-26 22:00:32.962809] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/20120909/TREW/TREW-Band-09-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120909_00004.tar
[2012-11-26 22:00:32.962893] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/download_logs/wget_Date_20120908_Band_07.log
[2012-11-26 22:00:32.962907] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/download_logs/wget_Date_20120908_Band_02.log
[2012-11-26 22:00:32.962945] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /HGK/L2/inputdata/sacspecTOTL08_758.dat
[2012-11-26 22:00:32.962970] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/20120908/TREW/TREW-Band-04-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120908_00002.tar
[2012-11-26 22:00:32.962985] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/20120908/TREW/TREW-Band-06-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120908_00004.tar
[2012-11-26 22:00:32.963011] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/20120907/TREW/TREW-Band-06-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120907_00004.tar
[2012-11-26 22:00:32.971352] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /HGK/L2/inputdata/sacspecTOTL08_758.dat
[2012-11-26 22:00:32.971367] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/download_logs/wget_Date_20120909_Band_07.log
[2012-11-26 22:00:32.971393] I [server.c:438:server_rpc_notify] 0-data-server: disconnected connection from 192.168.16.167:999
[2012-11-26 22:00:32.971418] I [server-helpers.c:783:server_connection_destroy] 0-data-server: destroyed connection of hc10-17275-2012/11/24-20:57:47:169834-data-client-2

##### BRICK (data-client-11):

[2012-11-26 22:00:17.22727] W [socket.c:204:__socket_rwv] 0-tcp.data-server: readv failed (Connection reset by peer)
[2012-11-26 22:00:17.56177] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.data-server: reading from socket failed. Error (Connection reset by peer), peer (192.168.16.167:1000)
[2012-11-26 22:00:17.66354] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/download_logs/wget_Date_20120911_Band_06.log
[2012-11-26 22:00:17.66377] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/20120909/TREW/TREW-Band-04-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120909_00002.tar
[2012-11-26 22:00:17.66496] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/20120911/TREW/TREW-Band-06-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120911_00004.tar
[2012-11-26 22:00:17.66528] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/20120910/TREW/TREW-Band-06-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120910_00004.tar
[2012-11-26 22:00:17.66542] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/download_logs/wget_Date_20120910_Band_14.log
[2012-11-26 22:00:17.66558] I [server-helpers.c:485:do_fd_cleanup] 0-data-server: fd cleanup on /TREW/download_logs/wget_Date_20120908_Band_10.log
[2012-11-26 22:00:17.66574] I [server.c:438:server_rpc_notify] 0-data-server: disconnected connection from 192.168.16.167:1000
[2012-11-26 22:00:17.66973] I [server-helpers.c:783:server_connection_destroy] 0-data-server: destroyed connection of hc10-17275-2012/11/24-20:57:47:169834-data-client-11
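
To line up the disconnects across the client and brick logs, a small script along the following lines might help. It is only a sketch: the regular expression is inferred from the log lines quoted above (one entry per line), and the function name `disconnect_events` and the keyword list are illustrative choices, not part of GlusterFS. The fractional part of the timestamp varies in width in these logs, so the sketch ignores it and works at one-second resolution.

```python
import re
from datetime import datetime

# Header of a log entry as it appears in the logs above, e.g.
# [2012-11-26 22:00:17.60962] I [server.c:438:server_rpc_notify] 0-data-server: message
# (pattern inferred from those lines; an assumption, not an official format spec)
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"\[(?P<src>[^\]]+)\]\s+"
    r"(?P<xlator>\S+):\s+(?P<msg>.*)$"
)

# Message fragments taken from the logs above that mark a dropped connection.
KEYWORDS = (
    "has not responded",
    "disconnected connection",
    "destroyed connection",
    "Connection reset by peer",
)


def disconnect_events(lines):
    """Return (timestamp, level, xlator, message) for disconnect-related entries."""
    events = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # backtrace lines, "pending frames", etc. are skipped
        msg = match.group("msg")
        if any(keyword in msg for keyword in KEYWORDS):
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, match.group("level"), match.group("xlator"), msg))
    return events


if __name__ == "__main__":
    import sys

    # Usage: python disconnects.py client.log brick1.log brick2.log
    collected = []
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            collected += [(event, path) for event in disconnect_events(handle)]
    # Sort by timestamp so client and brick entries interleave chronologically.
    for (ts, level, xlator, msg), path in sorted(collected, key=lambda item: item[0][0]):
        print(f"{ts} {level} {path} {xlator}: {msg}")
```

Run against the client log and both brick logs, this prints the ping timeout on the client next to the "disconnected connection" and "destroyed connection" entries on the bricks, which makes the ordering of events easier to see.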