Read/write operations hang for a long period of time (too long). I've seen them stay in that waiting state for something like 5 minutes, which makes every application that tries to read or write fail. These are the errors I found in the logs on server A, which is still accessible (B was down):

etc-glusterfs-glusterd.vol.log

...
[2014-01-31 07:56:49.780247] W [socket.c:1512:__socket_proto_state_machine] 0-management: reading from socket failed. Error (Connection timed out), peer (<SERVER_B_IP>:24007)
[2014-01-31 07:58:25.965783] E [socket.c:1715:socket_connect_finish] 0-management: connection to <SERVER_B_IP>:24007 failed (No route to host)
[2014-01-31 08:59:33.923250] I [glusterd-handshake.c:397:glusterd_set_clnt_mgmt_program] 0-: Using Program glusterd mgmt, Num (1238433), Version (2)
[2014-01-31 08:59:33.923289] I [glusterd-handshake.c:403:glusterd_set_clnt_mgmt_program] 0-: Using Program Peer mgmt, Num (1238437), Version (2)
...

glustershd.log

[2014-01-27 12:07:03.644849] W [socket.c:1512:__socket_proto_state_machine] 0-teoswitch_custom_music-client-1: reading from socket failed. Error (Connection timed out), peer (<SERVER_B_IP>:24010)
[2014-01-27 12:07:03.644888] I [client.c:2090:client_rpc_notify] 0-teoswitch_custom_music-client-1: disconnected
[2014-01-27 12:09:35.553628] E [socket.c:1715:socket_connect_finish] 0-teoswitch_greetings-client-1: connection to <SERVER_B_IP>:24011 failed (Connection timed out)
[2014-01-27 12:10:13.588148] E [socket.c:1715:socket_connect_finish] 0-license_path-client-1: connection to <SERVER_B_IP>:24013 failed (Connection timed out)
[2014-01-27 12:10:15.593699] E [socket.c:1715:socket_connect_finish] 0-upload_path-client-1: connection to <SERVER_B_IP>:24009 failed (Connection timed out)
[2014-01-27 12:10:21.601670] E [socket.c:1715:socket_connect_finish] 0-teoswitch_ivr_greetings-client-1: connection to <SERVER_B_IP>:24012 failed (Connection timed out)
[2014-01-27 12:10:23.607312] E [socket.c:1715:socket_connect_finish] 0-teoswitch_custom_music-client-1: connection to <SERVER_B_IP>:24010 failed (Connection timed out)
[2014-01-27 12:11:21.866604] E [afr-self-heald.c:418:_crawl_proceed] 0-teoswitch_ivr_greetings-replicate-0: Stopping crawl as < 2 children are up
[2014-01-27 12:11:21.867874] E [afr-self-heald.c:418:_crawl_proceed] 0-teoswitch_greetings-replicate-0: Stopping crawl as < 2 children are up
[2014-01-27 12:11:21.868134] E [afr-self-heald.c:418:_crawl_proceed] 0-teoswitch_custom_music-replicate-0: Stopping crawl as < 2 children are up
[2014-01-27 12:11:21.869417] E [afr-self-heald.c:418:_crawl_proceed] 0-license_path-replicate-0: Stopping crawl as < 2 children are up
[2014-01-27 12:11:21.869659] E [afr-self-heald.c:418:_crawl_proceed] 0-upload_path-replicate-0: Stopping crawl as < 2 children are up
[2014-01-27 12:12:53.948154] I [client-handshake.c:1636:select_server_supported_programs] 0-teoswitch_greetings-client-1: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
[2014-01-27 12:12:53.952894] I [client-handshake.c:1433:client_setvolume_cbk] 0-teoswitch_greetings-client-1: Connected to <SERVER_B_IP>:24011, attached to remote volume

nfs.log

There are lots of errors, but the one that repeats most often is this:

[2014-01-27 12:12:27.136033] E [socket.c:1715:socket_connect_finish] 0-teoswitch_custom_music-client-1: connection to <SERVER_B_IP>:24010 failed (Connection timed out)

Any ideas? From the logs I see nothing except confirmation that A cannot reach B, which makes sense since B is down. But A is not down, and its volumes should still be accessible. Right?
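In case it is useful, this is roughly what I can run on server A while B is down (just a sketch; I am using upload_path as the example volume from my original mail, and the ping-timeout value below is only an illustration, not something I have actually tuned):

    # From server A: how glusterd sees the peer and the bricks
    gluster peer status
    gluster volume status upload_path
    gluster volume info upload_path

    # Entries pending self-heal on the replicate volume
    gluster volume heal upload_path info

    # Client-side timeout for a dead brick (as far as I understand,
    # the default is 42 seconds); 10 is just an example value
    gluster volume set upload_path network.ping-timeout 10

My understanding (please correct me if I am wrong) is that with a 1 x 2 replica the client on A should keep serving from its local brick once it marks the brick on B as down, so the 5-minute hangs are what surprise me.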
Regards,

Marco Zanger
Phone 54 11 5299-5400 (int. 5501)
Clay 2954, C1426DLD, Buenos Aires, Argentina
Think Green - Please do not print this email unless you really need to

-----Original Message-----
From: Vijay Bellur [mailto:vbellur@xxxxxxxxxx]
Sent: Monday, 17 February 2014 01:21 p.m.
To: Marco Zanger; gluster-users@xxxxxxxxxxx
Subject: Re: Node down and volumes unreachable

On 02/13/2014 08:06 PM, Marco Zanger wrote:
> Hi all,
>
> I'm experiencing a strange issue related to both distribute and
> replicate volumes. The problem is this:
>
> I have two servers, A and B. Both share some replicate volumes and
> distribute volumes, like this:
>
> Volume Name: upload_path
> Type: Replicate
> Volume ID: 15ca11e2-206e-414d-8299-3ae20c54bd8a
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: <IP-A>:<some_path>/upload_path
> Brick2: <IP-B>:<some_path>/upload_path
>
> Each server mounts the volume from itself, like this on server A:
>
> glusterfs#<IP_A>:upload_path on <some_path>/upload_path type fuse
> (rw,default_permissions,allow_other,max_read=131072)
>
> I've used both glusterfs and nfs for my tests, but when server B is
> down (unreachable from A) we cannot access (neither read nor write) the
> volumes within A.

By inaccessible state, do you refer to read/write operations hanging or erroring out? Does it stay forever in this inaccessible state?

If you check your client log files around the time server B is unreachable from A, there might be some clues around this behavior.

-Vijay

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users