HA problems

chawkins at bplinux.com (Christopher Hawkins) · Fri, 7 May 2010 12:47:09 -0400 (EDT)

Hello, I have a problem that had previously been solved. In a simple setup with two servers and one client, I had the client connect to a virtual IP that could fail back and forth to whichever server was available. This used to work, but I had not tested it since 2.0.9 until today. Now, instead of recovering after a brief timeout, the client never recovers and logs endless "Stale NFS file handle" errors (though no NFS is involved, just the native gluster client).

So I tried the HA translator from testing. This also does not work. After I kill the primary server (listed first in the config file), an ls of the mount point hangs for a moment and then reports:

[root@server2 glusterfs]# ls /mnt/test
ls: /mnt/test: Input/output error

Each attempted ls produces two errors in the client log as well, a "Transport endpoint is not connected" error followed by the "Input/output error". 

The client log shows this:

[2010-05-07 12:03:44] N [glusterfsd.c:1408:main] glusterfs: Successfully started
[2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master2_root: Connected to 192.168.1.92:3399, attached to remote volume 'threads1'.
[2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.
[2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master2_root: Connected to 192.168.1.92:3399, attached to remote volume 'threads1'.
[2010-05-07 12:03:44] N [fuse-bridge.c:2950:fuse_init] glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.10
[2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.

[....here I killed the primary server....]

[2010-05-07 12:06:17] E [client-protocol.c:415:client_ping_timer_expired] master_root: Server 192.168.1.91:3399 has not responded in the last 42 seconds, disconnecting.
[2010-05-07 12:06:17] E [saved-frames.c:165:saved_frames_unwind] master_root: forced unwinding frame type(1) op(LOOKUP)
[2010-05-07 12:06:17] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
[2010-05-07 12:06:17] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 10: LOOKUP() / => -1 (Input/output error)
[2010-05-07 12:06:17] E [saved-frames.c:165:saved_frames_unwind] master_root: forced unwinding frame type(2) op(PING)
[2010-05-07 12:06:17] N [client-protocol.c:6994:notify] master_root: disconnected
[2010-05-07 12:06:17] E [socket.c:762:socket_connect_finish] master_root: connection to 192.168.1.91:3399 failed (No route to host)
[2010-05-07 12:06:21] E [socket.c:762:socket_connect_finish] master_root: connection to 192.168.1.91:3399 failed (No route to host)
[2010-05-07 12:06:21] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
[2010-05-07 12:06:21] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 11: LOOKUP() / => -1 (Input/output error)
[2010-05-07 12:06:24] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
[2010-05-07 12:06:24] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 12: LOOKUP() / => -1 (Input/output error)
[2010-05-07 12:06:26] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
[2010-05-07 12:06:26] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 13: LOOKUP() / => -1 (Input/output error)
[2010-05-07 12:06:39] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
[2010-05-07 12:06:39] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 14: LOOKUP() / => -1 (Input/output error)

[.... here I powered the primary server back on....] 

[2010-05-07 12:07:07] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.
[2010-05-07 12:07:07] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.
--------- end log ------------

Once the primary server came back, the client recovered and everything picked up again. But it seems I cannot get the client to consider any server other than the first one it connects to. I assume that if failing the primary server's IP address over to another box doesn't work, then round-robin DNS will not work either, since the two are essentially the same method (a different server answering at the same address). And since this used to work, it looks like an unintended result.
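To check whether the ha translator ever tries the second child, one option is to tally the log events per subvolume. Below is a minimal sketch in Python. The regular expressions assume the line layout seen in the excerpt above ([timestamp] LEVEL [file:line:function] name: message), and the function name summarize_log is made up for illustration; it is not part of GlusterFS.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the log excerpt:
# [timestamp] LEVEL [file.c:line:function] name: message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"\[(?P<src>[^\]]+)\]\s+(?P<name>[^:]+):\s+(?P<msg>.*)$"
)
# The ha translator names the failing child as "(child=NAME)".
CHILD_RE = re.compile(r"\(child=(?P<child>[^)]+)\)")


def summarize_log(lines):
    """Count connects, disconnects and ha errors per subvolume name."""
    stats = defaultdict(
        lambda: {"connected": 0, "disconnected": 0, "ha_errors": 0}
    )
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # blank lines and hand-written notes do not match
        name, msg = m.group("name"), m.group("msg")
        if msg.startswith("Connected to"):
            stats[name]["connected"] += 1
        elif msg == "disconnected":
            stats[name]["disconnected"] += 1
        elif name == "ha":
            child = CHILD_RE.search(msg)
            if child:
                # Attribute the error to the child that ha reported.
                stats[child.group("child")]["ha_errors"] += 1
    return dict(stats)
```

Passing an open client log file to summarize_log returns one dictionary per subvolume. Run over the excerpt above, it counts five ha errors against master_root and none against master2_root, which matches the impression that only the first child is ever used.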

The server vol file has a single export and io-threads, and the client has just the two remote-subvolumes and the ha declaration like so:

volume ha
   type cluster/ha
   subvolumes master_root master2_root
end-volume 

The code base is GlusterFS version 3.0.4, compiled from source this morning. How can I troubleshoot this?
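One basic check while the primary is down is whether the second server still accepts TCP connections on the export port, which would separate a network problem from a translator problem. A small sketch in Python follows. The helper name can_connect is made up for illustration, and it only tests that a TCP connection can be opened; it does not speak the GlusterFS protocol.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the host and connects; the with-block
        # closes the socket again as soon as the check is done.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and "No route to host".
        return False
```

For the setup in this post the calls would be can_connect("192.168.1.91", 3399) and can_connect("192.168.1.92", 3399), using the addresses and port from the client log.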

Christopher Hawkins