Hi all,

In testing the HA translator under 2.0.0rc1, I've managed to create a simple and reproducible scenario in which Gluster fails to maintain communication between the client and the server(s).

Server01 and Server02 are AFR'ing each other, with Client01 connected via the HA translator. As a simple test, I launch a script that echoes an increasing counter to a text file in the Gluster mount on Client01. Client01 is communicating with Server01 in this instance.

I cleanly stop glusterfsd on Server01, and after a momentary hiccup (noted in the log excerpt below), things continue to function as expected - Client01 commences communication with Server02. So far so good.

2009-01-15 15:54:19 E [socket.c:708:socket_connect_finish] export01: connection failed (Connection refused)

I re-start glusterfsd on Server01, then I cleanly stop glusterfsd on Server02 (which, of course, Client01 is now communicating with). Client01 freaks out (see log excerpt below), does /not/ attempt to contact Server01 again, and leaves me with the dreaded "transport endpoint not connected" situation.

2009-01-15 16:06:02 E [ha-helpers.c:266:_ha_next_active_child_for_ctx] export-ha: none of the children are connected other than export02
2009-01-15 16:06:02 E [ha.c:2715:ha_fstat_cbk] export-ha: no active subvolume
2009-01-15 16:06:02 E [fuse-bridge.c:533:fuse_attr_cbk] glusterfs-fuse: 2932: FSTAT() /counter.txt => -1 (Transport endpoint is not connected)

Client01 sometimes recovers from this, and sometimes it does not. When it does not recover from this situation, the only solution is manual intervention (unmount / remount).

That's not the worst of it, though: when it /does/ recover, re-starting glusterfsd on Server02 (!) causes even more of the errors (see below), and /always/ results in a total failure on Client01 within a second or two (transport endpoint not connected). Client01 never recovers from this.

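For anyone wanting to reproduce this, a counter test of the kind described above could look like the following sketch. It is not the exact script used: the function name write_counter, its arguments, the append behaviour, and the mount path in the usage comment are all assumptions.

```shell
#!/bin/sh
# Sketch of a counter test: append an increasing counter to a file.
# write_counter is a hypothetical helper; its arguments are assumptions.
write_counter() {
    file=$1      # target file, e.g. a file on the Gluster mount
    count=$2     # how many values to write
    delay=$3     # seconds to sleep between writes
    i=1
    while [ "$i" -le "$count" ]; do
        # Each echo opens, appends to, and closes the file,
        # so a failed write is reported right away on stderr.
        echo "$i" >> "$file" || echo "write $i failed" >&2
        i=$((i + 1))
        sleep "$delay"
    done
}

# Example invocation (the mount point is an assumption):
# write_counter /mnt/glusterfs/counter.txt 100000 1
```

Writing one value per second makes it easy to line up gaps in the counter with the timestamps in the client log.
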
2009-01-15 19:04:56 E [ha-helpers.c:266:_ha_next_active_child_for_ctx] export-ha: none of the children are connected other than export01
2009-01-15 19:04:56 E [ha.c:2515:ha_flush_cbk] export-ha: no active subvolume
2009-01-15 19:04:56 E [fuse-bridge.c:911:fuse_err_cbk] glusterfs-fuse: 3058: FLUSH() ERR => -1 (Transport endpoint is not connected)

I strongly suspect this is not the expected behaviour of the High Availability translator. :)

Servers are running FC9 i386, Client is FC10 i386.

# glusterfs --version
glusterfs 2.0.0rc1 built on Jan 14 2009 13:19:06
Repository revision: glusterfs--mainline--3.0--patch-844

# rpm -qa | grep fuse
fuse-2.7.3glfs10-1.i386
fuse-devel-2.7.3glfs10-1.i386
fuse-libs-2.7.3glfs10-1.i386

Server config :

# cat /etc/glusterfs/glusterfs-server.vol
# dataspace
volume test-ds
  type storage/posix
  option directory /opt/datadir
end-volume

# posix locks for test-ds
volume test-ds-locks
  type features/locks
  option mandatory-locks on
  subvolumes test-ds
end-volume

# dataspace of test-ds on Server01
volume test-01-ds
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.183
  option remote-subvolume test-ds-locks
  option transport-timeout 10
end-volume

# automatic file replication translator for test dataspace
volume test-ds-afr
  type cluster/afr
  subvolumes test-ds-locks test-01-ds
end-volume

# the actual export
volume export
  type performance/io-threads
  option thread-count 8
  subvolumes test-ds-afr
end-volume

# server declaration
volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes export
  option auth.addr.export.allow 192.168.0.73,192.168.0.183,192.168.0.166,127.0.0.1
  option auth.addr.test-ds-locks.allow 192.168.0.73,192.168.0.183,192.168.0.166,127.0.0.1
end-volume

Client config :

# cat /etc/glusterfs/glusterfs-client.vol
# export on Server01
volume export01
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.183
  option remote-subvolume export # exported volume
end-volume

# export on Server02
volume export02
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.166
  option remote-subvolume export # exported volume
end-volume

# exports clustered via HA
volume export-ha
  type cluster/ha
  subvolumes export01 export02
end-volume

-- 
Daniel Maher <dma+gluster AT witbe DOT net>