HA translator failure

Hi all,

In testing the HA translator under 2.0.0rc1, I've managed to create a 
simple and reproducible scenario in which Gluster fails to maintain 
communication between the client and the server(s).

Server01 and Server02 are AFR'ing each other, with Client01 connected 
via the HA translator.  As a simple test, I launch a script that echoes 
an increasing counter to a text file in the Gluster mount on Client01. 
Client01 is communicating with Server01 in this instance.
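
The test script is not included in this report.  As a rough illustration 
only, a counter writer along these lines would exercise the mount in the 
same way.  It is a Python sketch rather than the original script, and the 
function name write_counter and its parameters are assumptions made for 
the example.

```python
import time


def write_counter(path, count, delay=0.0):
    """Append the integers 1..count to path, one per line.

    Returns a list of (value, errno) pairs for the writes that failed,
    so a failover hiccup is recorded instead of aborting the run.
    """
    failures = []
    for value in range(1, count + 1):
        try:
            # Open, write and close on every iteration, roughly what
            # `echo $i >> file` inside a shell loop would do.
            with open(path, "a") as handle:
                handle.write(f"{value}\n")
        except OSError as exc:
            failures.append((value, exc.errno))
        if delay:
            time.sleep(delay)
    return failures
```

Calling write_counter("/mnt/gluster/counter.txt", 1000, delay=1.0) would 
write one value per second and return the values whose writes failed.  The 
mount point in that call is hypothetical.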

I cleanly stop glusterfsd on Server01, and after a momentary hiccup 
(noted in the log excerpt below), things continue to function as 
expected - Client01 commences communication with Server02.  So far so good.

2009-01-15 15:54:19 E [socket.c:708:socket_connect_finish] export01: connection failed (Connection refused)

I re-start glusterfsd on Server01, then cleanly stop glusterfsd on 
Server02 (which, of course, Client01 is now communicating with). 
Client01 freaks out (see log excerpt below), does /not/ attempt to 
contact Server01 again, and leaves me with the dreaded "transport 
endpoint not connected" situation.

2009-01-15 16:06:02 E [ha-helpers.c:266:_ha_next_active_child_for_ctx] export-ha: none of the children are connected other than export02
2009-01-15 16:06:02 E [ha.c:2715:ha_fstat_cbk] export-ha: no active subvolume
2009-01-15 16:06:02 E [fuse-bridge.c:533:fuse_attr_cbk] glusterfs-fuse: 2932: FSTAT() /counter.txt => -1 (Transport endpoint is not connected)

Client01 sometimes recovers from this, and sometimes it does not.  When 
it does not recover, the only solution is manual intervention (unmount / 
remount).  That's not the worst of it, though: when it /does/ recover, 
re-starting glusterfsd on Server02 (!) causes even more of the errors 
(see below), and /always/ results in a total failure on Client01 within 
a second or two (transport endpoint not connected).  Client01 never 
recovers from this.

2009-01-15 19:04:56 E [ha-helpers.c:266:_ha_next_active_child_for_ctx] export-ha: none of the children are connected other than export01
2009-01-15 19:04:56 E [ha.c:2515:ha_flush_cbk] export-ha: no active subvolume
2009-01-15 19:04:56 E [fuse-bridge.c:911:fuse_err_cbk] glusterfs-fuse: 3058: FLUSH() ERR => -1 (Transport endpoint is not connected)
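
The failure in these excerpts reaches applications as errno ENOTCONN 
("Transport endpoint is not connected").  A minimal Python sketch for 
detecting that state from a monitoring script follows.  The helper name 
mount_is_disconnected is hypothetical and not part of GlusterFS, and the 
function only reports the condition; it does not repair the mount.

```python
import errno
import os


def mount_is_disconnected(path):
    """Return True if stat() on path fails with ENOTCONN.

    Any other outcome, including other errors such as ENOENT,
    is reported as False.
    """
    try:
        os.stat(path)
    except OSError as exc:
        return exc.errno == errno.ENOTCONN
    return False
```

A watchdog could poll this against the mount point and flag the client 
for the manual unmount / remount described above.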


I strongly suspect this is not the expected behaviour of the High 
Availability translator. :)


Servers are running FC9 i386, Client is FC10 i386.

# glusterfs --version
glusterfs 2.0.0rc1 built on Jan 14 2009 13:19:06
Repository revision: glusterfs--mainline--3.0--patch-844

# rpm -qa | grep fuse
fuse-2.7.3glfs10-1.i386
fuse-devel-2.7.3glfs10-1.i386
fuse-libs-2.7.3glfs10-1.i386


Server config:

# cat /etc/glusterfs/glusterfs-server.vol
# dataspace
volume test-ds
   type storage/posix
   option directory /opt/datadir
end-volume

# posix locks for test-ds
volume test-ds-locks
   type features/locks
   option mandatory-locks on
   subvolumes test-ds
end-volume

# dataspace of test-ds on Server01
volume test-01-ds
   type protocol/client
   option transport-type tcp/client
   option remote-host 192.168.0.183
   option remote-subvolume test-ds-locks
   option transport-timeout 10
end-volume

# automatic file replication translator for test dataspace
volume test-ds-afr
   type cluster/afr
   subvolumes test-ds-locks test-01-ds
end-volume

# the actual export
volume export
   type performance/io-threads
   option thread-count 8
   subvolumes test-ds-afr
end-volume

# server declaration
volume server
   type protocol/server
   option transport-type tcp/server
   subvolumes export
   option auth.addr.export.allow 192.168.0.73,192.168.0.183,192.168.0.166,127.0.0.1
   option auth.addr.test-ds-locks.allow 192.168.0.73,192.168.0.183,192.168.0.166,127.0.0.1
end-volume



Client config:
# cat /etc/glusterfs/glusterfs-client.vol

# export on Server01
volume export01
   type protocol/client
   option transport-type tcp/client
   option remote-host 192.168.0.183
   option remote-subvolume export      # exported volume
end-volume

# export on Server02
volume export02
   type protocol/client
   option transport-type tcp/client
   option remote-host 192.168.0.166
   option remote-subvolume export      # exported volume
end-volume

# exports clustered via HA
volume export-ha
   type cluster/ha
   subvolumes export01 export02
end-volume



-- 
Daniel Maher <dma+gluster AT witbe DOT net>


