Hi devs et al,

After many hours of sublimation, I was able to condense my previous hanging issue down to this simplest case.

To summarize: I have two physical machines, each afr'ing a directory to the other. Both run glusterfs(d) 1.3.1 with glfs3 fuse, and iptables is disabled during these tests. Spec files are below.

The four situations:

1) If I start both machines and glusterfsd on both, I can mount either one from the other and view its files as expected.

2) If I start only one machine and its glusterfsd, I can mount that glusterfsd brick from the same machine and use it (i.e. edit the files) while it tries to connect to the second machine in the background. When I bring up the second machine, it connects and afrs as expected. Compare this with #4.

3) If I start both machines and glusterfsd on both, mount each other's bricks, verify I can see the files, and then kill glusterfsd on one of them, I can still use and view files on the other while it tries to reconnect in the background to the glusterfsd that was killed. When it comes back up, everything continues as expected.

4) But if I start both machines with glusterfsd on both, mount either brick, view the files, and then bring down the other machine (i.e. not kill glusterfsd, but bring down the whole machine suddenly, or pull the ethernet cable), I can no longer see any files on the remaining machine. It just hangs until the machine that went down comes back up, and then it continues on its merry way.

This is presumably not the expected behavior, since it is not the behavior in 2) and 3). It is only after both machines have started up and then one of them goes away that I see this problem. Obviously, however, this is exactly the situation that calls for an HA setup in the real world: when one server goes offline suddenly, you want to be able to keep using the other.

Here is the simplest spec file configuration that exhibits this problem.

Server configuration (identical on both servers except for the address):

  volume brick-ds
    type storage/posix
    option directory /.brick-ds
  end-volume

  volume brick-ds-afr
    type storage/posix
    option directory /.brick-ds-afr
  end-volume

  volume server
    type protocol/server
    option transport-type tcp/server
    option bind-address 192.168.16.128        # 192.168.16.1 on the other server
    subvolumes brick-ds brick-ds-afr
    option auth.ip.brick-ds.allow 192.168.16.*
    option auth.ip.brick-ds-afr.allow 192.168.16.*
  end-volume

Client configuration:

  volume brick-ds-local
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.16.128         # 192.168.16.1 on the other machine
    option remote-subvolume brick-ds
  end-volume

  volume brick-ds-remote
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.16.1           # 192.168.16.128 on the other machine
    option remote-subvolume brick-ds-afr
  end-volume

  volume brick-ds-afr
    type cluster/afr
    subvolumes brick-ds-local brick-ds-remote
    option replicate *:2
  end-volume

These are both stock CentOS/RHEL 5 machines. You can demonstrate the behavior by rebooting one machine, pulling out its ethernet cable, or sending the route out into space (e.g. route add -host 192.168.16.1 dev some_disconnected_device). Everything is frozen until the connection returns, and once it comes back up, things keep working again.

Because of this problem, any kind of HA / unify setup will not work for me when one of the nodes fails. Can someone else verify this behavior? If there is some part of the logs / strace / gdb output you'd like to see, just let me know.
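For concreteness, here is roughly the sequence I use to reproduce case #4. The spec file paths, the mount point, and the dummy0 device name are just placeholders from my setup, so adjust them to yours:

  # On each machine, start the server side with its own server spec file:
  glusterfsd -f /etc/glusterfs/glusterfs-server.vol

  # On the machine that will stay up, mount the AFR volume via the client spec:
  glusterfs -f /etc/glusterfs/glusterfs-client.vol /mnt/glusterfs
  ls -l /mnt/glusterfs          # files from both bricks are visible

  # Make the other machine (192.168.16.1) unreachable *without* cleanly killing
  # its glusterfsd -- e.g. black-hole the route to it (or pull its cable):
  route add -host 192.168.16.1 dev dummy0   # dummy0 stands in for any disconnected device

  # Any access to the mount now hangs instead of carrying on with the local brick:
  ls -l /mnt/glusterfs          # hangs here

  # Restore the route and the mount picks up where it left off:
  route del -host 192.168.16.1 dev dummy0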
I'd really like to use glusterfs in an HA setup, but don't see how with this behavior. Thanks in advance!!

:august

On 9/7/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
>
> Hi all -
>
> I have a setup based on this:
> http://www.gluster.org/docs/index.php/GlusterFS_High_Availability_Storage_with_GlusterFS
> but with only 2 machines. Effectively just a mirror (glusterfsd
> configuration below). 1.3.1 client and server.
>