Hi all -

After combing through the archives, I found the transport-timeout option
mentioned by avati. Is this described in the wiki docs anywhere? I thought I
had read through every page, but I don't recall seeing it. The e-mail from
avati mentioned that it was described in "doc/translator-options.txt", but
this file does not appear in my glusterfs-1.3.1 tarball.

In any case, for those who have similar issues, making the transport timeout
much smaller is your friend :-)
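For anyone else who finds this thread: since I never found the option
documented, here is a minimal sketch of where it seems to belong, namely in
each protocol/client volume of the client spec (the volume shown matches the
client configuration quoted below; the value 10 is only an example, and the
placement and units are my assumption, not taken from any docs):

volume brick-ds-remote
type protocol/client
option transport-type tcp/client
option remote-host 192.168.16.1
option remote-subvolume brick-ds-afr
option transport-timeout 10 # assumed to be seconds; example value only
end-volume

With a short timeout, the freeze in case 4) below should last only until the
timeout fires, rather than blocking until the dead machine returns.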
Many Thanks!!
:august

On 9/10/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
> Hi devs et al,
>
> After many hours of sublimation, I was able to condense my previous
> hanging issue down to this simplest case.
>
> To summarize: I have two physical machines, each afr'ing a directory to
> the other. Both are glusterfs(d) 1.3.1 with glfs3 fuse. iptables is
> suspended during these tests. Spec files are below.
>
> The four situations:
>
> 1) If I start up both machines and start glusterfsd on both, I can mount
> either one from the other and view its files as expected.
>
> 2) If I start up only one machine and its glusterfsd, I can mount that
> glusterfsd brick from the same machine and use it (i.e. edit the files)
> while it tries to connect to the 2nd machine in the background. When I
> bring up the 2nd machine, it connects and afrs as expected. Compare this
> to #4.
>
> 3) If I start up both machines and glusterfsd on both, mount each other's
> bricks, verify I can see the files, and then kill glusterfsd on one of
> them, I can still use and view files on the other one while it tries to
> reconnect in the background to the glusterfsd that was killed. When it
> comes back up, everything continues as expected.
>
> 4) But if I start up both machines with glusterfsd on both, mount either
> brick, view the files, and then bring down the other machine (i.e. not
> kill glusterfsd, but bring down the whole machine suddenly, or pull the
> ethernet cable), I can no longer see any files on the remaining machine.
> It just hangs until the machine that went down comes back up, and then it
> continues on its merry way.
>
> This is presumably not the expected behavior, since it is not the
> behavior in 2) and 3). It is only after both machines have started up and
> one of them then goes away that I see this problem. Obviously, however,
> this is the very situation that calls for an HA setup in the real world:
> when one server goes offline suddenly, you want to be able to keep on
> using the one that remains.
>
> Here is the simplest spec file configuration that exhibits this problem:
>
> Simple server configuration:
>
> volume brick-ds
> type storage/posix
> option directory /.brick-ds
> end-volume
>
> volume brick-ds-afr
> type storage/posix
> option directory /.brick-ds-afr
> end-volume
>
> volume server
> type protocol/server
> option transport-type tcp/server
> option bind-address 192.168.16.128 # 192.168.16.1 on the other server
> subvolumes brick-ds brick-ds-afr
> option auth.ip.brick-ds.allow 192.168.16.*
> option auth.ip.brick-ds-afr.allow 192.168.16.*
> end-volume
>
> Client configuration:
>
> volume brick-ds-local
> type protocol/client
> option transport-type tcp/client
> option remote-host 192.168.16.128 # 192.168.16.1 on the other machine
> option remote-subvolume brick-ds
> end-volume
>
> volume brick-ds-remote
> type protocol/client
> option transport-type tcp/client
> option remote-host 192.168.16.1 # 192.168.16.128 on the other machine
> option remote-subvolume brick-ds-afr
> end-volume
>
> volume brick-ds-afr
> type cluster/afr
> subvolumes brick-ds-local brick-ds-remote
> option replicate *:2
> end-volume
>
> These are both stock CentOS/RHEL 5 machines. You can demonstrate the
> behavior by rebooting one machine, pulling out the ethernet cable, or
> sending the route out into space (i.e. route add -host 192.168.16.1
> some_disconnected_device). Everything freezes until the connection
> returns; once it comes back up, things keep working again.
>
> Because of this problem, any kind of HA / unify setup will not work for
> me when one of the nodes fails.
>
> Can someone else verify this behavior? If there is some part of the
> logs / strace / gdb output you'd like to see, just let me know. I'd
> really like to use glusterfs in an HA setup, but don't see how with this
> behavior.
>
> Thanks in advance!!
> :august
>
>
> On 9/7/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
> >
> > Hi all -
> >
> > I have a setup based on this:
> > http://www.gluster.org/docs/index.php/GlusterFS_High_Availability_Storage_with_GlusterFS
> > but with only 2 machines. Effectively just a mirror (glusterfsd
> > configuration below). 1.3.1 client and server.
> >
> >
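P.S. For anyone trying to reproduce the hang without physically pulling a
cable: a sketch of the "route out into space" trick from above, using an
iproute2 blackhole route instead of a dummy device (assuming iproute2 is
available, as it is on stock CentOS/RHEL 5):

# silently drop all traffic to the peer - to glusterfs this looks
# just like a machine that died or lost its cable
ip route add blackhole 192.168.16.1/32

# remove the blackhole to watch the mount recover
ip route del blackhole 192.168.16.1/32

Note that a plain "reject" route would not reproduce the hang, since it
returns errors immediately instead of letting packets vanish.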