Hi August,

We try to keep this link in sync with the changes made to the translator
options: http://gluster.org/docs/index.php/GlusterFS_Translators

'doc/translator-options.txt' is present only in the tla archives; it is not
included in the release tarball. I will try to get AFR's self-heal design up
on the wiki.

Thanks and regards,
Amar

On 9/11/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
>
> Hi all -
>
> After combing through the archives, I found the transport-timeout option
> mentioned by avati. Is this described in the wiki docs anywhere? I thought
> I had read through every page, but I don't recall seeing it. The e-mail
> from avati mentioned that it was described in "doc/translator-options.txt",
> but this file does not appear in my glusterfs-1.3.1 tarball.
>
> In any case, for those who have similar issues, making the transport
> timeout much smaller is your friend :-)
>
> Many Thanks!!
> :august
>
> On 9/10/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
> >
> > Hi devs et al,
> >
> > After many hours of sublimation, I was able to condense my previous
> > hanging issue down to this simplest case.
> >
> > To summarize: I have two physical machines, each afr'ing a directory to
> > the other. Both are glusterfs(d) 1.3.1 with glfs3 fuse. iptables is
> > suspended during these tests. Spec files are below.
> >
> > The four situations:
> >
> > 1) If I start up both machines and start glusterfsd on both, I can
> > mount either one from the other and view its files as expected.
> >
> > 2) If I start up only one machine and glusterfsd, I can mount that
> > glusterfsd brick from the same machine and use it (i.e. edit the files)
> > while it tries to connect to the 2nd machine in the background. When I
> > bring up the 2nd machine, it connects and afrs as expected. Compare
> > this to #4.
> >
> > 3) If I start up both machines and glusterfsd on both, mount each
> > other's bricks, verify I can see the files, and then kill glusterfsd on
> > one of them, I can still use and view files on the other one while it
> > tries to reconnect in the background to the glusterfsd that was killed.
> > When it comes back up, everything continues as expected.
> >
> > 4) But if I start up both machines with glusterfsd on both, mount
> > either brick, view the files, and then bring down the other machine
> > (i.e. not kill glusterfsd, but bring down the whole machine suddenly,
> > or pull the ethernet cable), I can no longer see any files on the
> > remaining machine. It just hangs until the machine that is down comes
> > back up, and then it continues on its merry way.
> >
> > This is presumably not the expected behavior, since it is not the
> > behavior in 2) and 3). It is only after the machines have both started
> > up and then one of them goes away that I see this problem. Obviously,
> > however, this is the very situation that calls for an HA setup in the
> > real world. When one server goes offline suddenly, you want to be able
> > to keep on using the first.
> >
> > Here is the simplest spec file configuration that exhibits this problem:
> >
> > Simple server configuration:
> >
> > volume brick-ds
> >   type storage/posix
> >   option directory /.brick-ds
> > end-volume
> >
> > volume brick-ds-afr
> >   type storage/posix
> >   option directory /.brick-ds-afr
> > end-volume
> >
> > volume server
> >   type protocol/server
> >   option transport-type tcp/server
> >   option bind-address 192.168.16.128   # 192.168.16.1 on the other server
> >   subvolumes brick-ds brick-ds-afr
> >   option auth.ip.brick-ds.allow 192.168.16.*
> >   option auth.ip.brick-ds-afr.allow 192.168.16.*
> > end-volume
> >
> > Client configuration:
> >
> > volume brick-ds-local
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.16.128   # 192.168.16.1 on the other machine
> >   option remote-subvolume brick-ds
> > end-volume
> >
> > volume brick-ds-remote
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.16.1   # 192.168.16.128 on the other machine
> >   option remote-subvolume brick-ds-afr
> > end-volume
> >
> > volume brick-ds-afr
> >   type cluster/afr
> >   subvolumes brick-ds-local brick-ds-remote
> >   option replicate *:2
> > end-volume
> >
> > These are both stock CentOS/RHEL 5 machines. You can demonstrate the
> > behavior by rebooting one machine, pulling out the ethernet cable, or
> > sending the route out into space (i.e. route add -host 192.168.16.1
> > some_disconnected_device). Everything will be frozen until the
> > connection returns, and then things keep working again after that.
> >
> > Because of this problem, any kind of HA / unify setup will not work for
> > me when one of the nodes fails.
> >
> > Can someone else verify this behavior? If there is some part of the
> > logs / strace / gdb output you'd like to see, just let me know. I'd
> > really like to use glusterfs in an HA setup, but don't see how with
> > this behavior.
> >
> > Thanks in advance!!
> > :august
> >
> >
> > On 9/7/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
> > >
> > > Hi all -
> > >
> > > I have a setup based on this:
> > > http://www.gluster.org/docs/index.php/GlusterFS_High_Availability_Storage_with_GlusterFS
> > > but with only 2 machines. Effectively just a mirror (glusterfsd
> > > configuration below). 1.3.1 client and server.
> > >
> > >
> >
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>

--
Amar Tumballi
Engineer - Gluster Core Team
[bulde on #gluster/irc.gnu.org]
http://www.zresearch.com - Commoditizing Supercomputing and Superstorage!
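
The workaround mentioned at the top of this thread is the transport-timeout
option. As a minimal sketch of how it could be applied to the client spec
above (assuming it is set on the protocol/client volumes; the 10-second
value is purely illustrative and not taken from the thread), each client
volume would gain one line:

volume brick-ds-remote
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.16.1       # 192.168.16.128 on the other machine
  option remote-subvolume brick-ds-afr
  option transport-timeout 10           # assumed value: stop waiting for an unreachable peer after ~10s
end-volume

With a short timeout, calls outstanding to an unreachable brick should
presumably fail after roughly that many seconds instead of blocking
indefinitely, which would avoid the hang described in scenario 4) above.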