Hi August,

We try to keep this link in sync with the changes made to the translator
options: http://gluster.org/docs/index.php/GlusterFS_Translators

'doc/translator-options.txt' is present only in the tla archives; it is not
included in the release tarball. I will try to get AFR's self-heal design up
on the wiki.

Thanks and regards,
Amar

On 9/11/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
>
> Hi all -
>
> After combing through the archives, I found the transport-timeout option
> mentioned by avati. Is this described in the wiki docs anywhere? I thought
> I had read through every page, but I don't recall seeing it. The e-mail
> from avati mentioned that it was described in "doc/translator-options.txt",
> but this file does not appear in my glusterfs-1.3.1 tarball.
>
> In any case, for those who have similar issues, making the transport
> timeout much smaller is your friend :-)
>
> Many Thanks!!
> :august
>
> On 9/10/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
> >
> > Hi devs et al,
> >
> > After many hours of sublimation, I was able to condense my previous
> > hanging issue down to this simplest case.
> >
> > To summarize: I have two physical machines, each afr'ing a directory to
> > the other. Both are glusterfs(d) 1.3.1 with glfs3 fuse. iptables is
> > suspended during these tests. Spec files are below.
> >
> > The four situations:
> >
> > 1) If I start up both machines and start glusterfsd on both, I can
> > mount either one from the other and view its files as expected.
> >
> > 2) If I start up only one machine and glusterfsd, I can mount that
> > glusterfsd brick from the same machine and use it (i.e. edit the files)
> > while it tries to connect to the 2nd machine in the background. When I
> > bring up the 2nd machine, it connects and afrs as expected. Compare
> > this to #4.
> >
> > 3) If I start up both machines and glusterfsd on both, mount each
> > other's bricks, verify I can see the files, and then kill glusterfsd on
> > one of them, I can still use and view files on the other one while it
> > tries to reconnect in the background to the glusterfsd that was killed.
> > When it comes back up, everything continues as expected.
> >
> > 4) But if I start up both machines with glusterfsd on both, mount
> > either brick, view the files, and then bring down the other machine
> > (i.e. not kill glusterfsd, but bring down the whole machine suddenly,
> > or pull the ethernet cable), I can no longer see any files on the
> > remaining machine. It just hangs until the machine that is down comes
> > back up, and then it continues on its merry way.
> >
> > This is presumably not the expected behavior, since it is not the
> > behavior in 2) and 3). It is only after the machines have both started
> > up and then one of them goes away that I see this problem. Obviously,
> > however, this is the very situation that calls for an HA setup in the
> > real world. When one server goes offline suddenly, you want to be able
> > to keep on using the first.
> >
> > Here is the simplest spec file configuration that exhibits this problem:
> >
> > Simple server configuration:
> >
> > volume brick-ds
> >   type storage/posix
> >   option directory /.brick-ds
> > end-volume
> >
> > volume brick-ds-afr
> >   type storage/posix
> >   option directory /.brick-ds-afr
> > end-volume
> >
> > volume server
> >   type protocol/server
> >   option transport-type tcp/server
> >   option bind-address 192.168.16.128   # 192.168.16.1 on the other server
> >   subvolumes brick-ds brick-ds-afr
> >   option auth.ip.brick-ds.allow 192.168.16.*
> >   option auth.ip.brick-ds-afr.allow 192.168.16.*
> > end-volume
> >
> > Client configuration:
> >
> > volume brick-ds-local
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.16.128   # 192.168.16.1 on the other machine
> >   option remote-subvolume brick-ds
> > end-volume
> >
> > volume brick-ds-remote
> >   type protocol/client
> >   option transport-type tcp/client
> >   option remote-host 192.168.16.1   # 192.168.16.128 on the other machine
> >   option remote-subvolume brick-ds-afr
> > end-volume
> >
> > volume brick-ds-afr
> >   type cluster/afr
> >   subvolumes brick-ds-local brick-ds-remote
> >   option replicate *:2
> > end-volume
> >
> > These are both stock CentOS/RHEL 5 machines. You can demonstrate the
> > behavior by rebooting one machine, pulling out the ethernet cable, or
> > sending the route out into space (i.e. route add -host 192.168.16.1
> > some_disconnected_device). Everything will be frozen until the
> > connection returns, and then things keep working again after that.
> >
> > Because of this problem, any kind of HA / unify setup will not work for
> > me when one of the nodes fails.
> >
> > Can someone else verify this behavior? If there is some part of the
> > logs / strace / gdb output you'd like to see, just let me know. I'd
> > really like to use glusterfs in an HA setup, but don't see how with
> > this behavior.
> >
> > Thanks in advance!!
> > :august
> >
> >
> > On 9/7/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
> > >
> > > Hi all -
> > >
> > > I have a setup based on this:
> > > http://www.gluster.org/docs/index.php/GlusterFS_High_Availability_Storage_with_GlusterFS
> > > but with only 2 machines. Effectively just a mirror (glusterfsd
> > > configuration below). 1.3.1 client and server.
> > >
> > >
> >
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>

--
Amar Tumballi
Engineer - Gluster Core Team
[bulde on #gluster/irc.gnu.org]
http://www.zresearch.com - Commoditizing Supercomputing and Superstorage!
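
The workaround mentioned at the top of this thread is the transport-timeout
option. As a minimal sketch of how it could be applied to the client spec
above (assuming it is set on the protocol/client volumes; the 10-second
value is purely illustrative and not taken from the thread), each client
volume would gain one line:

volume brick-ds-remote
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.16.1       # 192.168.16.128 on the other machine
  option remote-subvolume brick-ds-afr
  option transport-timeout 10           # assumed value: stop waiting for an unreachable peer after ~10s
end-volume

With a short timeout, calls outstanding to an unreachable brick should
presumably fail after roughly that many seconds instead of blocking
indefinitely, which would avoid the hang described in scenario 4) above.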