Hi all -

After combing through the archives, I found the transport-timeout option
mentioned by avati. Is this described in the wiki docs anywhere? I thought I
had read through every page, but I don't recall seeing it. The e-mail from
avati mentioned that it was described in "doc/translator-options.txt", but
this file does not appear in my glusterfs-1.3.1 tarball.

In any case, for those who have similar issues, making the transport timeout
much smaller is your friend :-)
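For anyone else who finds this thread: since I never found the option
documented, here is a minimal sketch of where it seems to belong, namely in
each protocol/client volume of the client spec (the volume shown matches the
client configuration quoted below; the value 10 is only an example, and the
placement and units are my assumption, not taken from any docs):

volume brick-ds-remote
type protocol/client
option transport-type tcp/client
option remote-host 192.168.16.1
option remote-subvolume brick-ds-afr
option transport-timeout 10 # assumed to be seconds; example value only
end-volume

With a short timeout, the freeze in case 4) below should last only until the
timeout fires, rather than blocking until the dead machine returns.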
Many Thanks!!
:august

On 9/10/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
> Hi devs et al,
>
> After many hours of sublimation, I was able to condense my previous
> hanging issue down to this simplest case.
>
> To summarize: I have two physical machines, each afr'ing a directory to
> the other. Both are glusterfs(d) 1.3.1 with glfs3 fuse. iptables is
> suspended during these tests. Spec files are below.
>
> The four situations:
>
> 1) If I start up both machines and start glusterfsd on both, I can mount
> either one from the other and view its files as expected.
>
> 2) If I start up only one machine and its glusterfsd, I can mount that
> glusterfsd brick from the same machine and use it (i.e. edit the files)
> while it tries to connect to the 2nd machine in the background. When I
> bring up the 2nd machine, it connects and afrs as expected. Compare this
> to #4.
>
> 3) If I start up both machines and glusterfsd on both, mount each other's
> bricks, verify I can see the files, and then kill glusterfsd on one of
> them, I can still use and view files on the other one while it tries to
> reconnect in the background to the glusterfsd that was killed. When it
> comes back up, everything continues as expected.
>
> 4) But if I start up both machines with glusterfsd on both, mount either
> brick, view the files, and then bring down the other machine (i.e. not
> kill glusterfsd, but bring down the whole machine suddenly, or pull the
> ethernet cable), I can no longer see any files on the remaining machine.
> It just hangs until the machine that went down comes back up, and then it
> continues on its merry way.
>
> This is presumably not the expected behavior, since it is not the
> behavior in 2) and 3). It is only after both machines have started up and
> one of them then goes away that I see this problem. Obviously, however,
> this is the very situation that calls for an HA setup in the real world:
> when one server goes offline suddenly, you want to be able to keep on
> using the one that remains.
>
> Here is the simplest spec file configuration that exhibits this problem:
>
> Simple server configuration:
>
> volume brick-ds
> type storage/posix
> option directory /.brick-ds
> end-volume
>
> volume brick-ds-afr
> type storage/posix
> option directory /.brick-ds-afr
> end-volume
>
> volume server
> type protocol/server
> option transport-type tcp/server
> option bind-address 192.168.16.128 # 192.168.16.1 on the other server
> subvolumes brick-ds brick-ds-afr
> option auth.ip.brick-ds.allow 192.168.16.*
> option auth.ip.brick-ds-afr.allow 192.168.16.*
> end-volume
>
> Client configuration:
>
> volume brick-ds-local
> type protocol/client
> option transport-type tcp/client
> option remote-host 192.168.16.128 # 192.168.16.1 on the other machine
> option remote-subvolume brick-ds
> end-volume
>
> volume brick-ds-remote
> type protocol/client
> option transport-type tcp/client
> option remote-host 192.168.16.1 # 192.168.16.128 on the other machine
> option remote-subvolume brick-ds-afr
> end-volume
>
> volume brick-ds-afr
> type cluster/afr
> subvolumes brick-ds-local brick-ds-remote
> option replicate *:2
> end-volume
>
> These are both stock CentOS/RHEL 5 machines. You can demonstrate the
> behavior by rebooting one machine, pulling out the ethernet cable, or
> sending the route out into space (i.e. route add -host 192.168.16.1
> some_disconnected_device). Everything freezes until the connection
> returns; once it comes back up, things keep working again.
>
> Because of this problem, any kind of HA / unify setup will not work for
> me when one of the nodes fails.
>
> Can someone else verify this behavior? If there is some part of the
> logs / strace / gdb output you'd like to see, just let me know. I'd
> really like to use glusterfs in an HA setup, but don't see how with this
> behavior.
>
> Thanks in advance!!
> :august
>
>
> On 9/7/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
> >
> > Hi all -
> >
> > I have a setup based on this:
> > http://www.gluster.org/docs/index.php/GlusterFS_High_Availability_Storage_with_GlusterFS
> > but with only 2 machines. Effectively just a mirror (glusterfsd
> > configuration below). 1.3.1 client and server.
> >
> >
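P.S. For anyone trying to reproduce the hang without physically pulling a
cable: a sketch of the "route out into space" trick from above, using an
iproute2 blackhole route instead of a dummy device (assuming iproute2 is
available, as it is on stock CentOS/RHEL 5):

# silently drop all traffic to the peer - to glusterfs this looks
# just like a machine that died or lost its cable
ip route add blackhole 192.168.16.1/32

# remove the blackhole to watch the mount recover
ip route del blackhole 192.168.16.1/32

Note that a plain "reject" route would not reproduce the hang, since it
returns errors immediately instead of letting packets vanish.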