Hi August,
I can confirm the problem with your setup. I also have a four-server glusterfsd
setup running 1.3.1, with several glusterfs clients using the glfs3 fuse.
One of these four servers had a hardware failure and was no longer
reachable -> the side effect was that none of my glusterfs clients
could write anything to the mounted glusterfs share. I have built a new
test machine and replaced the old one with it. This week I will probably
have more time for playing and testing with glusterfs (also with
some performance translators).
I will test "option transport-timeout X" and see what happens when
I take one of the servers off the net.
Regards,
Matthias
August R. Wohlt wrote:
Hi all -
After combing through the archives, I found the transport-timeout
option mentioned by avati. Is this described in the wiki docs
anywhere? I thought I had read through every page, but don't recall
seeing it. The e-mail from avati mentioned that it was described in
"doc/translator-options.txt" but this file does not appear in my
glusterfs-1.3.1 tarball.
In any case, for those who have similar issues, setting the transport
timeout much lower is your friend :-)
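For anyone following along, here is a minimal sketch of what I mean, assuming
(as avati's e-mail suggests) that the option goes into the protocol/client
volumes and takes a value in seconds; the 10 below is just an illustrative
choice, not a documented default:
volume brick-ds-remote
type protocol/client
option transport-type tcp/client
option remote-host 192.168.16.1 # the server that may go away
option remote-subvolume brick-ds-afr
option transport-timeout 10 # example value: give up on an unreachable server after ~10 seconds
end-volume
The idea is that the client then gives up on the unreachable server after the
timeout instead of hanging until it comes back.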
Many Thanks!!
:august
On 9/10/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
Hi devs et al,
After many hours of sublimation, I was able to condense my previous hanging
issue down to the simplest case below.
To summarize: I have two physical machines, each afr'ing a directory to the
other. Both are running glusterfs(d) 1.3.1 with glfs3 fuse. iptables is disabled
during these tests. Spec files are below.
The four situations:
1) If I start up both machines and start up glusterfsd on both machines, I
can mount either one from the other and view its files as expected.
2) If I start up only one machine and its glusterfsd, I can mount that
glusterfsd brick from the same machine and use it (i.e. edit the files) while
it tries to connect to the 2nd machine in the background. When I bring up
the 2nd machine, it connects and afrs as expected. Compare this to 4).
3) If I start up both machines and glusterfsd on both, mount each other's
bricks, verify I can see the files, and then kill glusterfsd on one of them,
I can still use and view files on the other one while it tries to reconnect
in the background to the glusterfsd that was killed. When it comes back up,
everything continues as expected.
4) But, if I start up both machines with glusterfsd on both, mount either
brick and view the files, and then bring down the other machine (i.e. not kill
glusterfsd, but bring down the whole machine suddenly, or pull the ethernet
cable), I can no longer see any files on the remaining machine. It just
hangs until the machine that is down comes back up, and then it continues on
its merry way.
This is presumably not the expected behavior, since it is not the behavior in
2) and 3). It is only after both machines have started up and then one
of them goes away that I see this problem. Obviously, however, this is the
very situation that calls for an HA setup in the real world. When one server
goes offline suddenly, you want to be able to keep using the other one.
Here is the simplest spec file configuration that exhibits this problem:
Simple server configuration:
volume brick-ds
type storage/posix
option directory /.brick-ds
end-volume
volume brick-ds-afr
type storage/posix
option directory /.brick-ds-afr
end-volume
volume server
type protocol/server
option transport-type tcp/server
option bind-address 192.168.16.128 # 192.168.16.1 on the other server
subvolumes brick-ds brick-ds-afr
option auth.ip.brick-ds.allow 192.168.16.*
option auth.ip.brick-ds-afr.allow 192.168.16.*
end-volume
Client configuration:
volume brick-ds-local
type protocol/client
option transport-type tcp/client
option remote-host 192.168.16.128 # 192.168.16.1 on the other machine
option remote-subvolume brick-ds
end-volume
volume brick-ds-remote
type protocol/client
option transport-type tcp/client
option remote-host 192.168.16.1 # 192.168.16.128 on the other machine
option remote-subvolume brick-ds-afr
end-volume
volume brick-ds-afr
type cluster/afr
subvolumes brick-ds-local brick-ds-remote
option replicate *:2
end-volume
These are both stock CentOS/RHEL 5 machines. You can demonstrate the
behavior by rebooting one machine, pulling out the ethernet cable, or
sending the route out into space (i.e. route add -host 192.168.16.1
some_disconnected_device). Everything will be frozen until the connection
returns, and then when it comes back up, things keep working again after
that.
Because of this problem, any kind of HA / unify setup will not work for me
when one of the nodes fails.
Can someone else verify this behavior? If there is some part of the logs /
strace / gdb output you'd like to see, just let me know. I'd really like to
use glusterfs in an HA setup, but don't see how with this behavior.
Thanks in advance!!
:august
On 9/7/07, August R. Wohlt <glusterfs@xxxxxxxxxxx> wrote:
Hi all -
I have a setup based on this:
http://www.gluster.org/docs/index.php/GlusterFS_High_Availability_Storage_with_GlusterFS
but with only 2 machines. Effectively just a mirror (glusterfsd
configuration below). 1.3.1 client and server.
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
http://lists.nongnu.org/mailman/listinfo/gluster-devel