HA failover test unsuccessful (inaccessible mountpoint)

"Daniel Maher" <dma+gluster@xxxxxxxxx> · Wed, 2 Apr 2008 18:58:39 +0200

Hello all,

First off, thanks for the great feedback I received during the course
of the day so far.  I set up a four-machine test network (two file
servers, two clients) in order to evaluate Gluster for an upcoming
upgrade / consolidation project.

I based my configuration on the HA w/ 1.3 document on the wiki :
http://www.gluster.org/docs/index.php/GlusterFS_1.3_High_Availability_Storage_with_GlusterFS

Setting up the servers and clients was easy, and it worked immediately
(which is quite a change from the usual problems one has with
network-aware file systems).  Unfortunately, it failed on one crucial
test : failover.

Briefly stated, when I physically unplugged one of the two (mirrored)
file servers from the network, the mountpoint on the clients became
completely inaccessible.  Attempting to change to the directory, modify
files, or even list the contents of the parent directory resulted in a
hung terminal session.  This state remained until the unplugged file
server was reattached to the network.
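
One way to observe this state without losing the terminal is to run the
directory listing in a background thread and give up after a timeout.
The following is a minimal Python sketch under stated assumptions, not
part of the original test: the function name probe_directory and the
example path /mnt/glusterfs are hypothetical, and the five-second
timeout is arbitrary.

```python
import os
import threading


def probe_directory(path, timeout_s=5.0):
    """Try to list `path`; return "ok", "error: <reason>", or "timeout".

    The listing runs in a daemon thread so that a blocked filesystem call
    cannot hang the caller. If the call never returns, the thread is simply
    abandoned when the interpreter exits.
    """
    result = {}

    def worker():
        try:
            os.listdir(path)
            result["status"] = "ok"
        except OSError as exc:
            result["status"] = f"error: {exc.strerror}"

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)  # wait at most timeout_s seconds
    return result.get("status", "timeout")


# Hypothetical usage; /mnt/glusterfs is an assumed mountpoint:
#   print(probe_directory("/mnt/glusterfs"))
```

Calling probe_directory on the mountpoint before and after unplugging a
server should show whether the listing succeeds, fails with an error, or
blocks past the timeout, which is what the hung sessions above suggest.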

I was under the impression that this would not be the case; indeed,
from what I've read in the documentation, the mountpoint should have
continued to be accessible (since the other file server was still alive
and well).  Ideally, in an HA environment, having one of the
storage nodes disappear should /not/ bring down the entire storage
cluster.

I'm curious to know if this is the expected behaviour (which I doubt),
or if I've simply missed something in my configuration which would
cause this (more likely ;) ).

And now, for the gritty details...

The four machines each have two network interfaces; eth0 is connected
to the "general" network (192.168.0.*), and eth1 is connected to a
physically distinct gigabit network (10.0.0.*), upon which only
gluster-related interactions are meant to travel.

A DNS zone called "storage-net.gfs" was set up, with each of the
machines being assigned A-records within this zone (10.0.0.* /
dfs[ABCD].storage-net.gfs).  dfs[AB] are the clients, and dfs[CD] are
the servers.  Finally, "cluster.storage-net.gfs" was assigned
round robin-style to dfs[CD] (again, as per the documentation).
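
Since "cluster.storage-net.gfs" resolves to both servers, it can help to
check which addresses the name returns and whether each one accepts
connections.  The sketch below is illustrative only: the helper names
are hypothetical, and the port must be supplied by the caller (6996 is
assumed in the usage comment as the listen port for the 1.3 series;
substitute whatever the server configuration actually uses).

```python
import socket


def resolve_all(hostname):
    """Return the sorted, de-duplicated IPv4 addresses behind `hostname`."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def is_reachable(address, port, timeout_s=3.0):
    """Return True if a TCP connection to (address, port) succeeds in time."""
    try:
        with socket.create_connection((address, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_cluster(hostname, port, timeout_s=3.0):
    """Map each address behind `hostname` to its TCP reachability."""
    return {
        address: is_reachable(address, port, timeout_s)
        for address in resolve_all(hostname)
    }


# Hypothetical usage; the port number is an assumption:
#   print(check_cluster("cluster.storage-net.gfs", 6996))
```

With one server unplugged, the expectation is that the name still
returns both addresses while only one of them is reachable.  This only
shows what the DNS name hands out; it does not by itself explain why
the mountpoint hangs.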

A graphical overview of the test network may be interesting :
http://tinypic.info/files/xhvyldlesd8igvjt8yl1.png

As I noted above, I followed the HA document to create both the server
and client configurations.  The server configuration :
http://pastebin.ca/967749

And the client configuration :
http://pastebin.ca/967754

-- 
Daniel Maher <dma AT witbe.net>