Re: Architecture advice

Gordan Bobic <gordan@xxxxxxxxxx> · Mon, 12 Jan 2009 18:30:49 +0000

Martin Fick wrote:

Not on the client, anyway. But if you're AFR-ing on
server side, then your client always talks to one server
anyway. The traditional way to handle server failure in that
case is to set up Heartbeat or RHCS to fail over the IP
address resource to the surviving server.

The TCP connection will reset when the fail-over occurs -
I'm not sure how gracefully/transparently GlusterFS
reconnects.
...

1.4 supports a new HA translator that is meant for clients contacting servers that AFR to each other. Like this:

       Client
         |
        HA
       /   \
      /     \
     /       \
Server A   Server B
    |         | 
   AFR       AFR
    | \     / |
    |  \   /  |
    |   \ /   | 
    |    X    |
    |   / \   |
    |  /   \  |
   Vol A   Vol B

I wasn't aware of there being an HA translator built
into GlusterFS, but unless you have proper fencing in place,
failing over IP addresses won't work. Without proper
cluster fencing in place you can easily find yourself in a
split-brain situation where both servers think they have the
same IP address and neither can talk to any of the clients.

...
There is no need for fencing simply because you now use the HA translator.
The assumption in this case is that the servers can still talk
to each other but that one server's connection to the clients
may have died.

That means that 50% of the scope for failure will still wipe you out
because you'll start split-braining. Not the way forward at all. A
fencing setup will at least preserve the data integrity. The correct way 
to handle comms channel failure between client and server is to have 
bonded interfaces going via different physical paths. _ONLY_ dealing 
with the situation where both servers are alive and connected to each 
other but we can only reach one due to an obscure failure somewhere in 
the network (e.g. a failed switch port or a failed NIC in the server) is 
a pretty half-arsed edge case.

Why re-invent the wheel when the tools to deal with these failure modes 
already exist?
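
To make the failure modes under discussion concrete, here is a deliberately crude Python enumeration. Everything in it is an assumption made for illustration: one client, two servers, three links that are each either up or down, and the function names `classify` and `summarize`. It treats a lost server-to-server link without fencing as a split-brain risk, and it does not track which server a fencing setup would keep alive.

```python
from collections import Counter
from itertools import product


def classify(client_sees_a, client_sees_b, servers_linked, fencing):
    """Label one combination of link states for a two-server AFR pair."""
    if not servers_linked and not fencing:
        # Servers cannot replicate to each other and nothing stops
        # both from accepting writes.
        return "split-brain risk"
    if not (client_sees_a or client_sees_b):
        return "unavailable"
    if not servers_linked:
        # Fencing removes one server, so the data stays consistent.
        # Whether the client can reach the survivor is not modelled.
        return "fenced"
    if client_sees_a and client_sees_b:
        return "healthy"
    # Exactly one server is reachable and the servers still replicate:
    # this is the case client-side fail-over handles.
    return "failover"


def summarize(fencing):
    """Count the labels over all eight up/down combinations of the three links."""
    return Counter(
        classify(a, b, linked, fencing)
        for a, b, linked in product([True, False], repeat=3)
    )


if __name__ == "__main__":
    for fencing in (False, True):
        print("fencing" if fencing else "no fencing", dict(summarize(fencing)))
```

Counting states with equal weight says nothing about how likely each failure is in practice. The sketch only shows which combinations client-side fail-over covers and which ones need fencing to keep the data consistent.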

Any failures on the server side may still warrant a fencing setup,
but AFR is not yet set up to work cooperatively with a fencing setup.

It doesn't have to be. If one server in AFR dies nothing spectacular
happens. Things time out and carry on. I don't see what cooperation
there would need to be. RHCS does its own heart-beating and fencing.
Mix and match as required.

Gordan