Martin Fick wrote:
Why is that the correct way? There's nothing
wrong with having "bonding" at the glusterfs
protocol level, is there?
The problem is that it only covers a very narrow edge case
that isn't all that likely. A bonded NIC over separate
switches all the way to both servers is a much more sensible
option. Or else what failure are you trying to protect
yourself against? It's a bit like fitting a big padlock
on the door when there's a wall missing.
I think you need to be more specific than using
analogies. My only guess from your assertions is
that you have a very narrow specific use case /
setup / terminology in mind that does not
necessarily mesh with my narrow use case ... :)
LOL! That is a distinct possibility. :)
So, the HA translator supports talking to two
different servers with two different transport
mechanisms and two different IPs. Bonding does
not support anything like this as far as I can
tell.
True. Bonding is more transparent. You make two NICs into one virtual
NIC and round-robin packets down them. If one NIC/path fails, all the
traffic will fail over to the other NIC/path.
So, it seems like you are assuming a
different back end use case, one where the
servers employ the same IP, perhaps using round
robin or perhaps in an active/passive way.
No, not at all. Multiple servers, 1 floating IP per server. Floating as
in it can be migrated to the other server if one fails. You balance the
load by assigning half of your clients to one floating IP, and the other
half of the clients to the other floating IP. So, when both servers are
up, each handles half the load. If one server fails, its IP gets
migrated to the other server, and all clients thereafter talk to the
surviving server since it has both IPs (until the other server comes
back up and asks for its IP address back).
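With heartbeat, for instance, that is only a couple of lines (hostnames
and addresses below are just placeholders):

  # /etc/ha.d/haresources - preferred owner of each floating IP
  server1 IPaddr::10.0.0.101/24/eth0
  server2 IPaddr::10.0.0.102/24/eth0

Half the clients mount 10.0.0.101 and the other half 10.0.0.102; if
server1 dies, heartbeat brings 10.0.0.101 up on server2, and vice versa.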
Both
of these are very different beasts and I would
need to know which you are talking about to
understand what you are getting at. But the HA
translator setup is closer to the round robin
(active/active) setup and I am guessing you
are talking about an active/passive setup.
In general, there are relatively few things that you cannot make
active/active, so I always mean active/active + failover unless I
explicitly state it.
That is somewhat what the HA translator is, except
that it is supposed to take care of some additional
failures. It is supposed to retransmit "in
progress" operations that have not succeeded because of
comm failures (I have yet to figure out where in the code
this happens though).
This is a reinvention of a wheel. NFS already handles this
gracefully for the use-case you are describing.
I am lost, what does NFS have to do with it?
It already handles the "server has gone away" situation gracefully. What
I'm saying is that you can use GlusterFS underneath for mirroring the
data (AFR) and re-export with NFS to the clients. If you want to avoid
client-side AFR and still have graceful failover with lightweight
transport, NFS is not a bad choice.
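As a rough sketch of the server side (volume names, hosts and paths are
only examples, not a tested config), each server would run something
along these lines, mount the "mirror" volume locally, and re-export
that mount point over NFS:

  # glusterfs server-side volfile sketch
  volume local-brick
    type storage/posix
    option directory /data/export
  end-volume

  volume remote-brick
    type protocol/client
    option transport-type tcp
    option remote-host server2          # the other server
    option remote-subvolume local-brick
  end-volume

  volume mirror
    type cluster/afr                    # "replicate" in later releases
    subvolumes local-brick remote-brick
  end-volume

(Each server also needs a protocol/server section exporting its
local-brick to the other one - omitted here for brevity.)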
Why re-invent the wheel when the tools to deal
with these
failure modes already exist?
Are you referring to bonding here? If so, see above
for why HA may be better (or an additional benefit).
My original point is that it doesn't add anything new
that you couldn't achieve with tools that are already
available.
Well, I was trying to explain to you that it
does, but the NFS suggestion has me confused.
How do current tools achieve the following
setup? Client A talks to Server A and
submits a read request. The read request
is received on Server A (TCP acked to the
client), and then Server A dies. How will
that read request then be completed without
glusterfs returning an "endpoint not
connected" error?
You make client <-> server comms NFS.
You make server <-> server comms GlusterFS.
If the NFS server goes away, the client will keep retrying until the
server returns. In this case, that would mean it'll keep retrying until
the other server fails the IP address over to itself.
This achieves:
1) server side AFR with GlusterFS for redundancy
2) client connects to a single server via NFS so there's no
double-bandwidth used by the client
3) servers can fail over relatively transparently to the client
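Concretely, something along these lines on each server (paths, network
and fsid are made-up examples):

  # mount the mirrored glusterfs volume locally
  glusterfs -f /etc/glusterfs/server.vol /mnt/gluster

  # /etc/exports - re-export the FUSE mount over kernel NFS
  # (fsid= is needed when exporting a FUSE filesystem)
  /mnt/gluster  10.0.0.0/24(rw,sync,fsid=10,no_subtree_check)

The clients then mount one of the floating IPs rather than a server's
own address, so the mount follows the IP when it fails over.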
No, I have not confirmed that this actually
works with the HA translator, but I was told
that the following would happen if it were
used. Client A talks to Server A and
submits a read request. The read request
is received on Server A (TCP acked to the
client), and then Server A dies. Client A
will then in theory retry the read request
on Server B. Bonding cannot do anything
like this (since the read was tcp ACKed)?
Agreed, if a server fails, bonding won't help. Cluster fail-over
server-side, however, will, provided the network file system protocol
can deal with it reasonably well.
Neither can heartbeat/failover
of an active/passive backend, since on the
first failure the client will get a
connection error and the glusterfs client
protocol does not retransmit.
This is where I clearly failed to clarify what I meant. I was talking
about using NFS for the client<->server part of the communication. NFS
will typically block until the server starts responding again (note: it
doesn't have to be the same server, just one like it).
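On the client side that blocking behaviour comes from using a hard NFS
mount (the default), e.g. (address and paths made up):

  # hard mounts block and retry indefinitely if the server stops
  # responding; "intr" just lets the wait be interrupted by a signal
  mount -t nfs -o hard,intr 10.0.0.101:/mnt/gluster /mnt/data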
I think that this is quite different from
any bonding solution. Not better, different.
If I were to use this, it would not preclude
me from also using bonding, but it solves a
somewhat different problem. It is not a
complete solution, it is a piece, but not
a duplicated piece. If you don't like it,
or it doesn't fit your backend use case,
don't use it! :)
If it can handle the described failure more gracefully than what I'm
proposing, then I'm all for it. I'm just not sure there is that much
scope for it being better since the last write may not have made it to
the mirror server anyway, so even if the protocol can re-try, it would
need to have some kind of journaling, roll back the journal and replay
the operation.
This, however, is a much more complex approach (very similar to what GFS
does), and there is a high price to pay in terms of performance when the
nodes aren't on the same LAN.
Yes, if a server goes down you are fine (aside from the
scenario where the other server then goes down followed
by the first one coming back up). But, if you are using
the HA translator above and the communication goes down
between the two servers you may still get split brain
(thus the need for heartbeat/fencing).
And therein lies the problem - unless you are proposing
adding a complete fencing infrastructure into glusterfs,
too.
No. I am proposing adding a complete transactional
model to AFR so that if a write fails on one node,
some policy can decide whether the same write
should be committed or rolled back on the other
nodes. Today, the policy is to simply apply it to
the other nodes regardless. This is a recipe for
split brain.
OK, I get what you mean. It's basically the same problem I described
above when I mentioned that you'd need some kind of a journal to
roll back the operation that hasn't been fully committed.
In the case of network segregation some policy
should decide to allow writes to be applied
to one side of the segregation and denied on the
other. This does not require fencing (but it
would be better with it), it could be a simple
policy like: "apply writes if a majority of nodes
can be reached", if not fail (or block would be
even better).
Hmm... This could lead to an elastic shifting quorum. I'm not sure how
you'd handle resyncing if nodes are constantly leaving/joining. It seems
a bit non-deterministic.
AFR needs to be able to write all or nothing to all
servers until some external policy machine (such as
heartbeat) decides that it is safe (because of fencing or
other mechanism) to proceed writing to only a portion of the
subvolumes (servers). Without this, I don't see how you
can prevent split brain.
With server-side AFR, split brain cannot really occur (OK,
there's a tiny window of opportunity for it if the
server isn't really totally dead since there's no
total FS lock-out until fencing is completed like on GFS,
but it's probably close enough). If the servers
can't heartbeat to each other, they can't AFR to
each other, either. So either the write gets propagated, or
it doesn't. The machine that remained operational will
have more up to date files and as necessary those will get
synced back. It's not quite as tight as GFS in terms of
ensuring data consistency like a DRBD+GFS solution would be,
but it is probably close enough for most use-cases.
I guess what you call tiny, I call huge. Even if
you have your heartbeat fencing occur in under a
tenth of a second, that is time enough to split
brain a major portion of a filesystem. I would
never trust it.
In GlusterFS that problem exists anyway, but it is largely mitigated by
the fact that it works on file level rather than block device level. In
the case of GFS, RHCS will block all access to the file system until the
node is successfully fenced and confirmed fenced before rolling back
its journals and resuming operation.
To borrow your analogy, adding heartbeat to the
current AFR: "It's a bit like fitting a big
padlock on the door when there's a wall missing."
:)
Every single write needs to ensure that it will
not cause split brain for me to trust it.
Sounds like GlusterFS isn't necessarily the solution for you, then. :(
If not, why would I bother with glusterfs over
AFR instead of glusterfs over DRBD? Oh right,
because I cannot get glusterfs to fail over without
incurring connection errors on the client! ;)
(not your beef, I know, from another thread)
Precisely - which is why I originally suggested not using GlusterFS for
client-server communication. :)
This is one reason I was hoping that the HA
translator would address this, but the HA
translator is useless in an active/passive
backend setup, it only works in active/active.
If you try using it in an active/passive setup,
during failover it will retry too quickly on
the second server causing connection errors
on the client!!! This is the primary reason
that I am suggesting that the HA translator
block until the connection is restored, it
would allow for failovers to occur.
And this is exactly why I suggested using NFS for the client<->server
connection. NFS blocks until the server becomes contactable again.
But, to be clear, I am not disagreeing with you
that the HA translator does not solve the split
brain problem at all. Perhaps this is what is
really "upsetting" you, not that it is
"duplicated" functionality, but rather that it
does not help AFR solve its split brain
personality disorders, it only helps make them
more available, thus making split brain even
more likely!! ;(
I'm not sure it makes it any worse WRT split-brain, it just seems that
you are looking for GlusterFS+HA to provide you with exactly the same
set of features that NFS+(server fail-over) already provides. Of course,
there could be advantages in GlusterFS behaving the same way as NFS when
the server goes away if it's a single-server setup - it would be easier
to set up and a bit more elegant. But it wouldn't add any functionality
that couldn't be re-created using the sort of setup I described.
Gordan