Martin Fick wrote:
Why is that the correct way? There's nothing
wrong with having "bonding" at the glusterfs
protocol level, is there?
The problem is that it only covers a very narrow edge case
that isn't all that likely. A bonded NIC over separate
switches all the way to both servers is a much more sensible
option. Or else what failure are you trying to protect
yourself against? It's a bit like fitting a big padlock
on the door when there's a wall missing.
I think you need to be more specific than using
analogies. My only guess from your assertions is
that you have a very narrow specific use case /
setup / terminology in mind that does not
necessarily mesh with my narrow use case ... :)
LOL! That is a distinct possibility. :)
So, the HA translator supports talking to two
different servers with two different transport
mechanisms and two different IPs. Bonding does
not support anything like this as far as I can
tell.
True. Bonding is more transparent. You make two NICs into one virtual
NIC and round-robin packets down them. If one NIC/path fails, all the
traffic will fail over to the other NIC/path.
So, it seems like you are assuming a
different back end use case, one where the
servers employ the same IP, perhaps using round
robin or perhaps in an active/passive way.
No, not at all. Multiple servers, 1 floating IP per server. Floating as
in it can be migrated to the other server if one fails. You balance the
load by assigning half of your clients to one floating IP, and the other
half of the clients to the other floating IP. So, when both servers are
up, each handles half the load. If one server fails, its IP gets
migrated to the other server, and all clients thereafter talk to the
surviving server since it has both IPs (until the other server comes
back up and asks for its IP address back).
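With heartbeat, for instance, that is only a couple of lines (hostnames
and addresses below are just placeholders):

  # /etc/ha.d/haresources - preferred owner of each floating IP
  server1 IPaddr::10.0.0.101/24/eth0
  server2 IPaddr::10.0.0.102/24/eth0

Half the clients mount 10.0.0.101 and the other half 10.0.0.102; if
server1 dies, heartbeat brings 10.0.0.101 up on server2, and vice versa.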
Both
of these are very different beasts and I would
need to know which you are talking about to
understand what you are getting at. But the HA
translator setup is closer to the round robin
(active/active) setup and I am guessing you
are talking about an active/passive setup.
In general, there are relatively few things that you cannot make
active/active, so I always mean active/active + failover unless I
explicitly state it.
That is somewhat what the HA translator is, except
that it is supposed to take care of some additional
failures. It is supposed to retransmit "in
progress" operations that have not succeeded because of
comm failures (I have yet to figure out where in the code
this happens though).
This is a reinvention of a wheel. NFS already handles this
gracefully for the use-case you are describing.
I am lost, what does NFS have to do with it?
It already handles the "server has gone away" situation gracefully. What
I'm saying is that you can use GlusterFS underneath for mirroring the
data (AFR) and re-export with NFS to the clients. If you want to avoid
client-side AFR and still have graceful failover with lightweight
transport, NFS is not a bad choice.
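As a rough sketch of the server side (volume names, hosts and paths are
only examples, not a tested config), each server would run something
along these lines, mount the "mirror" volume locally, and re-export
that mount point over NFS:

  # glusterfs server-side volfile sketch
  volume local-brick
    type storage/posix
    option directory /data/export
  end-volume

  volume remote-brick
    type protocol/client
    option transport-type tcp
    option remote-host server2          # the other server
    option remote-subvolume local-brick
  end-volume

  volume mirror
    type cluster/afr                    # "replicate" in later releases
    subvolumes local-brick remote-brick
  end-volume

(Each server also needs a protocol/server section exporting its
local-brick to the other one - omitted here for brevity.)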
Why re-invent the wheel when the tools to deal
with these
failure modes already exist?
Are you referring to bonding here? If so, see above
for why HA may be better (or an additional benefit).
My original point is that it doesn't add anything new
that you couldn't achieve with tools that are already
available.
Well, I was trying to explain to you that it
does, but the NFS suggestion has me confused.
How do current tools achieve the following
setup? Client A talks to Server A and
submits a read request. The read request
is received on Server A (TCP acked to the
client), and then Server A dies. How will
that read request then be completed without
glusterfs returning an "endpoint not
connected" error?
You make client <-> server comms NFS.
You make server <-> server comms GlusterFS.
If the NFS server goes away, the client will keep retrying until the
server returns. In this case, that would mean it'll keep retrying until
the other server fails the IP address over to itself.
This achieves:
1) server side AFR with GlusterFS for redundancy
2) client connects to a single server via NFS so there's no
double-bandwidth used by the client
3) servers can fail over relatively transparently to the client
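Concretely, something along these lines on each server (paths, network
and fsid are made-up examples):

  # mount the mirrored glusterfs volume locally
  glusterfs -f /etc/glusterfs/server.vol /mnt/gluster

  # /etc/exports - re-export the FUSE mount over kernel NFS
  # (fsid= is needed when exporting a FUSE filesystem)
  /mnt/gluster  10.0.0.0/24(rw,sync,fsid=10,no_subtree_check)

The clients then mount one of the floating IPs rather than a server's
own address, so the mount follows the IP when it fails over.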
No, I have not confirmed that this actually
works with the HA translator, but I was told
that the following would happen if it were
used. Client A talks to Server A and
submits a read request. The read request
is received on Server A (TCP acked to the
client), and then Server A dies. Client A
will then in theory retry the read request
on Server B. Bonding cannot do anything
like this (since the read was tcp ACKed)?
Agreed, if a server fails, bonding won't help. Cluster fail-over
server-side, however, will, provided the network file system protocol
can deal with it reasonably well.
Neither can heartbeat/failover
of an active/passive backend, since on the
first failure the client will get a
connection error and the glusterfs client
protocol does not retransmit.
This is where I clearly failed to clarify what I meant. I was talking
about using NFS for the client<->server part of the communication. NFS
will typically block until the server starts responding again (note: it
doesn't have to be the same server, just one like it).
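On the client side that blocking behaviour comes from using a hard NFS
mount (the default), e.g. (address and paths made up):

  # hard mounts block and retry indefinitely if the server stops
  # responding; "intr" just lets the wait be interrupted by a signal
  mount -t nfs -o hard,intr 10.0.0.101:/mnt/gluster /mnt/data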
I think that this is quite different from
any bonding solution. Not better, different.
If I were to use this, it would not preclude
me from also using bonding, but it solves a
somewhat different problem. It is not a
complete solution, it is a piece, but not
a duplicated piece. If you don't like it,
or it doesn't fit your backend use case,
don't use it! :)
If it can handle the described failure more gracefully than what I'm
proposing, then I'm all for it. I'm just not sure there is that much
scope for it being better since the last write may not have made it to
the mirror server anyway, so even if the protocol can re-try, it would
need to have some kind of journaling, roll back the journal and replay
the operation.
This, however, is a much more complex approach (very similar to what GFS
does), and there is a high price to pay in terms of performance when the
nodes aren't on the same LAN.
Yes, if a server goes down you are fine (aside from the
scenario where the other server then goes down followed
by the first one coming back up). But, if you are using
the HA translator above and the communication goes down
between the two servers you may still get split brain
(thus the need for heartbeat/fencing).
And therein lies the problem - unless you are proposing
adding a complete fencing infrastructure into glusterfs,
too.
No. I am proposing adding a complete transactional
model to AFR so that if a write fails on one node,
some policy can decide whether the same write
should be committed or rolled back on the other
nodes. Today, the policy is to simply apply it to
the other nodes regardless. This is a recipe for
split brain.
OK, I get what you mean. It's basically the same problem I described
above when I mentioned that you'd need some kind of a journal to
roll back the operation that hasn't been fully committed.
In the case of network segregation some policy
should decide to allow writes to be applied
to one side of the segregation and denied on the
other. This does not require fencing (but it
would be better with it), it could be a simple
policy like: "apply writes if a majority of nodes
can be reached", if not fail (or block would be
even better).
Hmm... This could lead to an elastic shifting quorum. I'm not sure how
you'd handle resyncing if nodes are constantly leaving/joining. It seems
a bit non-deterministic.
AFR needs to be able to write all or nothing to all
servers until some external policy machine (such as
heartbeat) decides that it is safe (because of fencing or
other mechanism) to proceed writing to only a portion of the
subvolumes (servers). Without this, I don't see how you
can prevent split brain.
With server-side AFR, split brain cannot really occur (OK,
there's a tiny window of opportunity for it if the
server isn't really totally dead since there's no
total FS lock-out until fencing is completed like on GFS,
but it's probably close enough). If the servers
can't heartbeat to each other, they can't AFR to
each other, either. So either the write gets propagated, or
it doesn't. The machine that remained operational will
have more up to date files and as necessary those will get
synced back. It's not quite as tight as GFS in terms of
ensuring data consistency like a DRBD+GFS solution would be,
but it is probably close enough for most use-cases.
I guess what you call tiny, I call huge. Even if
you have your heartbeat fencing occur in under a
tenth of a second, that is time enough to split
brain a major portion of a filesystem. I would
never trust it.
In GlusterFS that problem exists anyway, but it is largely mitigated by
the fact that it works on file level rather than block device level. In
the case of GFS, RHCS will block all access to the file system until the
node is successfully fenced and confirmed fenced before rolling back
its journals and resuming operation.
To borrow your analogy, adding heartbeat to the
current AFR: "It's a bit like fitting a big
padlock on the door when there's a wall missing."
:)
Every single write needs to ensure that it will
not cause split brain for me to trust it.
Sounds like GlusterFS isn't necessarily the solution for you, then. :(
If not, why would I bother with glusterfs over
AFR instead of glusterfs over DRBD? Oh right,
because I cannot get glusterfs to fail over without
incurring connection errors on the client! ;)
(not your beef, I know, from another thread)
Precisely - which is why I originally suggested not using GlusterFS for
client-server communication. :)
This is one reason I was hoping that the HA
translator would address this, but the HA
translator is useless in an active/passive
backend setup, it only works in active/active.
If you try using it in an active/passive setup,
during failover it will retry too quickly on
the second server causing connection errors
on the client!!! This is the primary reason
that I am suggesting that the HA translator
block until the connection is restored, it
would allow for failovers to occur.
And this is exactly why I suggested using NFS for the client<->server
connection. NFS blocks until the server becomes contactable again.
But, to be clear, I am not disagreeing with you
that the HA translator does not solve the split
brain problem at all. Perhaps this is what is
really "upsetting" you, not that it is
"duplicated" functionality, but rather that it
does not help AFR solve its split brain
personality disorders, it only helps make them
more available, thus making split brain even
more likely!! ;(
I'm not sure it makes it any worse WRT split-brain, it just seems that
you are looking for GlusterFS+HA to provide you with exactly the same
set of features that NFS+(server fail-over) already provides. Of course,
there could be advantages in GlusterFS behaving the same way as NFS when
the server goes away if it's a single-server setup - it would be easier
to set up and a bit more elegant. But it wouldn't add any functionality
that couldn't be re-created using the sort of setup I described.
Gordan