OK, for higher availability 5 nodes are better than 3, so we'll run 5. However, we want normal operations with just 2 nodes up. Is that possible?

Eventually 2 of the nodes will be in the next building, 10 feet away with a brick wall in between, connected with InfiniBand or better. That way one room can go offline and the other stays up. By the flip of the coin, the 3-node room is the one that will probably go down.

All systems will have dual power supplies connected to different UPSes. In addition we have a power generator; later we'll add a second generator, and the UPSes will then be fed from separate lines attached to those generators.

Also, of course, we never count on one cluster to hold our data. We have 2 co-location sites that we back up to frequently using zfs send/receive and/or rsync.

So for the 5-node cluster, how do we set it so that 2 nodes up = OK? Or is that a bad idea?

PS: any other ideas on how to increase availability are welcome.
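If I understand the knobs correctly, something along these lines is what I have in mind -- the pool name "rbd" and the 4/2 numbers are only placeholders, and this assumes a CRUSH rule that places 2 of the 4 copies in each room:

  # placeholders: 4 copies total, keep serving I/O with only 2 copies left
  ceph osd pool set rbd size 4
  ceph osd pool set rbd min_size 2
  # check which monitors currently form the quorum
  ceph quorum_status

But if I read the quorum explanation below correctly, that only covers the OSD side: with 5 monitors, 3 still have to be up to form a quorum, so the 2-node room on its own would never be enough for the monitors. Is that right?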
On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer <chibi at gol.com> wrote:

> On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
>
> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
> > >
> > > Hello,
> > >
> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > >
> > >> Hello Christian,
> > >>
> > >> Let me supply more info and answer some questions.
> > >>
> > >> * Our main concern is high availability, not speed. Our storage
> > >> requirements are not huge. However, we want good keyboard response
> > >> 99.99% of the time. We mostly do data entry and reporting: 20-25
> > >> users doing mostly order and invoice processing and email.
> > >>
> > >> * DRBD has been very reliable, but I am the SPOF. Meaning that when
> > >> split brain occurs [every 18-24 months] it is me or no one who knows
> > >> what to do. Try to explain how to deal with split brain in
> > >> advance.... For the future, Ceph looks like it will be easier to
> > >> maintain.
> > >>
> > > The DRBD people would of course tell you to configure things in a way
> > > that a split brain can't happen. ^o^
> > >
> > > Note that given the right circumstances (too many OSDs down, MONs
> > > down) Ceph can wind up in a similar state.
> > >
> > I am not sure what you mean by Ceph winding up in a similar state. If
> > you mean 'split brain' in the usual sense of the term, it does not
> > occur in Ceph. If it does, you have surely found a bug and you should
> > let us know with lots of CAPS.
> >
> > What you can incur, though, if you have too many monitors down, is
> > cluster downtime. The monitors will ensure you need a strict majority
> > of monitors up in order to operate the cluster, and will not serve
> > requests if said majority is not in place. The monitors will only
> > serve requests when there's a formed 'quorum', and a quorum is only
> > formed by (N/2)+1 monitors, N being the total number of monitors in
> > the cluster (via the monitor map -- monmap).
> >
> > This said, if out of 3 monitors you have 2 monitors down, your cluster
> > will cease functioning (no admin commands, no writes or reads served).
> > As there is no configuration in which you can have two strict
> > majorities, and thus no two partitions of the cluster can function at
> > the same time, you do not incur split brain.
> >
> I wrote "similar state", not "same state".
>
> From a user perspective it is purely semantics how and why your shared
> storage has seized up; the end result is the same.
>
> And yes, that MON example was exactly what I was aiming for: your
> cluster might still have all the data (another potential failure mode,
> of course), but it is inaccessible.
>
> DRBD will see and call it a split brain, Ceph will call it a Paxos
> voting failure; it doesn't matter one iota to the poor sod relying on
> that particular storage.
>
> My point was and is: when you design a cluster of whatever flavor, make
> sure you understand how it can (and WILL) fail, how to prevent that
> from happening if at all possible, and how to recover from it if not.
>
> Potentially (hopefully), in the case of Ceph that would just mean
> getting a missing MON back up.
> But given that the failed MON might have a corrupted leveldb (it
> happened to me), that would put Robert back at square one, as in, a
> highly qualified engineer has to deal with the issue.
> I.e. somebody who can say "screw this dead MON, let's get a new one in"
> and is capable of doing so.
>
> Regards,
>
> Christian
>
> > If you are a creative admin, however, you may be able to enforce split
> > brain by modifying monmaps. In the end you'd obviously end up with two
> > distinct monitor clusters, but if you so happened to not inform the
> > clients about this, there's a fair chance it would cause havoc with
> > unforeseen effects. Then again, this would be the operator's fault,
> > not Ceph's -- especially because rewriting monitor maps is not trivial
> > enough for someone to mistakenly do something like this.
> >
> > -Joao
> >
>
> --
> Christian Balzer           Network/Systems Engineer
> chibi at gol.com              Global OnLine Japan/Fusion Communications
> http://www.gol.com/