If you have two rooms then I'd go for two OSD nodes in each room, a target replication level of 3 with a minimum of 1, replicated at the node level, and 5 monitors, with the last monitor outside of either room (the other MONs can share hardware with the OSD nodes if needed). That gives you 'safe' replication for OSD/node replacement on failure, with some 'shuffle' room for when it's needed, and either room can be down while the external fifth monitor supplies the tie-breaking vote that lets a single room keep operating.

There's no way to do a 3/2 MON split that doesn't risk the two surviving monitors being up yet unable to serve data while the other three are down, so you'd need to find a way to make it a 2/2/1 split instead.
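In command form that layout might look roughly like the following. This is only a sketch: the pool name 'rbd' and the room/host bucket names are placeholders, not anything from this thread, and you'd still want a CRUSH rule that puts copies in both rooms rather than just on distinct hosts.

    # Tell CRUSH about the two rooms and move the OSD hosts into them
    # (bucket and host names here are made up):
    ceph osd crush add-bucket room1 room
    ceph osd crush add-bucket room2 room
    ceph osd crush move room1 root=default
    ceph osd crush move room2 root=default
    ceph osd crush move node1 room=room1
    ceph osd crush move node2 room=room1
    ceph osd crush move node3 room=room2
    ceph osd crush move node4 room=room2

    # Replication level 3, but keep serving I/O with only one copy left:
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 1

The min_size of 1 is what makes "normal operations with just 2 nodes" possible, at the cost of running on a single copy for the duration of such an outage.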
-Michael

On 28/07/2014 18:41, Robert Fantini wrote:
> OK, for higher availability 5 nodes is better than 3, so we'll run 5. However, we want normal operations with just 2 nodes up. Is that possible?
>
> Eventually 2 nodes will be in the next building, 10 feet away, with a brick wall in between, connected with InfiniBand or better. So one room can go offline and the other will stay on. The flip of the coin means the 3-node room will probably be the one to go down.
> All systems will have dual power supplies connected to different UPSes. In addition we have a power generator; later we'll add a second generator, and then the UPSes will use different lines attached to those generators somehow.
> Also, of course, we never count on one cluster to hold our data. We have 2 co-locations, with backups going out often using zfs send/receive and/or rsync.
>
> So for the 5-node cluster, how do we set it so that 2 nodes up = OK? Or is that a bad idea?
>
> PS: any other ideas on how to increase availability are welcome.
>
>
> On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer <chibi at gol.com> wrote:
>
> On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
>
> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
> > >
> > > Hello,
> > >
> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > >
> > >> Hello Christian,
> > >>
> > >> Let me supply more info and answer some questions.
> > >>
> > >> * Our main concern is high availability, not speed. Our storage requirements are not huge. However, we want good keyboard response 99.99% of the time. We mostly do data entry and reporting: 20-25 users doing mostly order and invoice processing plus email.
> > >>
> > >> * DRBD has been very reliable, but I am the SPOF. Meaning that when split brain occurs [every 18-24 months] it is me or no one who knows what to do. Try explaining how to deal with split brain in advance... Going forward, Ceph looks like it will be easier to maintain.
> > >>
> > > The DRBD people would of course tell you to configure things in a way that a split brain can't happen. ^o^
> > >
> > > Note that given the right circumstances (too many OSDs down, MONs down) Ceph can wind up in a similar state.
> > >
> > I am not sure what you mean by Ceph winding up in a similar state. If you mean 'split brain' in the usual sense of the term, it does not occur in Ceph. If it does, you have surely found a bug and you should let us know with lots of CAPS.
> >
> > What you can incur, though, if you have too many monitors down, is cluster downtime. The monitors will ensure you need a strict majority of monitors up in order to operate the cluster, and will not serve requests if said majority is not in place. The monitors will only serve requests when there is a formed 'quorum', and a quorum is only formed by (N/2)+1 monitors, N being the total number of monitors in the cluster (via the monitor map -- monmap).
> >
> > This said, if out of 3 monitors you have 2 monitors down, your cluster will cease functioning (no admin commands, no writes or reads served). As there is no configuration in which you can have two strict majorities, and thus no two partitions of the cluster can function at the same time, you do not incur split brain.
> >
> I wrote "similar state", not "same state".
>
> From a user perspective it is purely semantics how and why your shared storage has seized up; the end result is the same.
>
> And yes, that MON example was exactly what I was aiming for: your cluster might still have all the data (another potential failure mode, of course), but it is inaccessible.
>
> DRBD will see and call it a split brain, Ceph will call it a Paxos voting failure; it doesn't matter one iota to the poor sod relying on that particular storage.
>
> My point was and is: when you design a cluster of whatever flavor, make sure you understand how it can (and WILL) fail, how to prevent that from happening if at all possible, and how to recover from it if not.
>
> Potentially (hopefully) in the case of Ceph it would just be a matter of getting a missing MON back up. But given that the failed MON might have a corrupted leveldb (it happened to me), that would put Robert back at square one, as in, a highly qualified engineer has to deal with the issue. I.e. somebody who can say "screw this dead MON, let's get a new one in" and is capable of doing so.
>
> Regards,
>
> Christian
>
> > If you are a creative admin, however, you may be able to force a split brain by modifying monmaps. In the end you'd obviously end up with two distinct monitor clusters, but if you happened not to inform the clients about this, there's a fair chance it would cause havoc with unforeseen effects. Then again, this would be the operator's fault, not Ceph's -- especially because rewriting monitor maps is not trivial enough for someone to do something like this by mistake.
> >
> > -Joao
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com         Global OnLine Japan/Fusion Communications
> http://www.gol.com/
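To put numbers on the (N/2)+1 quorum rule Joao describes above, here's a quick back-of-the-envelope loop (plain shell arithmetic, nothing Ceph-specific):

    # A quorum needs a strict majority: floor(N/2) + 1 monitors.
    for n in 1 2 3 4 5; do
        echo "$n mon(s): quorum needs $(( n / 2 + 1 )), tolerates $(( n - n / 2 - 1 )) down"
    done
    # 3 MONs tolerate 1 failure, 5 tolerate 2; 4 still tolerate only 1,
    # which is why monitor counts are kept odd.

With the 2/2/1 split suggested at the top, losing either room still leaves 3 of 5 monitors -- a majority -- so the surviving room can keep operating.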