anti-cephalopod question

Hello,

On Wed, 30 Jul 2014 05:21:18 -0400 Robert Fantini wrote:

> Christian.
> I'll start out with 4 nodes. I understand re-balancing takes time.
> [ Eventually I'll need to swap out one of the nodes with a host I'm
> using for production, but that'll be on a Saturday afternoon. ]
> 
Your call, but it might not be pretty or short depending on the volume of
data involved.

> 
> However I do not fully get this:
> 
> 
> *"No, the default is to split at host level. So once you have enough
> nodes in one room to fulfill the replication level (3) some PGs will be
> all in that location "*
> 
> *can you please send this:*
> 
> 
> *non default firefly cepf.conf settings for a 4 node  anti-cephalopod
> cluster ?   *
> 
There is nothing special you need to configure for the 4 nodes you
mentioned (2 in each location), provided that each holds only one OSD.

To avoid any unwanted recovery until you've reviewed things, put
"mon osd down out subtree limit = host"
in your ceph.conf, or use any of the other ways discussed to prevent
OSDs from being set out.
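
For reference, a minimal sketch of a matching ceph.conf fragment (the
pool values below are just the Firefly defaults, listed for clarity):

[global]
        # Firefly defaults, shown explicitly
        osd pool default size = 3
        osd pool default min size = 1
        # do not auto-mark OSDs "out" when an entire host is down
        mon osd down out subtree limit = host

At runtime, "ceph osd set noout" (undo with "ceph osd unset noout") has
a similar effect for all OSDs.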

Christian

> I want to start my testing with close-to-ideal ceph settings, then do a
> lot of testing of noout and other things.
> After I'm done I'll document what was done and post it in a few places.
> 
> I appreciate the suggestions you've sent.
> 
> kind regards, rob fantini
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Tue, Jul 29, 2014 at 9:49 PM, Christian Balzer <chibi at gol.com> wrote:
> 
> >
> > Hello,
> >
> > On Tue, 29 Jul 2014 06:33:14 -0400 Robert Fantini wrote:
> >
> > > Christian -
> > >  Thank you for the answer. I'll get around to reading 'Crush Maps'
> > > a few times; it is important to have a good understanding of ceph's
> > > parts.
> > >
> > >  So another question -
> > >
> > >  As long as I keep the same number of nodes in both rooms, will
> > > firefly defaults keep data balanced?
> > >
> > No, the default is to split at host level.
> > So once you have enough nodes in one room to fulfill the replication
> > level (3) some PGs will be all in that location.
> >
> > >
> > > If not, I'll stick with 2 in each room until I understand how to
> > > configure things.
> > >
> > That will work, but I would strongly advise you to get it right from
> > the start, as in, configure the Crush map to your needs (split on
> > room or such).
> >
> > Because if you introduce this change later, your data will be
> > rebalanced...
> >
> > Christian
> >
> > >
> > > On Mon, Jul 28, 2014 at 9:19 PM, Christian Balzer <chibi at gol.com>
> > > wrote:
> > >
> > > >
> > > > On Mon, 28 Jul 2014 18:11:33 -0400 Robert Fantini wrote:
> > > >
> > > > > "target replication level of 3"
> > > > > " with a min of 1 across the node level"
> > > > >
> > > > > After reading
> > > > > http://ceph.com/docs/master/rados/configuration/ceph-conf/ , I
> > > > > assume that to accomplish that I should set these in ceph.conf?
> > > > >
> > > > > osd pool default size = 3
> > > > > osd pool default min size = 1
> > > > >
> > > > Not really; min size specifies the minimum number of replicas
> > > > that must be online for Ceph to accept IO.
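> > > >
> > > > As a sketch, the runtime per-pool equivalents (the pool name
> > > > "rbd" is just an example here):
> > > >
> > > >   ceph osd pool get rbd min_size
> > > >   ceph osd pool set rbd min_size 1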
> > > >
> > > > These settings (the current Firefly defaults) with the default
> > > > crush map will have 3 copies of the data spread over 3 OSDs,
> > > > never using the same node (host) more than once.
> > > > So with 2 nodes in each location, a replica will always be in
> > > > both locations. However, if you add more nodes, all replicas
> > > > could wind up in the same building.
> > > >
> > > > To prevent this, you have location qualifiers beyond host and you
> > > > can modify the crush map to enforce that at least one replica is
> > > > in a different rack, row, room, region, etc.
> > > >
> > > > Advanced material, but one really needs to understand this:
> > > > http://ceph.com/docs/master/rados/operations/crush-map/
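> > > >
> > > > As an illustrative sketch (not verbatim from those docs; the rule
> > > > name and bucket layout are assumptions), a crush rule spreading 3
> > > > replicas across two rooms could look like:
> > > >
> > > >   rule replicated_rooms {
> > > >           ruleset 1
> > > >           type replicated
> > > >           min_size 1
> > > >           max_size 10
> > > >           step take default
> > > >           step choose firstn 2 type room
> > > >           step chooseleaf firstn 2 type host
> > > >           step emit
> > > >   }
> > > >
> > > > With a pool size of 3 this places 2 replicas in one room and 1 in
> > > > the other, so no single room ever holds all copies.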
> > > >
> > > > Christian
> > > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jul 28, 2014 at 2:56 PM, Michael
> > > > > <michael at onlinefusion.co.uk> wrote:
> > > > >
> > > > > >  If you've two rooms then I'd go for two OSD nodes in each
> > > > > > room, a target replication level of 3 with a min of 1 at the
> > > > > > node level, then have 5 monitors and put the last monitor
> > > > > > outside of either room (the other MONs can share with the OSD
> > > > > > nodes if needed). Then you've got 'safe' replication for
> > > > > > OSD/node replacement on failure, with some 'shuffle' room for
> > > > > > when it's needed, and either room can be down while the
> > > > > > external fifth monitor allows the decisions required for a
> > > > > > single room to operate.
> > > > > >
> > > > > > There's no way you can do a 3/2 MON split that doesn't risk
> > > > > > the two nodes being up yet unable to serve data while the
> > > > > > three are down, so you'd need to find a way to make it a
> > > > > > 2/2/1 split instead.
> > > > > >
> > > > > > -Michael
> > > > > >
> > > > > >
> > > > > > On 28/07/2014 18:41, Robert Fantini wrote:
> > > > > >
> > > > > >  OK, for higher availability 5 nodes is better than 3, so
> > > > > > we'll run 5. However, we want normal operations with just 2
> > > > > > nodes. Is that possible?
> > > > > >
> > > > > >  Eventually 2 nodes will be in the next building 10 feet
> > > > > > away, with a brick wall in between, connected with Infiniband
> > > > > > or better. So if one room goes offline the other will stay on.
> > > > > > The flip of the coin means the 3 node room will probably go
> > > > > > down.
> > > > > >  All systems will have dual power supplies connected to
> > > > > > different UPSes. In addition we have a power generator. Later
> > > > > > we'll have a 2nd generator, and then the UPSes will use
> > > > > > different lines attached to those generators somehow.
> > > > > > Also, of course, we never count on one cluster to have our
> > > > > > data. We have 2 co-locations with backups going there often,
> > > > > > using zfs send/receive and/or rsync.
> > > > > >
> > > > > >  So for the 5 node cluster, how do we set it so that 2 nodes
> > > > > > up = OK? Or is that a bad idea?
> > > > > >
> > > > > >
> > > > > >  PS: any other ideas on how to increase availability are
> > > > > > welcome.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer
> > > > > > <chibi at gol.com> wrote:
> > > > > >
> > > > > >>  On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
> > > > > >>
> > > > > >> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
> > > > > >> > >
> > > > > >> > > Hello,
> > > > > >> > >
> > > > > >> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > > > > >> > >
> > > > > >> > >> Hello Christian,
> > > > > >> > >>
> > > > > >> > >> Let me supply more info and answer some questions.
> > > > > >> > >>
> > > > > >> > >> * Our main concern is high availability, not speed.
> > > > > >> > >> Our storage requirements are not huge.
> > > > > >> > >> However we want good keyboard response 99.99% of the
> > > > > >> > >> time. We mostly do data entry and reporting: 20-25
> > > > > >> > >> users doing mostly order and invoice processing, and
> > > > > >> > >> email.
> > > > > >> > >>
> > > > > >> > >> * DRBD has been very reliable, but I am the SPOF.
> > > > > >> > >> Meaning that when split brain occurs [every 18-24
> > > > > >> > >> months] it is me or no one who knows what to do. Try
> > > > > >> > >> explaining how to deal with split brain in advance...
> > > > > >> > >> For the future, ceph looks like it will be easier to
> > > > > >> > >> maintain.
> > > > > >> > >>
> > > > > >> > > The DRBD people would of course tell you to configure
> > > > > >> > > things in a way that a split brain can't happen. ^o^
> > > > > >> > >
> > > > > >> > > Note that given the right circumstances (too many OSDs
> > > > > >> > > down, MONs down) Ceph can wind up in a similar state.
> > > > > >> >
> > > > > >> >
> > > > > >> > I am not sure what you mean by ceph winding up in a similar
> > > > > >> > state. If you mean regarding 'split brain' in the usual
> > > > > >> > sense of the term, it does not occur in Ceph.  If it does,
> > > > > >> > you have surely found a bug and you should let us know with
> > > > > >> > lots of CAPS.
> > > > > >> >
> > > > > >> > What you can incur though if you have too many monitors
> > > > > >> > down is cluster downtime.  The monitors will ensure you
> > > > > >> > need a strict majority of monitors up in order to operate
> > > > > >> > the cluster, and will not serve requests if said majority
> > > > > >> > is not in place.  The monitors will only serve requests
> > > > > >> > when there's a formed 'quorum', and a quorum is only formed
> > > > > >> > by (N/2)+1 monitors, N being the total number of monitors
> > > > > >> > in the cluster (via the monitor map -- monmap).
> > > > > >> >
> > > > > >> > This said, if out of 3 monitors you have 2 monitors down,
> > > > > >> > your cluster will cease functioning (no admin commands, no
> > > > > >> > writes or reads served). As there is no configuration in
> > > > > >> > which you can have two strict majorities, no two partitions
> > > > > >> > of the cluster are able to function at the same time, so
> > > > > >> > you do not incur split brain.
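> > > > > >> >
> > > > > >> > Concretely, quorum = floor(N/2)+1: with 3 monitors you
> > > > > >> > need 2 up (1 may fail); with 5 you need 3 up (2 may fail);
> > > > > >> > 4 monitors still tolerate only 1 failure, which is why odd
> > > > > >> > counts are preferred.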
> > > > > >> >
> > > > > >>  I wrote "similar state", not "same state".
> > > > > >>
> > > > > >> From a user perspective it is purely semantics how and why
> > > > > >> your shared storage has seized up; the end result is the
> > > > > >> same.
> > > > > >>
> > > > > >> And yes, that MON example was exactly what I was aiming for:
> > > > > >> your cluster might still have all the data (another potential
> > > > > >> failure mode, of course), but it is inaccessible.
> > > > > >>
> > > > > >> DRBD will see and call it a split brain, Ceph will call it a
> > > > > >> Paxos voting failure; it doesn't matter one iota to the poor
> > > > > >> sod relying on that particular storage.
> > > > > >>
> > > > > >> My point was and is, when you design a cluster of whatever
> > > > > >> flavor, make sure you understand how it can (and WILL) fail,
> > > > > >> how to prevent that from happening if at all possible and how
> > > > > >> to recover from it if not.
> > > > > >>
> > > > > >> Potentially (hopefully) in the case of Ceph the fix would
> > > > > >> just be to get a missing MON back up.
> > > > > >> But the failed MON might have a corrupted leveldb (it
> > > > > >> happened to me), which would put Robert back at square one,
> > > > > >> as in, a highly qualified engineer has to deal with the
> > > > > >> issue. I.e. somebody who can say "screw this dead MON, let's
> > > > > >> get a new one in" and is capable of doing so.
> > > > > >>
> > > > > >> Regards,
> > > > > >>
> > > > > >> Christian
> > > > > >>
> > > > > >> > If you are a creative admin however, you may be able to
> > > > > >> > enforce split brain by modifying monmaps.  In the end you'd
> > > > > >> > obviously end up with two distinct monitor clusters, but if
> > > > > >> > you so happened to not inform the clients about this
> > > > > >> > there's a fair chance that it would cause havoc with
> > > > > >> > unforeseen effects.  Then again, this would be the
> > > > > >> > operator's fault, not Ceph's -- especially because
> > > > > >> > rewriting monitor maps is not trivial enough for someone to
> > > > > >> > mistakenly do something like this.
> > > > > >> >
> > > > > >> >    -Joao
> > > > > >> >
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >>  Christian Balzer        Network/Systems Engineer
> > > > > >> chibi at gol.com           Global OnLine Japan/Fusion
> > > > > >> Communications http://www.gol.com/
> > > > > >>  _______________________________________________
> > > > > >> ceph-users mailing list
> > > > > >> ceph-users at lists.ceph.com
> > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > ceph-users mailing list
> > > > > > ceph-users at lists.ceph.com
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > >
> > > > > >
> > > >
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi at gol.com           Global OnLine Japan/Fusion Communications
> > > > http://www.gol.com/
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users at lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/

