Christian - Thank you for the answer. I'll get around to reading 'CRUSH Maps' a few times; it is important to have a good understanding of Ceph's parts.

So another question: as long as I keep the same number of nodes in both rooms, will the Firefly defaults keep the data balanced? If not, I'll stick with 2 in each room until I understand how to configure things.

On Mon, Jul 28, 2014 at 9:19 PM, Christian Balzer <chibi at gol.com> wrote:

> On Mon, 28 Jul 2014 18:11:33 -0400 Robert Fantini wrote:
>
> > "target replication level of 3"
> > "with a min of 1 across the node level"
> >
> > After reading http://ceph.com/docs/master/rados/configuration/ceph-conf/ ,
> > I assume that to accomplish that I should set these in ceph.conf ?
> >
> > osd pool default size = 3
> > osd pool default min size = 1
>
> Not really, the min size specifies how few replicas need to be online
> for Ceph to accept IO.
>
> These settings (the current Firefly defaults) with the default crush map
> will have 3 sets of data spread over 3 OSDs and not use the same node
> (host) more than once.
> So with 2 nodes in each location, a replica will always be in both locations.
> However, if you add more nodes, all of them could wind up in the same
> building.
>
> To prevent this, you have location qualifiers beyond host, and you can
> modify the crush map to enforce that at least one replica is in a
> different rack, row, room, region, etc.
>
> Advanced material, but one really needs to understand this:
> http://ceph.com/docs/master/rados/operations/crush-map/
>
> Christian
>
> > On Mon, Jul 28, 2014 at 2:56 PM, Michael <michael at onlinefusion.co.uk> wrote:
> >
> > > If you've two rooms then I'd go for two OSD nodes in each room, a
> > > target replication level of 3 with a min of 1 across the node level,
> > > then have 5 monitors and put the last monitor outside of either room
> > > (the other MONs can share with the OSD nodes if needed). Then you've
> > > got 'safe' replication for OSD/node replacement on failure, with some
> > > 'shuffle' room for when it's needed, and either room can be down
> > > while the external monitor allows the decisions required for a
> > > single room to keep operating.
> > >
> > > There's no way you can do a 3/2 MON split that doesn't risk the two
> > > nodes being up and unable to serve data while the three are down, so
> > > you'd need to find a way to make it a 2/2/1 split instead.
> > >
> > > -Michael
> > >
> > > On 28/07/2014 18:41, Robert Fantini wrote:
> > >
> > > OK, for higher availability 5 nodes is better than 3, so we'll run 5.
> > > However, we want normal operations with just 2 nodes. Is that possible?
> > >
> > > Eventually 2 nodes will be in the next building 10 feet away, with a
> > > brick wall in between, connected with Infiniband or better. So one
> > > room can go offline and the other will stay on. The flip of the coin
> > > means the 3-node room will probably go down.
> > > All systems will have dual power supplies connected to different UPSes.
> > > In addition we have a power generator; later we'll have a 2nd
> > > generator, and then the UPSes will use different lines attached to
> > > those generators somehow.
> > > Also, of course, we never count on one cluster to have our data. We
> > > have 2 co-locations with backups going to them often, using zfs
> > > send/receive and/or rsync.
> > >
> > > So for the 5 node cluster, how do we set it so 2 nodes up = OK? Or
> > > is that a bad idea?
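For reference, here is a rough sketch of the room-aware placement Christian describes above. The bucket and host names (roomA, nodeA1, and so on) are made up, and the rule follows the Firefly-era syntax from the crush-map documentation, so treat it as an illustration to check against your own decompiled map rather than a drop-in configuration:

    # Hypothetical layout: create room buckets and move the OSD hosts
    # under them (all names below are examples).
    ceph osd crush add-bucket roomA room
    ceph osd crush add-bucket roomB room
    ceph osd crush move roomA root=default
    ceph osd crush move roomB root=default
    ceph osd crush move nodeA1 room=roomA
    ceph osd crush move nodeA2 room=roomA
    ceph osd crush move nodeB1 room=roomB
    ceph osd crush move nodeB2 room=roomB

    # In the decompiled crush map (ceph osd getcrushmap -o map.bin;
    # crushtool -d map.bin -o map.txt), a rule that first picks two rooms
    # and then hosts inside them, so a pool with size 3 always has at
    # least one replica in each room:
    rule replicated_two_rooms {
            ruleset 1
            type replicated
            min_size 2
            max_size 3
            step take default
            step choose firstn 2 type room
            step chooseleaf firstn 2 type host
            step emit
    }

    # Recompile, inject, and point the pool at the new rule:
    # crushtool -c map.txt -o map.new && ceph osd setcrushmap -i map.new
    # ceph osd pool set <pool> crush_ruleset 1

With a rule like that in place, adding more hosts to either room still keeps at least one copy per room, which is what the question about keeping the two rooms balanced is really after.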
> > > PS: any other ideas on how to increase availability are welcome.
> > >
> > > On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer <chibi at gol.com> wrote:
> > >
> > >> On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
> > >>
> > >> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
> > >> > >
> > >> > > Hello,
> > >> > >
> > >> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > >> > >
> > >> > >> Hello Christian,
> > >> > >>
> > >> > >> Let me supply more info and answer some questions.
> > >> > >>
> > >> > >> * Our main concern is high availability, not speed.
> > >> > >> Our storage requirements are not huge.
> > >> > >> However, we want good keyboard response 99.99% of the time. We
> > >> > >> mostly do data entry and reporting: 20-25 users doing mostly
> > >> > >> order and invoice processing and email.
> > >> > >>
> > >> > >> * DRBD has been very reliable, but I am the SPOF. Meaning that
> > >> > >> when split brain occurs [every 18-24 months] it is me or no one
> > >> > >> who knows what to do. Try explaining how to deal with split
> > >> > >> brain in advance.... Going forward, Ceph looks like it will be
> > >> > >> easier to maintain.
> > >> > >>
> > >> > > The DRBD people would of course tell you to configure things in a
> > >> > > way that a split brain can't happen. ^o^
> > >> > >
> > >> > > Note that given the right circumstances (too many OSDs down, MONs
> > >> > > down) Ceph can wind up in a similar state.
> > >> >
> > >> > I am not sure what you mean by Ceph winding up in a similar state.
> > >> > If you mean 'split brain' in the usual sense of the term, it does
> > >> > not occur in Ceph. If it does, you have surely found a bug and you
> > >> > should let us know with lots of CAPS.
> > >> >
> > >> > What you can incur though, if you have too many monitors down, is
> > >> > cluster downtime. The monitors will ensure you need a strict
> > >> > majority of monitors up in order to operate the cluster, and will
> > >> > not serve requests if said majority is not in place. The monitors
> > >> > will only serve requests when there's a formed 'quorum', and a
> > >> > quorum is only formed by (N/2)+1 monitors, N being the total number
> > >> > of monitors in the cluster (via the monitor map -- monmap).
> > >> >
> > >> > This said, if out of 3 monitors you have 2 monitors down, your
> > >> > cluster will cease functioning (no admin commands, no writes or
> > >> > reads served). As there is no configuration in which you can have
> > >> > two strict majorities, no two partitions of the cluster are able to
> > >> > function at the same time, so you do not run into split brain.
> > >>
> > >> I wrote similar state, not "same state".
> > >>
> > >> From a user perspective it is purely semantics how and why your shared
> > >> storage has seized up, the end result is the same.
> > >>
> > >> And yes, that MON example was exactly what I was aiming for: your
> > >> cluster might still have all the data (another potential failure mode
> > >> of course), but is inaccessible.
> > >>
> > >> DRBD will see it and call it a split brain, Ceph will call it a Paxos
> > >> voting failure; it doesn't matter one iota to the poor sod relying on
> > >> that particular storage.
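To put numbers on the quorum rule Joao describes, applied to the 2/2/1 monitor split Michael suggests: with N monitors, a quorum needs floor(N/2)+1 of them. A hypothetical ceph.conf fragment for five monitors laid out that way (monitor names and addresses are invented):

    [global]
    # Two monitors per room plus one outside both rooms.
    # Quorum needs floor(5/2)+1 = 3, so losing either room (2 of 5 down)
    # still leaves a majority and the cluster keeps serving I/O.
    # Losing a room plus the outside monitor (3 of 5 down) stops it,
    # just as losing 2 of 3 monitors does in Joao's example.
    mon initial members = mon-a1, mon-a2, mon-b1, mon-b2, mon-ext
    mon host = 10.0.1.11, 10.0.1.12, 10.0.2.11, 10.0.2.12, 10.0.3.11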
> > >> My point was and is: when you design a cluster of whatever flavor,
> > >> make sure you understand how it can (and WILL) fail, how to prevent
> > >> that from happening if at all possible, and how to recover from it
> > >> if not.
> > >>
> > >> Potentially (hopefully) in the case of Ceph it would just be a matter
> > >> of getting the missing MON back up.
> > >> But given that the failed MON might have a corrupted leveldb (it
> > >> happened to me), that will put Robert back at square one, as in, a
> > >> highly qualified engineer has to deal with the issue.
> > >> I.e. somebody who can say "screw this dead MON, let's get a new one
> > >> in" and is capable of doing so.
> > >>
> > >> Regards,
> > >>
> > >> Christian
> > >>
> > >> > If you are a creative admin, however, you may be able to force a
> > >> > split brain by modifying monmaps. In the end you'd obviously end
> > >> > up with two distinct monitor clusters, but if you so happened to
> > >> > not inform the clients about this there's a fair chance that it
> > >> > would cause havoc with unforeseen effects. Then again, this would
> > >> > be the operator's fault, not Ceph's -- especially because
> > >> > rewriting monitor maps is not trivial enough for someone to
> > >> > mistakenly do something like this.
> > >> >
> > >> > -Joao
> > >>
> > >> --
> > >> Christian Balzer        Network/Systems Engineer
> > >> chibi at gol.com        Global OnLine Japan/Fusion Communications
> > >> http://www.gol.com/
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com        Global OnLine Japan/Fusion Communications
> http://www.gol.com/
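As a rough sketch of the "screw this dead MON, let's get a new one in" recovery Christian mentions, based on the add/remove-monitors procedure in the Ceph docs; the monitor name, address, and paths below are placeholders, and the exact steps vary by release, so follow the documentation for your version:

    # Drop the dead monitor from the monmap so it no longer counts
    # toward quorum (monitor name is hypothetical):
    ceph mon remove mon-b2

    # Rebuild it (or a replacement) from the surviving quorum:
    ceph mon getmap -o /tmp/monmap
    ceph auth get mon. -o /tmp/mon.keyring
    ceph-mon -i mon-b2 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add mon-b2 10.0.2.12:6789
    # ...then start the ceph-mon daemon on that host as usual.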