anti-cephalopod question

robertfantini@xxxxxxxxx (Robert Fantini) · Mon, 28 Jul 2014 18:11:33 -0400

"target replication level of 3"
" with a min of 1 across the node level"

After reading http://ceph.com/docs/master/rados/configuration/ceph-conf/
,   I assume that to accomplish that then set these in ceph.conf   ?

osd pool default size = 3
osd pool default min size = 1

On Mon, Jul 28, 2014 at 2:56 PM, Michael <michael at onlinefusion.co.uk> wrote:

>  If you've two rooms then I'd go for two OSD nodes in each room, a target
> replication level of 3 with a min of 1 across the node level, then have 5
> monitors and put the last monitor outside of either room (The other MON's
> can share with the OSD nodes if needed). Then you've got 'safe' replication
> for OSD/node replacement on failure with some 'shuffle' room for when it's
> needed and either room can be down while the external last monitor allows
> the decisions required to allow a single room to operate.
>
> There's no way you can do a 3/2 MON split that doesn't risk the two nodes
> being up and unable to serve data while the three are down so you'd need to
> find a way to make it a 2/2/1 split instead.
>
> -Michael
>
>
> On 28/07/2014 18:41, Robert Fantini wrote:
>
>  OK for higher availability then  5 nodes is better then 3 .  So we'll
> run 5 .  However we want normal operations with just 2 nodes.   Is that
> possible?
>
>  Eventually 2 nodes will be next building 10 feet away , with a brick wall
> in between.  Connected with Infiniband or better. So one room can go off
> line the other will be on.   The flip of the coin means the 3 node room
> will probably go down.
>  All systems will have dual power supplies connected to different UPS'.
> In addition we have a power generator. Later we'll have a 2-nd generator.
> and then  the UPS's will use different lines attached to those generators
> somehow..
> Also of course we never count on one  cluster  to have our data.  We have
> 2  co-locations with backup going to often using zfs send receive and or
> rsync .
>
>  So for the 5 node cluster,  how do we set it so 2 nodes up = OK ?   Or
> is that a bad idea?
>
>
>  PS:  any other idea on how to increase availability are welcome .
>
>
>
>
>
>
>
>
> On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer <chibi at gol.com> wrote:
>
>>  On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
>>
>> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
>> > >
>> > > Hello,
>> > >
>> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
>> > >
>> > >> Hello Christian,
>> > >>
>> > >> Let me supply more info and answer some questions.
>> > >>
>> > >> * Our main concern is high availability, not speed.
>> > >> Our storage requirements are not huge.
>> > >> However we want good keyboard response 99.99% of the time.   We
>> > >> mostly do data entry and reporting.   20-25  users doing mostly
>> > >> order , invoice processing and email.
>> > >>
>> > >> * DRBD has been very reliable , but I am the SPOF .   Meaning that
>> > >> when split brain occurs [ every 18-24 months ] it is me or no one who
>> > >> knows what to do. Try to explain how to deal with split brain in
>> > >> advance.... For the future ceph looks like it will be easier to
>> > >> maintain.
>> > >>
>> > > The DRBD people would of course tell you to configure things in a way
>> > > that a split brain can't happen. ^o^
>> > >
>> > > Note that given the right circumstances (too many OSDs down, MONs
>> down)
>> > > Ceph can wind up in a similar state.
>> >
>> >
>> > I am not sure what you mean by ceph winding up in a similar state.  If
>> > you mean regarding 'split brain' in the usual sense of the term, it does
>> > not occur in Ceph.  If it does, you have surely found a bug and you
>> > should let us know with lots of CAPS.
>> >
>> > What you can incur though if you have too many monitors down is cluster
>> > downtime.  The monitors will ensure you need a strict majority of
>> > monitors up in order to operate the cluster, and will not serve requests
>> > if said majority is not in place.  The monitors will only serve requests
>> > when there's a formed 'quorum', and a quorum is only formed by (N/2)+1
>> > monitors, N being the total number of monitors in the cluster (via the
>> > monitor map -- monmap).
>> >
>> > This said, if out of 3 monitors you have 2 monitors down, your cluster
>> > will cease functioning (no admin commands, no writes or reads served).
>> > As there is no configuration in which you can have two strict
>> > majorities, thus no two partitions of the cluster are able to function
>> > at the same time, you do not incur in split brain.
>> >
>>  I wrote similar state, not "same state".
>>
>> From a user perspective it is purely semantics how and why your shared
>> storage has seized up, the end result is the same.
>>
>> And yes, that MON example was exactly what I was aiming for, your cluster
>> might still have all the data (another potential failure mode of cause),
>> but is inaccessible.
>>
>> DRBD will see and call it a split brain, Ceph will call it a Paxos voting
>> failure, it doesn't matter one iota to the poor sod relying on that
>> particular storage.
>>
>> My point was and is, when you design a cluster of whatever flavor, make
>> sure you understand how it can (and WILL) fail, how to prevent that from
>> happening if at all possible and how to recover from it if not.
>>
>> Potentially (hopefully) in the case of Ceph it would be just to get a
>> missing MON back up.
>> But given that the failed MON might have a corrupted leveldb (it happened
>> to me) will put Robert back into square one, as in, a highly qualified
>> engineer has to deal with the issue.
>> I.e somebody who can say "screw this dead MON, lets get a new one in" and
>> is capable of doing so.
>>
>> Regards,
>>
>> Christian
>>
>> > If you are a creative admin however, you may be able to enforce split
>> > brain by modifying monmaps.  In the end you'd obviously end up with two
>> > distinct monitor clusters, but if you so happened to not inform the
>> > clients about this there's a fair chance that it would cause havoc with
>> > unforeseen effects.  Then again, this would be the operator's fault, not
>> > Ceph itself -- especially because rewriting monitor maps is not trivial
>> > enough for someone to mistakenly do something like this.
>> >
>> >    -Joao
>> >
>> >
>>
>>
>> --
>>  Christian Balzer        Network/Systems Engineer
>> chibi at gol.com           Global OnLine Japan/Fusion Communications
>> http://www.gol.com/
>>  _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> _______________________________________________
> ceph-users mailing listceph-users at lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20140728/8426862a/attachment.htm>