If you have two rooms then I'd go for two OSD nodes in each room, a target replication level of 3 with a minimum of 1, replicated at the node level, and 5 monitors, with the last monitor outside of either room (the other MONs can share hardware with the OSD nodes if needed). That gives you 'safe' replication for OSD/node replacement on failure, with some 'shuffle' room for when it's needed, and either room can be down while the external fifth monitor supplies the tie-breaking vote that lets a single room keep operating.

There's no way to do a 3/2 MON split that doesn't risk the two surviving monitors being up yet unable to serve data while the other three are down, so you'd need to find a way to make it a 2/2/1 split instead.
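In command form that layout might look roughly like the following. This is only a sketch: the pool name 'rbd' and the room/host bucket names are placeholders, not anything from this thread, and you'd still want a CRUSH rule that puts copies in both rooms rather than just on distinct hosts.

    # Tell CRUSH about the two rooms and move the OSD hosts into them
    # (bucket and host names here are made up):
    ceph osd crush add-bucket room1 room
    ceph osd crush add-bucket room2 room
    ceph osd crush move room1 root=default
    ceph osd crush move room2 root=default
    ceph osd crush move node1 room=room1
    ceph osd crush move node2 room=room1
    ceph osd crush move node3 room=room2
    ceph osd crush move node4 room=room2

    # Replication level 3, but keep serving I/O with only one copy left:
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 1

The min_size of 1 is what makes "normal operations with just 2 nodes" possible, at the cost of running on a single copy for the duration of such an outage.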
-Michael

On 28/07/2014 18:41, Robert Fantini wrote:
> OK, for higher availability 5 nodes is better than 3, so we'll run 5. However, we want normal operations with just 2 nodes up. Is that possible?
>
> Eventually 2 nodes will be in the next building, 10 feet away, with a brick wall in between, connected with InfiniBand or better. So one room can go offline and the other will stay on. The flip of the coin means the 3-node room will probably be the one to go down.
> All systems will have dual power supplies connected to different UPSes. In addition we have a power generator; later we'll add a second generator, and then the UPSes will use different lines attached to those generators somehow.
> Also, of course, we never count on one cluster to hold our data. We have 2 co-locations, with backups going out often using zfs send/receive and/or rsync.
>
> So for the 5-node cluster, how do we set it so that 2 nodes up = OK? Or is that a bad idea?
>
> PS: any other ideas on how to increase availability are welcome.
>
>
> On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer <chibi at gol.com> wrote:
>
> On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
>
> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
> > >
> > > Hello,
> > >
> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > >
> > >> Hello Christian,
> > >>
> > >> Let me supply more info and answer some questions.
> > >>
> > >> * Our main concern is high availability, not speed. Our storage requirements are not huge. However, we want good keyboard response 99.99% of the time. We mostly do data entry and reporting: 20-25 users doing mostly order and invoice processing plus email.
> > >>
> > >> * DRBD has been very reliable, but I am the SPOF. Meaning that when split brain occurs [every 18-24 months] it is me or no one who knows what to do. Try explaining how to deal with split brain in advance... Going forward, Ceph looks like it will be easier to maintain.
> > >>
> > > The DRBD people would of course tell you to configure things in a way that a split brain can't happen. ^o^
> > >
> > > Note that given the right circumstances (too many OSDs down, MONs down) Ceph can wind up in a similar state.
> > >
> > I am not sure what you mean by Ceph winding up in a similar state. If you mean 'split brain' in the usual sense of the term, it does not occur in Ceph. If it does, you have surely found a bug and you should let us know with lots of CAPS.
> >
> > What you can incur, though, if you have too many monitors down, is cluster downtime. The monitors will ensure you need a strict majority of monitors up in order to operate the cluster, and will not serve requests if said majority is not in place. The monitors will only serve requests when there is a formed 'quorum', and a quorum is only formed by (N/2)+1 monitors, N being the total number of monitors in the cluster (via the monitor map -- monmap).
> >
> > This said, if out of 3 monitors you have 2 monitors down, your cluster will cease functioning (no admin commands, no writes or reads served). As there is no configuration in which you can have two strict majorities, and thus no two partitions of the cluster can function at the same time, you do not incur split brain.
> >
> I wrote "similar state", not "same state".
>
> From a user perspective it is purely semantics how and why your shared storage has seized up; the end result is the same.
>
> And yes, that MON example was exactly what I was aiming for: your cluster might still have all the data (another potential failure mode, of course), but it is inaccessible.
>
> DRBD will see and call it a split brain, Ceph will call it a Paxos voting failure; it doesn't matter one iota to the poor sod relying on that particular storage.
>
> My point was and is: when you design a cluster of whatever flavor, make sure you understand how it can (and WILL) fail, how to prevent that from happening if at all possible, and how to recover from it if not.
>
> Potentially (hopefully) in the case of Ceph it would just be a matter of getting a missing MON back up. But given that the failed MON might have a corrupted leveldb (it happened to me), that would put Robert back at square one, as in, a highly qualified engineer has to deal with the issue. I.e. somebody who can say "screw this dead MON, let's get a new one in" and is capable of doing so.
>
> Regards,
>
> Christian
>
> > If you are a creative admin, however, you may be able to force a split brain by modifying monmaps. In the end you'd obviously end up with two distinct monitor clusters, but if you happened not to inform the clients about this, there's a fair chance it would cause havoc with unforeseen effects. Then again, this would be the operator's fault, not Ceph's -- especially because rewriting monitor maps is not trivial enough for someone to do something like this by mistake.
> >
> > -Joao
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com         Global OnLine Japan/Fusion Communications
> http://www.gol.com/
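To put numbers on the (N/2)+1 quorum rule Joao describes above, here's a quick back-of-the-envelope loop (plain shell arithmetic, nothing Ceph-specific):

    # A quorum needs a strict majority: floor(N/2) + 1 monitors.
    for n in 1 2 3 4 5; do
        echo "$n mon(s): quorum needs $(( n / 2 + 1 )), tolerates $(( n - n / 2 - 1 )) down"
    done
    # 3 MONs tolerate 1 failure, 5 tolerate 2; 4 still tolerate only 1,
    # which is why monitor counts are kept odd.

With the 2/2/1 split suggested at the top, losing either room still leaves 3 of 5 monitors -- a majority -- so the surviving room can keep operating.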