Christian - Thank you for the answer. I'll get around to reading 'CRUSH Maps' a few times; it is important to have a good understanding of Ceph's parts.

So another question: as long as I keep the same number of nodes in both rooms, will the Firefly defaults keep the data balanced? If not, I'll stick with 2 in each room until I understand how to configure things.

On Mon, Jul 28, 2014 at 9:19 PM, Christian Balzer <chibi at gol.com> wrote:

> On Mon, 28 Jul 2014 18:11:33 -0400 Robert Fantini wrote:
>
> > "target replication level of 3"
> > "with a min of 1 across the node level"
> >
> > After reading http://ceph.com/docs/master/rados/configuration/ceph-conf/ ,
> > I assume that to accomplish that I should set these in ceph.conf ?
> >
> > osd pool default size = 3
> > osd pool default min size = 1
>
> Not really, the min size specifies how few replicas need to be online
> for Ceph to accept IO.
>
> These settings (the current Firefly defaults) with the default crush map
> will have 3 sets of data spread over 3 OSDs and not use the same node
> (host) more than once.
> So with 2 nodes in each location, a replica will always be in both locations.
> However, if you add more nodes, all of them could wind up in the same
> building.
>
> To prevent this, you have location qualifiers beyond host, and you can
> modify the crush map to enforce that at least one replica is in a
> different rack, row, room, region, etc.
>
> Advanced material, but one really needs to understand this:
> http://ceph.com/docs/master/rados/operations/crush-map/
>
> Christian
>
> > On Mon, Jul 28, 2014 at 2:56 PM, Michael <michael at onlinefusion.co.uk> wrote:
> >
> > > If you've two rooms then I'd go for two OSD nodes in each room, a
> > > target replication level of 3 with a min of 1 across the node level,
> > > then have 5 monitors and put the last monitor outside of either room
> > > (the other MONs can share with the OSD nodes if needed). Then you've
> > > got 'safe' replication for OSD/node replacement on failure, with some
> > > 'shuffle' room for when it's needed, and either room can be down
> > > while the external monitor allows the decisions required for a
> > > single room to keep operating.
> > >
> > > There's no way you can do a 3/2 MON split that doesn't risk the two
> > > nodes being up and unable to serve data while the three are down, so
> > > you'd need to find a way to make it a 2/2/1 split instead.
> > >
> > > -Michael
> > >
> > > On 28/07/2014 18:41, Robert Fantini wrote:
> > >
> > > OK, for higher availability 5 nodes is better than 3, so we'll run 5.
> > > However, we want normal operations with just 2 nodes. Is that possible?
> > >
> > > Eventually 2 nodes will be in the next building 10 feet away, with a
> > > brick wall in between, connected with Infiniband or better. So one
> > > room can go offline and the other will stay on. The flip of the coin
> > > means the 3-node room will probably go down.
> > > All systems will have dual power supplies connected to different UPSes.
> > > In addition we have a power generator; later we'll have a 2nd
> > > generator, and then the UPSes will use different lines attached to
> > > those generators somehow.
> > > Also, of course, we never count on one cluster to have our data. We
> > > have 2 co-locations with backups going to them often, using zfs
> > > send/receive and/or rsync.
> > >
> > > So for the 5 node cluster, how do we set it so 2 nodes up = OK? Or
> > > is that a bad idea?
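For reference, here is a rough sketch of the room-aware placement Christian describes above. The bucket and host names (roomA, nodeA1, and so on) are made up, and the rule follows the Firefly-era syntax from the crush-map documentation, so treat it as an illustration to check against your own decompiled map rather than a drop-in configuration:

    # Hypothetical layout: create room buckets and move the OSD hosts
    # under them (all names below are examples).
    ceph osd crush add-bucket roomA room
    ceph osd crush add-bucket roomB room
    ceph osd crush move roomA root=default
    ceph osd crush move roomB root=default
    ceph osd crush move nodeA1 room=roomA
    ceph osd crush move nodeA2 room=roomA
    ceph osd crush move nodeB1 room=roomB
    ceph osd crush move nodeB2 room=roomB

    # In the decompiled crush map (ceph osd getcrushmap -o map.bin;
    # crushtool -d map.bin -o map.txt), a rule that first picks two rooms
    # and then hosts inside them, so a pool with size 3 always has at
    # least one replica in each room:
    rule replicated_two_rooms {
            ruleset 1
            type replicated
            min_size 2
            max_size 3
            step take default
            step choose firstn 2 type room
            step chooseleaf firstn 2 type host
            step emit
    }

    # Recompile, inject, and point the pool at the new rule:
    # crushtool -c map.txt -o map.new && ceph osd setcrushmap -i map.new
    # ceph osd pool set <pool> crush_ruleset 1

With a rule like that in place, adding more hosts to either room still keeps at least one copy per room, which is what the question about keeping the two rooms balanced is really after.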
> > > PS: any other ideas on how to increase availability are welcome.
> > >
> > > On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer <chibi at gol.com> wrote:
> > >
> > >> On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
> > >>
> > >> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
> > >> > >
> > >> > > Hello,
> > >> > >
> > >> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > >> > >
> > >> > >> Hello Christian,
> > >> > >>
> > >> > >> Let me supply more info and answer some questions.
> > >> > >>
> > >> > >> * Our main concern is high availability, not speed.
> > >> > >> Our storage requirements are not huge.
> > >> > >> However, we want good keyboard response 99.99% of the time. We
> > >> > >> mostly do data entry and reporting: 20-25 users doing mostly
> > >> > >> order and invoice processing and email.
> > >> > >>
> > >> > >> * DRBD has been very reliable, but I am the SPOF. Meaning that
> > >> > >> when split brain occurs [every 18-24 months] it is me or no one
> > >> > >> who knows what to do. Try explaining how to deal with split
> > >> > >> brain in advance.... Going forward, Ceph looks like it will be
> > >> > >> easier to maintain.
> > >> > >>
> > >> > > The DRBD people would of course tell you to configure things in a
> > >> > > way that a split brain can't happen. ^o^
> > >> > >
> > >> > > Note that given the right circumstances (too many OSDs down, MONs
> > >> > > down) Ceph can wind up in a similar state.
> > >> >
> > >> > I am not sure what you mean by Ceph winding up in a similar state.
> > >> > If you mean 'split brain' in the usual sense of the term, it does
> > >> > not occur in Ceph. If it does, you have surely found a bug and you
> > >> > should let us know with lots of CAPS.
> > >> >
> > >> > What you can incur though, if you have too many monitors down, is
> > >> > cluster downtime. The monitors will ensure you need a strict
> > >> > majority of monitors up in order to operate the cluster, and will
> > >> > not serve requests if said majority is not in place. The monitors
> > >> > will only serve requests when there's a formed 'quorum', and a
> > >> > quorum is only formed by (N/2)+1 monitors, N being the total number
> > >> > of monitors in the cluster (via the monitor map -- monmap).
> > >> >
> > >> > This said, if out of 3 monitors you have 2 monitors down, your
> > >> > cluster will cease functioning (no admin commands, no writes or
> > >> > reads served). As there is no configuration in which you can have
> > >> > two strict majorities, no two partitions of the cluster are able to
> > >> > function at the same time, so you do not run into split brain.
> > >>
> > >> I wrote similar state, not "same state".
> > >>
> > >> From a user perspective it is purely semantics how and why your shared
> > >> storage has seized up, the end result is the same.
> > >>
> > >> And yes, that MON example was exactly what I was aiming for: your
> > >> cluster might still have all the data (another potential failure mode
> > >> of course), but is inaccessible.
> > >>
> > >> DRBD will see it and call it a split brain, Ceph will call it a Paxos
> > >> voting failure; it doesn't matter one iota to the poor sod relying on
> > >> that particular storage.
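To put numbers on the quorum rule Joao describes, applied to the 2/2/1 monitor split Michael suggests: with N monitors, a quorum needs floor(N/2)+1 of them. A hypothetical ceph.conf fragment for five monitors laid out that way (monitor names and addresses are invented):

    [global]
    # Two monitors per room plus one outside both rooms.
    # Quorum needs floor(5/2)+1 = 3, so losing either room (2 of 5 down)
    # still leaves a majority and the cluster keeps serving I/O.
    # Losing a room plus the outside monitor (3 of 5 down) stops it,
    # just as losing 2 of 3 monitors does in Joao's example.
    mon initial members = mon-a1, mon-a2, mon-b1, mon-b2, mon-ext
    mon host = 10.0.1.11, 10.0.1.12, 10.0.2.11, 10.0.2.12, 10.0.3.11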
> > >> My point was and is: when you design a cluster of whatever flavor,
> > >> make sure you understand how it can (and WILL) fail, how to prevent
> > >> that from happening if at all possible, and how to recover from it
> > >> if not.
> > >>
> > >> Potentially (hopefully) in the case of Ceph it would just be a matter
> > >> of getting the missing MON back up.
> > >> But given that the failed MON might have a corrupted leveldb (it
> > >> happened to me), that will put Robert back at square one, as in, a
> > >> highly qualified engineer has to deal with the issue.
> > >> I.e. somebody who can say "screw this dead MON, let's get a new one
> > >> in" and is capable of doing so.
> > >>
> > >> Regards,
> > >>
> > >> Christian
> > >>
> > >> > If you are a creative admin, however, you may be able to force a
> > >> > split brain by modifying monmaps. In the end you'd obviously end
> > >> > up with two distinct monitor clusters, but if you so happened to
> > >> > not inform the clients about this there's a fair chance that it
> > >> > would cause havoc with unforeseen effects. Then again, this would
> > >> > be the operator's fault, not Ceph's -- especially because
> > >> > rewriting monitor maps is not trivial enough for someone to
> > >> > mistakenly do something like this.
> > >> >
> > >> > -Joao
> > >>
> > >> --
> > >> Christian Balzer        Network/Systems Engineer
> > >> chibi at gol.com        Global OnLine Japan/Fusion Communications
> > >> http://www.gol.com/
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com        Global OnLine Japan/Fusion Communications
> http://www.gol.com/
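As a rough sketch of the "screw this dead MON, let's get a new one in" recovery Christian mentions, based on the add/remove-monitors procedure in the Ceph docs; the monitor name, address, and paths below are placeholders, and the exact steps vary by release, so follow the documentation for your version:

    # Drop the dead monitor from the monmap so it no longer counts
    # toward quorum (monitor name is hypothetical):
    ceph mon remove mon-b2

    # Rebuild it (or a replacement) from the surviving quorum:
    ceph mon getmap -o /tmp/monmap
    ceph auth get mon. -o /tmp/mon.keyring
    ceph-mon -i mon-b2 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add mon-b2 10.0.2.12:6789
    # ...then start the ceph-mon daemon on that host as usual.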