Hello (again),

On Tue, 12 Apr 2016 00:46:29 +0000 Adrian Saul wrote:

> We are close to being given approval to deploy a 3.5PB Ceph cluster
> that will be distributed over every major capital in Australia. The
> config will be dual sites in each city that will be coupled as HA
> pairs - 12 sites in total. The vast majority of CRUSH rules will place
> data either locally to the individual site, or replicated to the other
> HA site in that city. However there are future use cases where I think
> we could use EC to distribute data wider or have some replication that
> puts small data sets across multiple cities.

This will very, very, VERY much depend on the data (use case) in
question.

> All of this will be tied together with a dedicated private IP network.
> 
> The concern I have is around the placement of mons. In the current
> design there would be two monitors in each site, running separate to
> the OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.
> There will also be a "tiebreaker" mon placed on a separate host which
> will house some management infrastructure for the whole platform.
> 
Yes, that's the preferable way; you might want to up this to 5 mons so
you can lose one while doing maintenance on another.
But if that were a coupled, national cluster you'd be looking at
significant MON traffic, interesting "split-brain" scenarios and
latencies as well (MONs get chosen randomly by clients AFAIK).

> Obviously a concern is latency - the east coast to west coast latency
> is around 50ms, and on the east coast it is 12ms between Sydney and
> the other two sites, and 24ms Melbourne to Brisbane.

In any situation other than "write speed doesn't matter at all" combined
with "large writes, not small ones" and "read-mostly" you're going to be
in severe pain.

> Most of the data traffic will remain local, but if we create a single
> national cluster then how much of an impact will it be having all the
> mons needing to keep in sync, as well as monitor and communicate with
> all OSDs (in the end goal design there will be some 2300+ OSDs).
> 
Significant.
I wouldn't recommend it, but even if you deploy differently I'd suggest
a test run/setup and sharing the experience with us. ^.^

> The other options I am considering:
> - split into east and west coast clusters, most of the cross city need
>   is in the east coast, any data moves between clusters can be done
>   with snap replication
> - city based clusters (tightest latency) but lose the multi-DC EC
>   option, do cross city replication using snapshots
> 
The latter; I seem to remember that there was work in progress to do
this (snapshot replication) in an automated fashion.

> Just want to get a feel for what I need to consider when we start
> building at this scale.
> 
I know you're set on iSCSI/NFS (have you worked out the iSCSI kinks?),
but the only well known/supported way to do geo-replication with Ceph
is via RGW.

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
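
To put some rough numbers on the "5 mons" point above: quorum needs a
strict majority, so how the mons are spread across cities matters as
much as how many there are. A small sketch of that arithmetic (the
layouts and site names are made up for illustration, not taken from the
design described above):

# Quorum arithmetic for a few hypothetical mon layouts.
# Site names and counts are illustrative only.

layouts = {
    "3 mons (2x SYD, 1x MEL)": {"syd": 2, "mel": 1},
    "5 mons (2x SYD, 2x MEL, 1x BNE)": {"syd": 2, "mel": 2, "bne": 1},
}

for name, mons in sorted(layouts.items()):
    total = sum(mons.values())
    quorum = total // 2 + 1          # strict majority
    print("%s: quorum=%d, tolerates %d mon failure(s)"
          % (name, quorum, total - quorum))
    # A whole-site outage is only survivable if the remaining sites
    # still hold a majority of the mons.
    for site, count in sorted(mons.items()):
        ok = (total - count) >= quorum
        print("  losing %s: %s"
              % (site, "quorum holds" if ok else "no quorum"))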
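
On the latency numbers above: a replicated write is only acknowledged
once the primary OSD has heard back from every other replica, so each
cross-city copy adds at least one inter-city round trip on top of the
local write. A back-of-the-envelope floor using the RTTs quoted above
(the ~0.5ms intra-site figure is an assumption, and journal/disk commit
and queueing time are ignored):

# Very rough lower bound on write latency for a replicated pool.
# Intra-site RTT is assumed; inter-city RTTs are the ones quoted above.

RTT_MS = {
    "intra-site": 0.5,   # assumption
    "syd-mel": 12,
    "syd-bne": 12,
    "mel-bne": 24,
    "east-west": 50,
}

def write_floor(client_to_primary, primary_to_replicas):
    """Client sends to the primary, which fans out to the other replicas
    in parallel and acks the client once the slowest one has replied."""
    return client_to_primary + max(primary_to_replicas or [0])

# size=2, both copies within the local site:
print(write_floor(RTT_MS["intra-site"], [RTT_MS["intra-site"]]))  # ~1 ms
# size=2, second copy in another east-coast city:
print(write_floor(RTT_MS["intra-site"], [RTT_MS["syd-mel"]]))     # ~12.5 ms
# any replica (or EC shard) on the other coast:
print(write_floor(RTT_MS["intra-site"], [RTT_MS["east-west"]]))   # ~50.5 ms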
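
On the snapshot-replication option: the usual building blocks are
"rbd snap create", "rbd export-diff" and "rbd import-diff", shipping
only the delta between two snapshots to the remote cluster. A minimal
sketch of how that could be scripted (the pool, image and cluster names
are placeholders, and snapshot rotation, locking and error handling are
left out):

#!/usr/bin/env python
"""Ship an incremental RBD snapshot from the local cluster to a remote one.

Sketch only: pool/image/cluster names are placeholders and a real tool
needs snapshot rotation, locking and proper error handling.
"""
import subprocess
from datetime import datetime

POOL = "rbd"
IMAGE = "vol01"
REMOTE = "perth"      # expects /etc/ceph/perth.conf (+ keyring) locally


def replicate(prev_snap=None):
    spec = "%s/%s" % (POOL, IMAGE)
    snap = datetime.utcnow().strftime("rep-%Y%m%dT%H%M%S")

    # 1. Freeze a point in time on the source image.
    subprocess.check_call(["rbd", "snap", "create", "%s@%s" % (spec, snap)])

    # 2. Export the delta since the last replicated snapshot (a full diff
    #    on the first run) and pipe it straight into the remote cluster.
    export_cmd = ["rbd", "export-diff", "%s@%s" % (spec, snap), "-"]
    if prev_snap:
        export_cmd[2:2] = ["--from-snap", prev_snap]

    exporter = subprocess.Popen(export_cmd, stdout=subprocess.PIPE)
    importer = subprocess.Popen(
        ["rbd", "--cluster", REMOTE, "import-diff", "-", spec],
        stdin=exporter.stdout)
    exporter.stdout.close()

    if importer.wait() != 0 or exporter.wait() != 0:
        raise RuntimeError("replication of %s@%s failed" % (spec, snap))
    return snap


if __name__ == "__main__":
    replicate()

The target image (and the --from-snap snapshot) has to exist on the
remote side already; in practice you seed it once with a full
export/import and then loop the incremental step, pruning old snapshots
as you go.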