Hello again Christian :)

> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
> > will be distributed over every major capital in Australia. The config
> > will be dual sites in each city that will be coupled as HA pairs - 12
> > sites in total. The vast majority of CRUSH rules will place data
> > either locally to the individual site, or replicated to the other HA
> > site in that city. However there are future use cases where I think we
> > could use EC to distribute data wider or have some replication that puts
> > small data sets across multiple cities.
> This will very, very, VERY much depend on the data (use case) in question.

The EC use case would use RGW and act as an archival backup store.

> > The concern I have is around the placement of mons. In the current
> > design there would be two monitors in each site, running separate to the
> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways. There
> > will also be a "tiebreaker" mon placed on a separate host which will
> > house some management infrastructure for the whole platform.
> >
> Yes, that's the preferable way, might want to up this to 5 mons so you can
> lose one while doing maintenance on another one.
> But if that would be a coupled, national cluster you're looking both at
> significant MON traffic, interesting "split-brain" scenarios and latencies as
> well (MONs get chosen randomly by clients AFAIK).

In the case I am setting up it would be 2 per site plus the extra, so 25
mons - but I fear that would make the mon syncing too heavy. Once we
build up to multiple sites, though, we can maybe drop to one per site to
reduce the workload of keeping the mons in sync.

> > Obviously a concern is latency - the east coast to west coast latency
> > is around 50ms, and on the east coast it is 12ms between Sydney and
> > the other two sites, and 24ms Melbourne to Brisbane.
> In any situation other than "write speed doesn't matter at all" combined with
> "large writes, not small ones" and "read-mostly" you're going to be in severe
> pain.

For data, yes, but the main case for that would be backup data, where it
would be large writes, read rarely, and as long as streaming performance
keeps up, latency won't matter. My concern with the latency is how it
affects the monitors having to keep in sync and how that in turn affects
client operations, especially with the rate of change that would occur
with the predominant RBD use in most sites.

> > Most of the data
> > traffic will remain local but if we create a single national cluster
> > then how much of an impact will it be having all the mons needing to
> > keep in sync, as well as monitor and communicate with all OSDs (in the
> > end goal design there will be some 2300+ OSDs).
> >
> Significant.
> I wouldn't suggest it, but even if you deploy differently I'd suggest a test
> run/setup and sharing the experience with us. ^.^

Someone has to be the canary, right? :)

> > The other options I am considering:
> > - split into east and west coast clusters, most of the cross city need
> > is in the east coast, any data moves between clusters can be done with
> > snap replication
> > - city based clusters (tightest latency) but lose the multi-DC EC
> > option, do cross city replication using snapshots
> >
> The latter, I seem to remember that there was work in progress to do this
> (snapshot replication) in an automated fashion.
>
> > Just want to get a feel for what I need to consider when we start
> > building at this scale.
> >
> I know you're set on iSCSI/NFS (have you worked out the iSCSI kinks?), but
> the only well known/supported way to do geo-replication with Ceph is via
> RGW.

iSCSI is working fairly well.
We have decided not to use Ceph for the latency-sensitive workloads, so
while we are still working to keep latency low, we won't be putting the
heavier IOPS or latency-sensitive workloads onto it until we get a
better feel for how it behaves at scale and can be sure of the
performance.

As above, for the most part we are going to have local site pools
(replicated at the application level), a few metro-replicated pools and
a couple of very small multi-metro replicated pools, with the
geo-redundant EC stuff a future plan. It would just be a shame to lock
the design into a setup that won't let us do some of these wider options
down the track.

Thanks.

Adrian
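
For reference, a rough sketch of the monitor arithmetic discussed in this thread. It assumes the monitors need a strict majority (more than half) to form a quorum. The site counts (12 sites, two mons or one mon per site, plus one tiebreaker) come from the thread; the function names are made up for illustration and are not part of any Ceph tooling.

```python
# Hypothetical helpers for the mon-count discussion; not a Ceph API.

def mon_count(sites: int, mons_per_site: int, tiebreakers: int = 1) -> int:
    """Total monitors for a layout with a fixed number per site plus tiebreakers."""
    return sites * mons_per_site + tiebreakers


def quorum_size(mons: int) -> int:
    """Smallest strict majority of the monitor set (assumed quorum rule)."""
    return mons // 2 + 1


def tolerable_failures(mons: int) -> int:
    """How many monitors can be lost while a majority remains."""
    return mons - quorum_size(mons)


if __name__ == "__main__":
    sites = 12  # two sites in each of six cities, per the thread
    for per_site in (2, 1):
        total = mon_count(sites, per_site)
        print(
            f"{per_site} mon(s) per site: {total} mons, "
            f"quorum needs {quorum_size(total)}, "
            f"can lose {tolerable_failures(total)}"
        )
```

Under these assumptions the two-per-site layout gives 25 mons with a quorum of 13, and the one-per-site layout gives 13 mons with a quorum of 7. The sketch says nothing about sync traffic or latency between sites, which is the real open question above.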