Hello (again),

On Tue, 12 Apr 2016 00:46:29 +0000 Adrian Saul wrote:

> We are close to being given approval to deploy a 3.5PB Ceph cluster
> that will be distributed over every major capital in Australia. The
> config will be dual sites in each city that will be coupled as HA
> pairs - 12 sites in total. The vast majority of CRUSH rules will place
> data either locally to the individual site, or replicated to the other
> HA site in that city. However there are future use cases where I think
> we could use EC to distribute data wider or have some replication that
> puts small data sets across multiple cities.

This will very, very, VERY much depend on the data (use case) in
question.

> All of this will be tied together with a dedicated private IP network.
> 
> The concern I have is around the placement of mons. In the current
> design there would be two monitors in each site, running separate to
> the OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.
> There will also be a "tiebreaker" mon placed on a separate host which
> will house some management infrastructure for the whole platform.
> 
Yes, that's the preferable way; you might want to up this to 5 mons so
you can lose one while doing maintenance on another.
But if that were a coupled, national cluster you'd be looking at
significant MON traffic, interesting "split-brain" scenarios and
latencies as well (MONs get chosen randomly by clients AFAIK).

> Obviously a concern is latency - the east coast to west coast latency
> is around 50ms, and on the east coast it is 12ms between Sydney and
> the other two sites, and 24ms Melbourne to Brisbane.

In any situation other than "write speed doesn't matter at all" combined
with "large writes, not small ones" and "read-mostly" you're going to be
in severe pain.

> Most of the data traffic will remain local, but if we create a single
> national cluster then how much of an impact will it be having all the
> mons needing to keep in sync, as well as monitor and communicate with
> all OSDs (in the end goal design there will be some 2300+ OSDs).
> 
Significant.
I wouldn't recommend it, but even if you deploy differently I'd suggest
a test run/setup and sharing the experience with us. ^.^

> The other options I am considering:
> - split into east and west coast clusters, most of the cross city need
>   is in the east coast, any data moves between clusters can be done
>   with snap replication
> - city based clusters (tightest latency) but lose the multi-DC EC
>   option, do cross city replication using snapshots
> 
The latter; I seem to remember that there was work in progress to do
this (snapshot replication) in an automated fashion.

> Just want to get a feel for what I need to consider when we start
> building at this scale.
> 
I know you're set on iSCSI/NFS (have you worked out the iSCSI kinks?),
but the only well known/supported way to do geo-replication with Ceph
is via RGW.

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
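
To put some rough numbers on the "5 mons" point above: quorum needs a
strict majority, so how the mons are spread across cities matters as
much as how many there are. A small sketch of that arithmetic (the
layouts and site names are made up for illustration, not taken from the
design described above):

# Quorum arithmetic for a few hypothetical mon layouts.
# Site names and counts are illustrative only.

layouts = {
    "3 mons (2x SYD, 1x MEL)": {"syd": 2, "mel": 1},
    "5 mons (2x SYD, 2x MEL, 1x BNE)": {"syd": 2, "mel": 2, "bne": 1},
}

for name, mons in sorted(layouts.items()):
    total = sum(mons.values())
    quorum = total // 2 + 1          # strict majority
    print("%s: quorum=%d, tolerates %d mon failure(s)"
          % (name, quorum, total - quorum))
    # A whole-site outage is only survivable if the remaining sites
    # still hold a majority of the mons.
    for site, count in sorted(mons.items()):
        ok = (total - count) >= quorum
        print("  losing %s: %s"
              % (site, "quorum holds" if ok else "no quorum"))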
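
On the latency numbers above: a replicated write is only acknowledged
once the primary OSD has heard back from every other replica, so each
cross-city copy adds at least one inter-city round trip on top of the
local write. A back-of-the-envelope floor using the RTTs quoted above
(the ~0.5ms intra-site figure is an assumption, and journal/disk commit
and queueing time are ignored):

# Very rough lower bound on write latency for a replicated pool.
# Intra-site RTT is assumed; inter-city RTTs are the ones quoted above.

RTT_MS = {
    "intra-site": 0.5,   # assumption
    "syd-mel": 12,
    "syd-bne": 12,
    "mel-bne": 24,
    "east-west": 50,
}

def write_floor(client_to_primary, primary_to_replicas):
    """Client sends to the primary, which fans out to the other replicas
    in parallel and acks the client once the slowest one has replied."""
    return client_to_primary + max(primary_to_replicas or [0])

# size=2, both copies within the local site:
print(write_floor(RTT_MS["intra-site"], [RTT_MS["intra-site"]]))  # ~1 ms
# size=2, second copy in another east-coast city:
print(write_floor(RTT_MS["intra-site"], [RTT_MS["syd-mel"]]))     # ~12.5 ms
# any replica (or EC shard) on the other coast:
print(write_floor(RTT_MS["intra-site"], [RTT_MS["east-west"]]))   # ~50.5 ms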
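
On the snapshot-replication option: the usual building blocks are
"rbd snap create", "rbd export-diff" and "rbd import-diff", shipping
only the delta between two snapshots to the remote cluster. A minimal
sketch of how that could be scripted (the pool, image and cluster names
are placeholders, and snapshot rotation, locking and error handling are
left out):

#!/usr/bin/env python
"""Ship an incremental RBD snapshot from the local cluster to a remote one.

Sketch only: pool/image/cluster names are placeholders and a real tool
needs snapshot rotation, locking and proper error handling.
"""
import subprocess
from datetime import datetime

POOL = "rbd"
IMAGE = "vol01"
REMOTE = "perth"      # expects /etc/ceph/perth.conf (+ keyring) locally


def replicate(prev_snap=None):
    spec = "%s/%s" % (POOL, IMAGE)
    snap = datetime.utcnow().strftime("rep-%Y%m%dT%H%M%S")

    # 1. Freeze a point in time on the source image.
    subprocess.check_call(["rbd", "snap", "create", "%s@%s" % (spec, snap)])

    # 2. Export the delta since the last replicated snapshot (a full diff
    #    on the first run) and pipe it straight into the remote cluster.
    export_cmd = ["rbd", "export-diff", "%s@%s" % (spec, snap), "-"]
    if prev_snap:
        export_cmd[2:2] = ["--from-snap", prev_snap]

    exporter = subprocess.Popen(export_cmd, stdout=subprocess.PIPE)
    importer = subprocess.Popen(
        ["rbd", "--cluster", REMOTE, "import-diff", "-", spec],
        stdin=exporter.stdout)
    exporter.stdout.close()

    if importer.wait() != 0 or exporter.wait() != 0:
        raise RuntimeError("replication of %s@%s failed" % (spec, snap))
    return snap


if __name__ == "__main__":
    replicate()

The target image (and the --from-snap snapshot) has to exist on the
remote side already; in practice you seed it once with a full
export/import and then loop the incremental step, pruning old snapshots
as you go.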