On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger <w.deborger@xxxxxxxxx> wrote:
> Hi all,
>
> I have found on the mailing list that it should be possible to have a
> multi-datacenter setup, if latency is low enough.
>
> I would like to set this up so that each datacenter has at least two
> replicas and each PG has a replication level of 3.
>
> In this mail, it is suggested that I should use the following CRUSH rule
> for multi-DC:
>
> rule dc {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type datacenter
>         step emit
> }
>
> This looks suspicious to me, as it will only generate a list of two OSDs
> (and only one OSD if one DC is down).
>
> I think I should use:
>
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take root
>         step choose firstn 2 type datacenter
>         step chooseleaf firstn 2 type host
>         step emit
>         step take root
>         step chooseleaf firstn -4 type host
>         step emit
> }
>
> This correctly generates a list with 2 OSDs in one DC, then 2 OSDs in the
> other, and then a list of further OSDs.
>
> The problem is that this list contains duplicates (e.g. for 8 OSDs per DC):
>
> [13,11,1,8,13,11,16,4,3,7]
> [9,2,13,11,9,15,12,18,3,5]
> [3,5,17,10,3,5,7,13,18,10]
> [7,6,11,14,7,14,3,16,4,11]
> [6,3,15,18,6,3,12,9,16,15]
>
> Will this be a problem?

For replicated pools, it probably will cause trouble. For EC pools I think
it should work fine, but obviously you're losing all kinds of redundancy.
Nothing in the system will do any work to avoid colocating replicas if you
use a rule like this.

Rather than distributing some of the replicas randomly across DCs, you
really just want to split them up evenly across datacenters (or in some
ratio, if one has more space than the other). Given CRUSH's current
abilities that does require building the replication size into the rule,
but such is life. (A sketch of such a rule is appended below.)

> If CRUSH is executed, will it only consider OSDs which are (up, in), or
> will it consider all OSDs in the map and then filter them from the list
> afterwards?

CRUSH will consider all OSDs, but if it selects any OSDs which are out
then it retries until it gets one that is still marked in.
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
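
To make the "build the replication size into the rule" suggestion concrete,
here is a minimal sketch of a rule that places exactly two replicas in each
of two datacenters (pool size 4). It is essentially the first half of
Wouter's rule without the trailing random picks; the rule name, ruleset
number, and the root bucket name "root" are assumptions carried over from
the rules quoted above, so adjust them to your own CRUSH map:

rule two_dc_size4 {
        ruleset 1
        type replicated
        min_size 4
        max_size 4
        # pick two datacenter buckets under the assumed "root" bucket
        step take root
        step choose firstn 2 type datacenter
        # in each chosen datacenter, pick two distinct hosts and one OSD per host
        step chooseleaf firstn 2 type host
        step emit
}

With the pool size set to 4, each PG gets two OSDs in each DC and no
duplicates, because every replica slot is accounted for by the rule itself.
For an uneven split, the usual pattern is one "step take <datacenter
bucket>" / "step chooseleaf ..." / "step emit" sequence per datacenter,
naming each datacenter bucket explicitly.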