Re: multi-datacenter crush map

Ok, so if I understand correctly, for replication level 3 or 4 I would have to use the following rule:

rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take root
    # pick two datacenter buckets under root
    step choose firstn 2 type datacenter
    # then pick two hosts in each chosen DC, one OSD per host
    step chooseleaf firstn 2 type host
    step emit
}
The question I have now is: how will this behave when a DC goes down? (Assume a catastrophic failure: the whole datacenter burns down.)

For example, say I set the pool size to 3 and min_size to 3.
Then, if a DC goes down, CRUSH will only return 2 OSDs for each PG, so everything will hang (same for 4/4 and 4/3).

If I set the pool size to 3 and min_size to 2, it can happen that all surviving copies of a PG end up in one DC (degraded mode). If that DC then goes down as well, the PG will hang.
As far as I know, degraded PGs will still accept writes, so data loss is possible (same for 4/2).
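
(Side note: the min_size and max_size lines inside the crush rule itself are not the pool's min_size. They only declare the range of replica counts the rule is meant to handle, so CRUSH won't select the rule for a pool whose size falls outside that range. For a rule used only at replication 3 or 4 they could be tightened, e.g.:

    min_size 3
    max_size 4

The pool's min_size, which decides when a PG stops serving I/O, is set separately on the pool.)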



I can't seem to find a way around this. What am I missing?


Wouter




On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger <w.deborger@xxxxxxxxx> wrote:
> Hi all,
>
> I have found on the mailing list that it should be possible to have a multi
> datacenter setup, if latency is low enough.
>
> I would like to set this up, so that each datacenter has at least two
> replicas and each PG has a replication level of 3.
>
> In this mail, it is suggested that I should use the following crush map for
> multi DC:
>
> rule dc {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type datacenter
>     step emit
> }
>
> This looks suspicious to me, as it will only generate a list of two OSDs
> (and only one OSD if one DC is down).
>
> I think I should use:
>
> rule replicated_ruleset {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take root
>     step choose firstn 2 type datacenter
>     step chooseleaf firstn 2 type host
>     step emit
>     step take root
>     step chooseleaf firstn -4 type host
>     step emit
> }
>
> This correctly generates a list with 2 OSDs in one DC, then 2 OSDs in the
> other, and then a further list of OSDs.
>
> The problem is that this list contains duplicates (e.g. for 8 OSDs per DC)
>
> [13,11,1,8,13,11,16,4,3,7]
> [9,2,13,11,9,15,12,18,3,5]
> [3,5,17,10,3,5,7,13,18,10]
> [7,6,11,14,7,14,3,16,4,11]
> [6,3,15,18,6,3,12,9,16,15]
>
> Will this be a problem?

For replicated pools, it probably will cause trouble. For EC pools I
think it should work fine, but obviously you're losing all kinds of
redundancy. Nothing in the system will do work to avoid colocating
them if you use a rule like this. Rather than distributing some of the
replicas randomly across DCs, you really just want to split them up
evenly across datacenters (or in some ratio, if one has more space
than the other). Given CRUSH's current abilities that does require
building the replication size into the rule, but such is life.
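
As a sketch of what "building the size into the rule" can look like,
here is a 3-replica rule that takes each datacenter bucket explicitly
(the bucket names dc1 and dc2 are placeholders for whatever the actual
crush map calls them):

rule replicated_3_two_dc {
    ruleset 1
    type replicated
    min_size 3
    max_size 3
    # two copies on separate hosts in the first datacenter
    step take dc1
    step chooseleaf firstn 2 type host
    step emit
    # one copy in the second datacenter
    step take dc2
    step chooseleaf firstn 1 type host
    step emit
}

Losing dc2 leaves two of the three copies, while losing dc1 leaves only
one, so the pool's min_size still decides whether I/O continues.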


> If CRUSH is executed, will it only consider OSDs which are (up, in) or all
> OSDs in the map and then filter them from the list afterwards?

CRUSH will consider all OSDs, but if it selects any OSDs which are out
then it retries until it gets one that is still marked in.
-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
