Re: multi-datacenter crush map

Ok, so if I understand correctly, for replication level 3 or 4 I would have to use the following rule:

rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take root
    # pick two datacenter buckets under root
    step choose firstn 2 type datacenter
    # then pick two hosts in each chosen DC, one OSD per host
    step chooseleaf firstn 2 type host
    step emit
}
The question I have now is: how will this behave when a DC goes down? (Assume a catastrophic failure: the whole datacenter burns down.)

For example, say I set the pool size to 3 and min_size to 3.
Then, if a DC goes down, CRUSH will only return 2 OSDs for each PG, so everything will hang (same for 4/4 and 4/3).

If I set the pool size to 3 and min_size to 2, it can happen that all surviving copies of a PG end up in one DC (degraded mode). If that DC then goes down as well, the PG will hang.
As far as I know, degraded PGs will still accept writes, so data loss is possible (same for 4/2).
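
(Side note: the min_size and max_size lines inside the crush rule itself are not the pool's min_size. They only declare the range of replica counts the rule is meant to handle, so CRUSH won't select the rule for a pool whose size falls outside that range. For a rule used only at replication 3 or 4 they could be tightened, e.g.:

    min_size 3
    max_size 4

The pool's min_size, which decides when a PG stops serving I/O, is set separately on the pool.)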



I can't seem to find a way around this. What am I missing?


Wouter




On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger <w.deborger@xxxxxxxxx> wrote:
> Hi all,
>
> I have found on the mailing list that it should be possible to have a multi
> datacenter setup, if latency is low enough.
>
> I would like to set this up, so that each datacenter has at least two
> replicas and each PG has a replication level of 3.
>
> In this mail, it is suggested that I should use the following crush map for
> multi DC:
>
> rule dc {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type datacenter
>     step emit
> }
>
> This looks suspicious to me, as it will only generate a list of two OSDs
> (and only one OSD if one DC is down).
>
> I think I should use:
>
> rule replicated_ruleset {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take root
>     step choose firstn 2 type datacenter
>     step chooseleaf firstn 2 type host
>     step emit
>     step take root
>     step chooseleaf firstn -4 type host
>     step emit
> }
>
> This correctly generates a list with 2 OSDs in one DC, then 2 OSDs in the
> other, and then a further list of OSDs.
>
> The problem is that this list contains duplicates (e.g. for 8 OSDs per DC)
>
> [13,11,1,8,13,11,16,4,3,7]
> [9,2,13,11,9,15,12,18,3,5]
> [3,5,17,10,3,5,7,13,18,10]
> [7,6,11,14,7,14,3,16,4,11]
> [6,3,15,18,6,3,12,9,16,15]
>
> Will this be a problem?

For replicated pools, it probably will cause trouble. For EC pools I
think it should work fine, but obviously you're losing all kinds of
redundancy. Nothing in the system will do work to avoid colocating
them if you use a rule like this. Rather than distributing some of the
replicas randomly across DCs, you really just want to split them up
evenly across datacenters (or in some ratio, if one has more space
than the other). Given CRUSH's current abilities that does require
building the replication size into the rule, but such is life.
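
As a sketch of what "building the size into the rule" can look like,
here is a 3-replica rule that takes each datacenter bucket explicitly
(the bucket names dc1 and dc2 are placeholders for whatever the actual
crush map calls them):

rule replicated_3_two_dc {
    ruleset 1
    type replicated
    min_size 3
    max_size 3
    # two copies on separate hosts in the first datacenter
    step take dc1
    step chooseleaf firstn 2 type host
    step emit
    # one copy in the second datacenter
    step take dc2
    step chooseleaf firstn 1 type host
    step emit
}

Losing dc2 leaves two of the three copies, while losing dc1 leaves only
one, so the pool's min_size still decides whether I/O continues.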


> If CRUSH is executed, will it only consider OSDs which are (up, in) or all
> OSDs in the map and then filter them from the list afterwards?

CRUSH will consider all OSDs, but if it selects any OSDs which are out
then it retries until it gets one that is still marked in.
-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
