Re: multi-datacenter crush map

You will want size=4 min_size=2 if you want to keep I/O going when a DC
fails and still ensure some data integrity. Data checksumming (which I
believe is being added) would provide much stronger integrity checking in
a two-copy situation, since you would be able to tell which of the two
copies is the good one instead of needing a third copy to break the tie.
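
For reference, that is just two commands on an existing pool (assuming a
pool named rbd here purely as an example):

ceph osd pool set rbd size 4
ceph osd pool set rbd min_size 2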

However, you have yet another problem on your hands. The way monitors
work makes this tricky. If you have one monitor in one DC and two in
the other, and the two-monitor DC burns down, the surviving DC stops
working too because more than 50% of the monitors are no longer
available. Putting two monitors in each DC is no better: if either DC
goes down, both stop working, because you need three out of four to
form a quorum. It has been suggested that putting the odd monitor in
the cloud (or some other location off-site from both DCs) could be an
option, though latency could cause problems. The cloud monitor would
complete the quorum with whichever DC survives.
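
As a rough sketch of that layout (all hostnames and addresses below are
made up for illustration), ceph.conf would then list five monitors, two
per DC plus the off-site one:

[global]
mon initial members = mon-dc1-a, mon-dc1-b, mon-dc2-a, mon-dc2-b, mon-offsite
mon host = 10.1.0.11, 10.1.0.12, 10.2.0.11, 10.2.0.12, 203.0.113.10

Any three of the five can form a quorum, so losing either DC (two
monitors at once) still leaves three.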

Also remember that there is no data locality awareness in Ceph at the
moment. This means the primary for a PG may be in the other DC. So your
client has to contact the primary in the other DC, then that OSD
contacts one OSD in its own DC and two in the other, waits for
confirmation that the write has been acknowledged, and then acks the
write back to the client. For a write you will be somewhere between
2 x ( LAN latency + WAN latency ) and 2 x ( LAN latency + 2 x WAN
latency ). Your reads will be between 2 x LAN latency and 2 x WAN
latency. Then there is write amplification, so make sure you have a lot
more WAN bandwidth than you think you need.
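
To put some purely illustrative numbers on that (assume 0.25 ms one-way
LAN latency and 5 ms one-way WAN latency):

write, best case:  2 x ( 0.25 + 5 )     = 10.5 ms
write, worst case: 2 x ( 0.25 + 2 x 5 ) = 20.5 ms
read:              anywhere from 2 x 0.25 = 0.5 ms up to 2 x 5 = 10 ms

So even in the best case, every write pays at least one full WAN round
trip.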

I think the large majority of us are eagerly waiting for the RBD
replication feature, or some sort of lag-behind OSD, for situations like
this.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sat, Sep 19, 2015 at 12:54 PM, Wouter De Borger <w.deborger@xxxxxxxxx> wrote:
> Ok, so if I understand correctly, for replication level 3 or 4 I would have
> to use the rule
>
> rule replicated_ruleset {
>
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take root
>     step choose firstn 2 type datacenter
>     step chooseleaf firstn 2 type host
>     step emit
> }
>
> The question I have now is: how will it behave when a DC goes down?
> (Assuming a catastrophic failure, i.e. the building burns down.)
>
> For example, say I set size to 3 and min_size to 3.
> Then, if a DC goes down, CRUSH will only return 2 OSDs, so everything
> will hang (same for 4/4 and 4/3).
>
> If I set size to 3 and min_size to 2, all the data of a PG could end up
> in one DC (degraded mode). If that DC goes down, the PG will hang anyway.
> As far as I know, degraded PGs will still accept writes, so data loss is
> possible. (Same for 4/2.)
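>
> (For what it's worth, this kind of failure scenario can be simulated
> offline with crushtool; "cm.bin" below is just a placeholder for the
> compiled crush map, and the --weight flags would need one entry per OSD
> in the "dead" DC:
>
> # normal mapping for the 2+2 rule
> crushtool -i cm.bin --test --rule 0 --num-rep 4 --show-mappings
> # same, but with osd.0 - osd.3 marked as failed (weight 0)
> crushtool -i cm.bin --test --rule 0 --num-rep 4 --show-mappings \
>     --weight 0 0 --weight 1 0 --weight 2 0 --weight 3 0
>
> With every OSD of one DC weighted to 0, each mapping should only list
> the two OSDs from the surviving DC, which is what makes min_size 3
> hang.)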
>
>
>
> I can't seem to find a way around this. What am I missing?
>
>
> Wouter
>
>
>
>
> On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>
>> On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger <w.deborger@xxxxxxxxx>
>> wrote:
>> > Hi all,
>> >
>> > I have found on the mailing list that it should be possible to have a
>> > multi
>> > datacenter setup, if latency is low enough.
>> >
>> > I would like to set this up, so that each datacenter has at least two
>> > replicas and each PG has a replication level of 3.
>> >
>> > In this mail, it is suggested that I should use the following crush map
>> > for
>> > multi DC:
>> >
>> > rule dc {
>> >     ruleset 0
>> >     type replicated
>> >     min_size 1
>> >     max_size 10
>> >     step take default
>> >     step chooseleaf firstn 0 type datacenter
>> >     step emit
>> > }
>> >
>> > This looks suspicious to me, as it will only generate a list of two
>> > OSDs (and only one OSD if one DC is down).
>> >
>> > I think I should use:
>> >
>> > rule replicated_ruleset {
>> >     ruleset 0
>> >     type replicated
>> >     min_size 1
>> >     max_size 10
>> >     step take root
>> >     step choose firstn 2 type datacenter
>> >     step chooseleaf firstn 2 type host
>> >     step emit
>> >     step take root
>> >     step chooseleaf firstn -4 type host
>> >     step emit
>> > }
>> >
>> > This correctly generates a list with 2 OSDs in one DC, then 2 OSDs in
>> > the other, and then a list of additional OSDs.
>> >
>> > The problem is that this list contains duplicates (e.g. for 8 OSDs per
>> > DC):
>> >
>> > [13,11,1,8,13,11,16,4,3,7]
>> > [9,2,13,11,9,15,12,18,3,5]
>> > [3,5,17,10,3,5,7,13,18,10]
>> > [7,6,11,14,7,14,3,16,4,11]
>> > [6,3,15,18,6,3,12,9,16,15]
>> >
>> > Will this be a problem?
>>
>> For replicated pools, it probably will cause trouble. For EC pools I
>> think it should work fine, but obviously you're losing all kinds of
>> redundancy. Nothing in the system will do work to avoid colocating
>> them if you use a rule like this. Rather than distributing some of the
>> replicas randomly across DCs, you really just want to split them up
>> evenly across datacenters (or in some ratio, if one has more space
>> than the other). Given CRUSH's current abilities that does require
>> building the replication size into the rule, but such is life.
>>
>>
>> > When CRUSH is executed, will it only consider OSDs which are (up, in),
>> > or all OSDs in the map and then filter them from the list afterwards?
>>
>> CRUSH will consider all OSDs, but if it selects any OSDs which are out
>> then it retries until it gets one that is still marked in.
>> -Greg