Thank you for your answer! We will use size=4 and min_size=2, which should do the trick.
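For reference, setting that on an existing pool should be something like the following (assuming the pool is named rbd):

ceph osd pool set rbd size 4
ceph osd pool set rbd min_size 2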
For the monitor issue, we have a third datacenter (with higher latency, but that shouldn't be a problem for the monitors).
We had also considered the locality issue. Our WAN round trip latency is 1.5 ms (now) and we should get a dedicated light path in the near future (<0.1 ms).
So we hope to get acceptable latency without additional tweaking.
Plan B is to make two pools, with different weights for the two DCs: VMs in DC 1 will get a pool weighted towards DC 1, and VMs in DC 2 a pool weighted towards DC 2.
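A rough sketch of what the DC 1 rule could look like (assuming datacenter buckets named dc1 and dc2, replication size 4, and reading "high weight" as "primary and first replicas in the local DC"; the DC 2 rule would be the mirror image):

rule dc1_local {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    # first two replicas (including the primary) from the local DC
    step take dc1
    step chooseleaf firstn 2 type host
    step emit
    # remaining replicas from the remote DC
    step take dc2
    step chooseleaf firstn -2 type host
    step emit
}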
Thanks,
Wouter
On Sat, Sep 19, 2015 at 9:31 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
You will want size=4 min_size=2 if you want to keep I/O going when a DC
fails and still ensure some data integrity. Data checksumming (which I
think is being added) would provide much stronger data integrity
checking in a two-copy situation, as you would be able to tell which of
the two copies is the good one instead of needing a third to break the tie.
However, you have yet another problem on your hands. The way monitors
work makes this tricky. If you have one monitor in one DC and two in
the other, and the DC with two monitors burns down, the surviving DC
stops working too, because more than 50% of the monitors have to be
available. Putting two monitors in each DC just means that both DCs
stop working if either one goes down (you need three out of four to
form a quorum). It has been suggested that putting the odd monitor in
the cloud (or some other location off-site from both DCs) could be an
option, but latency could cause problems. The cloud monitor would
complete the quorum with whichever DC survives.
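As an illustration only (hostnames and addresses here are made up), a three-site monitor layout in ceph.conf could look something like:

[mon.a]
    host = mon-dc1
    mon addr = 10.1.0.11
[mon.b]
    host = mon-dc2
    mon addr = 10.2.0.11
[mon.c]
    host = mon-site3
    mon addr = 198.51.100.11

With three monitors spread over three sites, losing any one site still leaves 2 of 3 monitors, so quorum survives.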
Also remember that there is no data locality awareness in Ceph at the
moment. This could mean that the primary for a PG is in the other DC.
So your client has to contact the primary in the other DC, then that
OSD contacts one OSD in its own DC and two in the other, waits for
confirmation that the writes are committed, and only then acks the
write to the client. For a write you will be between 2 x ( LAN latency
+ WAN latency ) and 2 x ( LAN latency + 2 x WAN latency ). Additionally,
your reads will be between 2 x LAN latency and 2 x WAN latency. Then
there is write amplification, so make sure you have a lot more WAN
bandwidth than you think you need.
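To put rough numbers on that (taking the ~1.5 ms WAN round trip mentioned elsewhere in this thread and assuming ~0.2 ms on the LAN):

    writes: between 2 x ( 0.2 + 1.5 ) = 3.4 ms and 2 x ( 0.2 + 2 x 1.5 ) = 6.4 ms
    reads:  between 2 x 0.2 = 0.4 ms and 2 x 1.5 = 3.0 ms

per operation, before any queuing or disk latency.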
I think the large majority of us are eagerly waiting for the RBD
replication feature, or some sort of lag-behind OSD, for situations
like this.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Sat, Sep 19, 2015 at 12:54 PM, Wouter De Borger <w.deborger@xxxxxxxxx> wrote:
> Ok, so if I understand correctly, for replication level 3 or 4 I would have
> to use the rule
>
> rule replicated_ruleset {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take root
>     step choose firstn 2 type datacenter
>     step chooseleaf firstn 2 type host
>     step emit
> }
>
> The question I have now is: how will it behave when a DC goes down?
> (Assuming catastrophic failure, the thing burns down)
>
> For example, if I set replication (size) to 3 and min_size to 3, then, if a
> DC goes down, CRUSH will only return 2 OSDs per PG, so everything will
> hang (same for 4/4 and 4/3).
>
> If I set size to 3 and min_size to 2, it could occur that all data of a
> PG ends up in one DC (degraded mode). If that DC goes down, the PG will hang.
> As far as I know, degraded PGs will still accept writes, so data loss is
> possible (same for 4/2).
>
>
>
> I can't seem to find a way around this. What am I missing?
>
>
> Wouter
>
>
>
>
> On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>
>> On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger <w.deborger@xxxxxxxxx>
>> wrote:
>> > Hi all,
>> >
>> > I have found on the mailing list that it should be possible to have a
>> > multi-datacenter setup, if latency is low enough.
>> >
>> > I would like to set this up, so that each datacenter has at least two
>> > replicas and each PG has a replication level of 3.
>> >
>> > In this mail, it is suggested that I should use the following CRUSH rule
>> > for a multi-DC setup:
>> >
>> > rule dc {
>> >     ruleset 0
>> >     type replicated
>> >     min_size 1
>> >     max_size 10
>> >     step take default
>> >     step chooseleaf firstn 0 type datacenter
>> >     step emit
>> > }
>> >
>> > This looks suspicious to me, as it will only generate a list of two OSDs
>> > (and only one if a DC is down).
>> >
>> > I think I should use:
>> >
>> > rule replicated_ruleset {
>> >     ruleset 0
>> >     type replicated
>> >     min_size 1
>> >     max_size 10
>> >     step take root
>> >     step choose firstn 2 type datacenter
>> >     step chooseleaf firstn 2 type host
>> >     step emit
>> >     step take root
>> >     step chooseleaf firstn -4 type host
>> >     step emit
>> > }
>> >
>> > This correctly generates a list with 2 OSDs in one DC, then 2 OSDs in
>> > the other, and then a list of further OSDs.
>> >
>> > The problem is that this list contains duplicates (e.g. for 8 OSDs per
>> > DC):
>> >
>> > [13,11,1,8,13,11,16,4,3,7]
>> > [9,2,13,11,9,15,12,18,3,5]
>> > [3,5,17,10,3,5,7,13,18,10]
>> > [7,6,11,14,7,14,3,16,4,11]
>> > [6,3,15,18,6,3,12,9,16,15]
>> >
>> > Will this be a problem?
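>> >
>> > (Mappings like these can be reproduced from a compiled CRUSH map with
>> > crushtool's test mode, roughly:
>> >
>> > crushtool -i crushmap.bin --test --rule 0 --num-rep 10 --show-mappings
>> >
>> > assuming the compiled map has been saved as crushmap.bin.)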
>>
>> For replicated pools, it probably will cause trouble. For EC pools I
>> think it should work fine, but obviously you're losing all kinds of
>> redundancy. Nothing in the system will do work to avoid colocating
>> them if you use a rule like this. Rather than distributing some of the
>> replicas randomly across DCs, you really just want to split them up
>> evenly across datacenters (or in some ratio, if one has more space
>> than the other). Given CRUSH's current abilities that does require
>> building the replication size into the rule, but such is life.
>>
>>
>> > If crush is executed, will it only consider osd's which are (up,in) or
>> > all
>> > OSD's in the map and then filter them from the list afterwards?
>>
>> CRUSH will consider all OSDs, but if it selects any OSDs which are out
>> then it retries until it gets one that is still marked in.
>> -Greg