Re: multi-datacenter crush map

On Mon, Sep 21, 2015 at 7:07 AM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>
> On Mon, Sep 21, 2015 at 3:02 AM, Wouter De Borger wrote:
>> Thank you for your answer! We will use size=4 and min_size=2, which should
>> do the trick.
>>
>> For the monitor issue, we have a third datacenter (with higher latency,
>> but that shouldn't be a problem for the monitors).
>
> Just be sure that the monitor IP address in the third DC sorts higher
> than those of the other two. Ceph ranks monitors by IP address, and the
> monitor with the lowest address becomes the leader, so having your
> highest-latency monitor as the leader is sure to give you grief.
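
(If you want to double-check which mon actually ends up as the leader
once all three are in quorum, something like this should show it,
assuming a reasonably recent ceph CLI:

    ceph quorum_status -f json-pretty | grep quorum_leader_name

That name should match one of the mons in the first two DCs.)
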
>
>> We had also considered the locality issue. Our WAN round trip latency is 1.5
>> ms (now) and we should get a dedicated light path in the near future (<0.1
>> ms).
>> So we hope to get acceptable latency without additional tweaking.
>> Plan B is to make two pools with different weights for the different
>> DCs: VMs in DC 1 will get high weight for DC 1, and VMs in DC 2 will
>> get high weight for DC 2.
>
> You don't want to mess with weights, or you will end up with stuck PGs:
> either there isn't enough storage at the remote DC to hold all your
> data, the local DC fills up faster, or CRUSH simply can't figure out
> how to place data evenly across an intentionally unbalanced hierarchy.
> You will probably want to look at primary affinity instead. This post
> talks about SSDs, but the same principle applies:
> http://www.sebastien-han.fr/blog/2015/08/06/ceph-get-the-best-of-your-ssd-with-primary-affinity/
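
For reference, setting primary affinity is just a couple of commands.
A rough sketch (osd.2 and 0.5 are placeholders, and on releases where
the option is still off by default you have to allow it first):

    # allow primary affinity to be changed (or put
    # "mon osd allow primary affinity = true" in ceph.conf)
    ceph tell mon.\* injectargs '--mon-osd-allow-primary-affinity 1'

    # make osd.2 less likely to be chosen as primary for its PGs
    # (1.0 is the default; 0 means "avoid using this OSD as a primary")
    ceph osd primary-affinity osd.2 0.5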

Keep in mind that primary affinity is per-OSD and applies to all
pools. If you have separate pools for each DC, you can construct
custom CRUSH rules for each one that always picks one or two OSDs out
of the "local" DC and then one or two out of the secondary.
Of course if you lose the primary DC you might want to edit the CRUSH
rules; things should keep working, but I'd expect the CRUSH
calculations to start taking longer as CRUSH repeatedly tries and fails
to pick OSDs out of the first, now-empty DC. ;)
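
To make that concrete, here is a rough sketch of what the rule for the
"DC 1 local" pool might look like (assuming datacenter-type buckets
named dc1 and dc2 in your CRUSH map, pool size=4, and a placeholder
ruleset id):

    rule dc1_primary {
            ruleset 3
            type replicated
            min_size 2
            max_size 4
            # primary + second replica from the local DC
            step take dc1
            step chooseleaf firstn 2 type host
            step emit
            # remaining replicas (size - 2) from the other DC
            step take dc2
            step chooseleaf firstn -2 type host
            step emit
    }

The mirror-image dc2_primary rule goes on the other pool, attached with
"ceph osd pool set <pool> crush_ruleset <id>". The first OSD emitted
comes from dc1, so it ends up as the primary.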

But if you can get a <0.1 ms link, I wouldn't expect you to need to worry about it.
-Greg

>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1



