Re: CRUSH rule for 3 replicas across 2 hosts

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Mon, 20 Apr 2015 12:02:27 -0600

We have a similar issue, but we wanted three copies across two racks. In the end we increased size to 4 and left min_size at 2. We didn't want to risk having fewer than two copies, and with only three copies, losing the rack that holds two of them would leave a single copy, which is below min_size and would block I/O. Once we expand to a third rack, we will adjust our rule and go to size 3. Searching the mailing list and docs proved difficult, so I'll include my rule so that you can use it as a basis. You should be able to just change rack to host and host to osd. If you want to keep only three copies, the "extra" OSD chosen just won't be used, as Gregory mentions. Technically this rule should have "max_size 4", but I won't set a pool over 4 copies, so I didn't change it here.
If anyone has a better way of writing this rule (or one that would work for both a two-rack and a 3+ rack configuration as mentioned above), I'd be open to it. This is the first rule that I've really written on my own.

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
}
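
For the two-host case in the original question, swapping rack for host and host for osd would give something like the rule below. This is an untested sketch, not something I have run: the rule name and ruleset number are copied from Colin's host_rule, "max_size 4" is an assumption based on the note above, and the last step uses "choose" rather than "chooseleaf" because OSDs are already leaves. Test it with your installed version before relying on it.

```
rule host_rule {
        ruleset 2
        type replicated
        min_size 1
        max_size 4
        step take default
        # pick two hosts (both of them on a two-host cluster)
        step choose firstn 2 type host
        # then pick two OSDs inside each host, four candidates in total
        step choose firstn 2 type osd
        step emit
}
```

With size 3 the pool should use only the first three of the four OSDs returned, so one host holds two copies and the other holds one.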

On Mon, Apr 20, 2015 at 11:50 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
On Mon, Apr 20, 2015 at 10:46 AM, Colin Corr <colin@xxxxxxxxxxxxx> wrote:

> Greetings Cephers,
>
> I have hit a bit of a wall between the available documentation and my understanding of it with regard to CRUSH rules. I am trying to determine if it is possible to replicate 3 copies across 2 hosts, such that if one host is completely lost, at least 1 copy is available. The problem I am experiencing is that if I enable my host_rule for a data pool, the cluster never gets back to a clean state. All pgs in a pool with this rule will be stuck unclean.
>
> This is the rule:
>
> rule host_rule {
>         ruleset 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
>
> And if it's pertinent, all nodes are running 0.80.9 on Ubuntu 14.04. Pool pg/pgp set to 2048, replicas 3. Tunables set to optimal.
>
> I assume that is happening because of simple math: 3 copies on 2 hosts. And CRUSH is expecting a 3rd host to balance everything out, since I defined the rule as host based. This rule runs fine on another 3 host test cluster. So, it would seem that the potential solutions are to change replication to 2 copies or add a 3rd OSD host. But, with all of the cool bucket types and rule options, I suspect I am missing something here. Alas, I am hoping there is some (not so obvious to me) CRUSH magic that could be applied here.

It's actually pretty hacky: you configure your CRUSH rule to return
two OSDs from each host, but set your size to 3. You'll want to test
this carefully with your installed version to make sure that works,
though; older CRUSH implementations would crash if you did that. :(

In slightly more detail, you'll need to change it so that instead of
using "chooseleaf" you "choose" 2 hosts, and then choose or chooseleaf
2 OSDs from each of those hosts. If you search the list archives for
CRUSH threads you'll find some other discussions about doing precisely
this, and I think the CRUSH documentation should cover the more
general bits of how the language works.

-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
