Re: CRUSH rule for 3 replicas across 2 hosts

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Tue, 21 Apr 2015 10:08:47 -0600

Your logic isn't quite right and from what I understand, this is how it works:
step choose firstn 2 type rack       # Choose two racks from the CRUSH map (my CRUSH only has two, so select both of them)
step chooseleaf firstn 2 type host  # From the set chosen previously (two racks), select a leaf (OSD) from each of 2 hosts in each rack.

If you have size 3, it will pick two OSDs from one rack and one from the second (remember that the first rack in placement will sometimes be 'A' and sometimes 'B' so the placement won't be totally unbalanced).
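To make the "two from one rack, one from the other" result concrete, here is a toy Python model of those two steps. It is only a sketch: the rack, host and OSD names are made up, and a seeded random generator stands in for CRUSH's real deterministic hashing, so it shows the shape of the result rather than how Ceph computes it.

```python
import random

# Hypothetical two-rack layout; all names are made up for illustration.
CRUSH_TREE = {
    "rackA": {"hostA1": ["osd.0", "osd.1"], "hostA2": ["osd.2", "osd.3"]},
    "rackB": {"hostB1": ["osd.4", "osd.5"], "hostB2": ["osd.6", "osd.7"]},
}

def place(tree, pool_size, rng):
    """Toy stand-in for 'choose firstn 2 type rack' + 'chooseleaf firstn 2 type host'."""
    candidates = []
    # step choose firstn 2 type rack: two distinct racks, in pseudo-random order
    for rack in rng.sample(sorted(tree), 2):
        # step chooseleaf firstn 2 type host: two distinct hosts, one OSD from each
        for host in rng.sample(sorted(tree[rack]), 2):
            candidates.append((rack, rng.choice(tree[rack][host])))
    # The rule yields up to 4 OSDs; a pool only uses the first pool_size of them.
    return candidates[:pool_size]

placement = place(CRUSH_TREE, pool_size=3, rng=random.Random(42))
racks = [rack for rack, _ in placement]
print(placement)
print({rack: racks.count(rack) for rack in sorted(set(racks))})
```

With pool_size=3 the first rack chosen always contributes two OSDs and the second contributes one; with pool_size=4 each rack contributes two.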

Where min_size and max_size come in could be something like this (this is somewhat exaggerated):

Let's say that you want the minimal possible latency and highest bandwidth and are OK with losing data (swap partitions or something). You create a pool with size 1 and a rule like this:

rule replicated_swap {
        ruleset 0
        type replicated
        min_size 1
        max_size 1
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

Then you have a pool you want to put on some hosts that have RAID5-protected OSDs, so you don't need as many replicas because RAID will protect against disk failures:
rule replicated_raid5 {
        ruleset 1
        type replicated
        min_size 2
        max_size 2
        step take raid5
        step chooseleaf firstn 0 type host
        step emit
}

Then you have a pool that you want "default" protection for 3-4 copies:
rule replicated_default {
        ruleset 2
        type replicated
        min_size 3
        max_size 4
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

Then you have a pool that you absolutely can't lose data on, so you have lots of copies and want it spread throughout the data center:
rule replicated_paranoid {
        ruleset 3
        type replicated
        min_size 5
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}

You then specify the rule to use for each pool. Again, min_size and max_size act as a selector for the rule. If the actual pool size is outside the min_size to max_size range, then the rule should not run (I don't know if it actually does this or is just a reminder for the human to know what sizes the rule was intentionally written for).
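As a sketch of the "selector" reading only, the four example rules can be modelled in a few lines of Python. The Rule class and the find_rule helper are hypothetical names for illustration, not Ceph APIs, and this does not settle whether Ceph enforces the range.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    ruleset: int
    min_size: int
    max_size: int

# The four example rules from this message, reduced to their size ranges.
RULES = [
    Rule("replicated_swap", 0, 1, 1),
    Rule("replicated_raid5", 1, 2, 2),
    Rule("replicated_default", 2, 3, 4),
    Rule("replicated_paranoid", 3, 5, 10),
]

def find_rule(rules, ruleset, pool_size):
    """Return the rule in this ruleset whose size range covers pool_size, else None."""
    for rule in rules:
        if rule.ruleset == ruleset and rule.min_size <= pool_size <= rule.max_size:
            return rule
    return None

print(find_rule(RULES, ruleset=2, pool_size=3))  # size 3 is inside 3..4
print(find_rule(RULES, ruleset=2, pool_size=5))  # size 5 is outside 3..4
```

Under this reading, a pool of size 5 pointed at ruleset 2 matches nothing, while the same pool pointed at ruleset 3 matches replicated_paranoid.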

On Tue, Apr 21, 2015 at 8:36 AM, Colin Corr <colin@xxxxxxxxxxxxx> wrote:

On 04/20/2015 04:18 PM, Robert LeBlanc wrote:

> You usually won't end up with more than the "size" number of replicas, even in a failure situation. Although technically more than "size" number of OSDs may have the data (if the OSD comes back in service, the journal may be used to quickly get the OSD back up to speed), these would not be active.

>

> For us, using size 4 and min size 2 is so that we can lose an entire rack (2 copies) but not block I/O. Our configuration prevents four copies in one rack. If we lose a rack and then an OSD in the surviving rack, write I/O to those placement groups will block until the objects have been replicated elsewhere in the rack, but it would not be more than 2 copies.

>

> I hope I'm making sense and that my jabbering is useful.

Yes, it is helpful, thank you. My clarity level has been upgraded from mud to stained glass.

If I am following the logic of your rule correctly:

1. If we have less than 2 replicas per rack, run this step:

step choose firstn 2 type rack

2. If we have less than 2 replicas on our hosts in this rack, run this step:

step chooseleaf firstn 2 type host

I still don't understand where exactly max_size comes into play, unless you have some elaborate chain of rules, like mixing platter and ssd drives in the same pool. The documented example for this scenario is the only one I have found that utilizes the max_size in a meaningful way.

Anyway, thanks for your help in translating from CRUSH to English.

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
