Re: CRUSH rule for 3 replicas across 2 hosts

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Mon, 20 Apr 2015 17:18:02 -0600

You usually won't end up with more than "size" replicas, even in a failure situation. Technically, more than "size" OSDs may hold the data (if an OSD comes back into service, the journal may be used to bring it back up to speed quickly), but the extra copies would not be active.
We use size 4 and min_size 2 so that we can lose an entire rack (2 copies) without blocking I/O. Our configuration prevents all four copies from landing in one rack. If we lose a rack and then an OSD in the surviving rack, write I/O to the affected placement groups will block until the objects have been replicated elsewhere in the rack, but there would still be no more than 2 copies.
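To make the arithmetic concrete, here is a minimal Python sketch of the failure cases described above. It is a toy model, not Ceph code: the rack and OSD names are made up, and it only counts surviving copies of a single placement group and compares the count to min_size.

```python
# Toy model (not Ceph code): count surviving copies of one placement group
# and check them against min_size. Rack and OSD names are hypothetical.

def surviving_copies(placement, failed_racks=(), failed_osds=()):
    """placement maps a rack name to the list of OSDs holding a copy."""
    return sum(
        1
        for rack, osds in placement.items()
        if rack not in failed_racks
        for osd in osds
        if osd not in failed_osds
    )

def io_allowed(copies, min_size):
    """I/O continues only while at least min_size copies are available."""
    return copies >= min_size

MIN_SIZE = 2

# size 4: two copies in each of two racks
size4 = {"rack1": ["osd.0", "osd.1"], "rack2": ["osd.2", "osd.3"]}
# size 3: the third copy has to share a rack with another copy
size3 = {"rack1": ["osd.0", "osd.1"], "rack2": ["osd.2"]}

rack_lost = surviving_copies(size4, failed_racks={"rack1"})
rack_and_osd_lost = surviving_copies(
    size4, failed_racks={"rack1"}, failed_osds={"osd.2"}
)
size3_rack_lost = surviving_copies(size3, failed_racks={"rack1"})

print(rack_lost, io_allowed(rack_lost, MIN_SIZE))                  # 2 True
print(rack_and_osd_lost, io_allowed(rack_and_osd_lost, MIN_SIZE))  # 1 False
print(size3_rack_lost, io_allowed(size3_rack_lost, MIN_SIZE))      # 1 False
```

The last case is the reason for going to size 4 on two racks: with only three copies, losing the rack that holds two of them leaves a single copy, which is below min_size 2.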

I hope I'm making sense and that my jabbering is useful.

On Mon, Apr 20, 2015 at 4:08 PM, Colin Corr <colin@xxxxxxxxxxxxx> wrote:

On 04/20/2015 01:46 PM, Robert LeBlanc wrote:

>

>

> On Mon, Apr 20, 2015 at 2:34 PM, Colin Corr <colin@xxxxxxxxxxxxx> wrote:

>

>

>

>     On 04/20/2015 11:02 AM, Robert LeBlanc wrote:

>     > We have a similar issue, but we wanted three copies across two racks. In the end we increased size to 4 and left min_size at 2. We didn't want to risk having fewer than two copies, and if we only had three copies, losing a rack would block I/O. Once we expand to a third rack, we will adjust our rule and go to size 3. Searching the mailing list and docs proved difficult, so I'll include my rule so that you can use it as a basis. You should be able to just change rack to host and host to osd. If you want to keep only three copies, the "extra" OSD chosen just won't be used, as Gregory mentions. Technically this rule should have "max_size 4", but I won't set a pool over 4 copies so I didn't change it here.

>     >

>     > If anyone has a better way of writing this rule (or one that would work for both a two-rack and a 3+ rack configuration as mentioned above), I'd be open to it. This is the first rule that I've really written on my own.

>     >

>     > rule replicated_ruleset {
>     >         ruleset 0
>     >         type replicated
>     >         min_size 1
>     >         max_size 10
>     >         step take default
>     >         step choose firstn 2 type rack
>     >         step chooseleaf firstn 2 type host
>     >         step emit
>     > }

>

>     Thank you Robert. Your example was very helpful. I didn't realize you could nest the choose and chooseleaf steps together. I thought chooseleaf effectively handled that for you already. This makes a bit more sense now.

>

>

> I'm still a little fuzzy on it myself, but not having an emit step between the choose and chooseleaf steps makes chooseleaf operate on the items chosen by choose instead of picking new things from all available entities. I couldn't get crushtool --test --simulate to work properly to confirm (http://tracker.ceph.com/issues/11224), but it is working properly in our cluster. Just FYI, min_size and max_size do not change your pools; they only specify what pool sizes the rule works for. Technically, if the pool size (replica size) is less than 2 or greater than 3, this rule would not be selected.

Thanks for the help. Reading your comments and re-reading the documentation is helpful in understanding how the rule language works. I had a few misconceptions.

Any thoughts as to what conditions would cause us to end up with more than the specified number of replicas? Is it for recovery scenarios or like a safety rail for flapping OSDs?

It would seem that the default min_size and max_size values (1 and 10) are sufficient for this rule, just as you demonstrated in your rule.

rule host_rule {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type host
        step chooseleaf firstn 2 type osd
        step emit
}
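As a rough illustration of why min_size 1 / max_size 10 is the safer choice, the following Python sketch models the selection behaviour Robert describes: a rule is only eligible when the pool's replica count falls inside its [min_size, max_size] range. This is a simplified model based on that description, not Ceph's implementation; the Rule class and find_rule function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Rule:
    """Only the rule header fields that matter for selection."""
    name: str
    ruleset: int
    type: str
    min_size: int
    max_size: int

def find_rule(rules: Sequence[Rule], ruleset: int, pool_type: str,
              pool_size: int) -> Optional[Rule]:
    """Return the first rule in the ruleset whose size range covers pool_size."""
    for rule in rules:
        if (rule.ruleset == ruleset
                and rule.type == pool_type
                and rule.min_size <= pool_size <= rule.max_size):
            return rule
    return None  # no rule applies to a pool of this size

# The earlier version of host_rule (min_size 2, max_size 3) ...
narrow = Rule("host_rule", ruleset=2, type="replicated", min_size=2, max_size=3)
# ... and the version with the default range (min_size 1, max_size 10).
wide = Rule("host_rule", ruleset=2, type="replicated", min_size=1, max_size=10)

narrow_size4 = find_rule([narrow], 2, "replicated", 4)
wide_size4 = find_rule([wide], 2, "replicated", 4)

print(narrow_size4)  # None: a size-4 pool falls outside 2..3
print(wide_size4.name if wide_size4 else None)  # host_rule
```

With the narrow range, raising the pool to size 4 later would leave it without a matching rule; the default range avoids that without changing how many replicas the pool actually keeps.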

>     My rule looks like this now:

>     rule host_rule {
>             ruleset 2
>             type replicated
>             min_size 2
>             max_size 3
>             step take default
>             step choose firstn 2 type host
>             step chooseleaf firstn 2 type osd
>             step emit
>     }

>

>     And the cluster is reporting the pool as clean, finally. If I understand correctly, we will now potentially have as many as 4 replicas of an object in the pool, 2 on each host.

>

>

> You will only have 4 replicas if you set the size of your pool to 4; if it is left at the default, it will be three. The rule will support up to 4 replicas.
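The interaction between the nested steps and the pool size can be sketched with a toy simulation in Python. This is only a sketch under stated assumptions: it is not the CRUSH algorithm (a SHA-256 ranking stands in for CRUSH's bucket selection), the host and OSD names are hypothetical, and it assumes, as discussed in this thread, that the rule yields up to four candidate OSDs and the pool uses only as many as its size.

```python
import hashlib

def _score(pg_id, item):
    """Deterministic pseudo-random score; a stand-in for CRUSH hashing."""
    digest = hashlib.sha256(f"{pg_id}:{item}".encode()).hexdigest()
    return int(digest, 16)

def choose(pg_id, items, n):
    """Pick n distinct items, always the same ones for a given pg_id."""
    ranked = sorted(items, key=lambda item: _score(pg_id, item), reverse=True)
    return ranked[:n]

def map_pg(pg_id, tree, hosts_wanted=2, osds_per_host=2, pool_size=3):
    """Mimic 'choose firstn 2 type host' followed by a pick of 2 OSDs per host."""
    hosts = choose(pg_id, list(tree), hosts_wanted)
    candidates = []
    for host in hosts:
        # No emit between the steps: the second pick works inside each
        # host selected by the first pick, not across the whole tree.
        candidates.extend(choose(pg_id, tree[host], osds_per_host))
    # The rule offers up to 4 OSDs; the pool keeps only the first pool_size.
    return candidates[:pool_size]

# Hypothetical two-host cluster with three OSDs per host.
tree = {
    "host-a": ["osd.0", "osd.1", "osd.2"],
    "host-b": ["osd.3", "osd.4", "osd.5"],
}

for pg in range(4):
    print(pg, map_pg(pg, tree, pool_size=3), map_pg(pg, tree, pool_size=4))
```

In this model a size-3 pool ends up with two copies on one host and one on the other, while a size-4 pool gets two on each.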

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
