On 02/03/2019 01:02, Ravi Patel wrote:
Hello,
My question is how CRUSH distributes chunks throughout the cluster with
erasure coded pools. Currently, we have 4 OSD nodes with 36 drives (OSD
daemons) per node. If we use crush_failure_domain=host, then we are
necessarily limited to k=3,m=1 or k=2,m=2. We would like to explore
k>3, m>2 modes of coding but are unsure how the CRUSH rule set will
distribute the chunks if we set crush_failure_domain to OSD.
Ideally, we would like CRUSH to distribute the chunks hierarchically so
that they are spread evenly across the nodes, rather than, for example,
ending up with all chunks on a single node.
Are chunks evenly spread by default? If not, how might we go about
configuring them?
You can write your own CRUSH rules to distribute chunks hierarchically.
For example, you can have a k=6, m=2 code together with a rule that
guarantees that each node gets two chunks. This means that if you lose a
node you do not lose data (though, depending on your min_size setting,
your pool might be unavailable at that point until you replace the node
or add a new one and the chunks can be recovered). You would accomplish
this with a rule that looks like this:
rule ec8 {
    id <some free id>
    type erasure
    min_size 7
    max_size 8
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 4 type host
    step chooseleaf indep 2 type osd
    step emit
}
This means the rule will first pick 4 hosts, then pick 2 OSDs per host,
resulting in a total of 8 OSDs. This is appropriate for k=6 m=2 codes as
well as k=5 m=2 codes (which will simply leave one of the chosen OSDs
unused), hence min_size 7 and max_size 8.
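
Roughly, the workflow to use such a rule would look like this (untested
sketch; the map file names, the profile name ec62, the pool name ecpool
and the PG counts are just placeholders you would adjust for your
cluster):

# export, decompile, edit, recompile and re-inject the CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ... add the ec8 rule to crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new

# create a matching EC profile and a pool that uses the ec8 rule
ceph osd erasure-code-profile set ec62 k=6 m=2
ceph osd pool create ecpool 256 256 erasure ec62 ec8

# optionally allow I/O with only k chunks available
# (trades some safety for availability after losing a node)
ceph osd pool set ecpool min_size 6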
If you just set crush_failure_domain to OSD, then the rule will pick
random OSDs without regard for the hosts; you will effectively be able
to use any EC width you want, but there will be no guarantee of data
durability if you lose a whole host.
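
For reference, that variant is just a setting in the erasure code
profile (again, profile/pool/object names below are placeholders), and
you can check where the chunks of any given object actually land with
ceph osd map:

ceph osd erasure-code-profile set ec62-osd k=6 m=2 crush-failure-domain=osd
ceph osd pool create ecpool-osd 256 256 erasure ec62-osd

# show the PG and the up/acting OSD set for an object,
# to see how its chunks are spread across hosts
ceph osd map ecpool-osd someobject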
--
Hector Martin (hector@xxxxxxxxxxxxxx)
Public Key: https://mrcn.st/pub