Hi,

I have 5 data nodes (bluestore, Kraken), each with 24 OSDs, and I have enabled the optimal crush tunables. I'd like to "really" use EC pools, but until now I've faced cluster lockups when using 3+2 EC pools with a host failure domain, e.g. when a host was down ;) Since I'd like erasure coding to be more than a "nice to have feature with 12+ Ceph data nodes", I wanted to try this:
- Use a 14+6 EC rule
- And for each data chunk:
  o select 4 hosts
  o on these hosts, select 5 OSDs

In order to do that, I created this rule in the crush map:

   rule 4hosts_20shards {
           ruleset 3
           type erasure
           min_size 20
           max_size 20
           step set_chooseleaf_tries 5
           step set_choose_tries 100
           step take default
           step choose indep 4 type host
           step chooseleaf indep 5 type osd
           step emit
   }
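For what it's worth, the mappings produced by such a rule can be sanity-checked offline with crushtool before injecting the map (the /tmp/crushmap path is just an example; 3 is the ruleset number from the rule above):

   crushtool -i /tmp/crushmap --test --rule 3 --num-rep 20 --show-mappings

Each output line lists the 20 OSDs chosen for one input value, which makes it easy to verify that they span exactly 4 hosts with 5 OSDs each.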
I then created an EC pool with this erasure profile:

   ceph osd erasure-code-profile set erasurep14_6_osd ruleset-failure-domain=osd k=14 m=6

I hoped this would allow for losing one host completely without locking the cluster, and I have the impression this is working... but there's always a but ;)

I tried taking all the OSDs of one node down by stopping the ceph-osd daemons, and according to Ceph the cluster is unhealthy. "ceph health detail" gives me, for instance, this (for the 3+2 and 14+6 pools):

   pg 5.18b is active+undersized+degraded, acting [57,47,2147483647,23,133]
   pg 9.186 is active+undersized+degraded, acting [2147483647,2147483647,2147483647,2147483647,2147483647,133,142,125,131,137,50,48,55,65,52,16,13,18,22,3]

My question therefore is: why aren't the down PGs remapped onto my 5th data node, since I made sure the 20 EC shards were spread onto 4 hosts only? I thought/hoped that because OSDs were down, the data would be rebuilt onto another OSD/host. I can understand that the 3+2 EC pool cannot allocate OSDs on another host, because 3+2 = 5 hosts already, but I don't understand why the 14+6 EC pool/PGs do not rebuild somewhere else.

I do not find anything useful in a "ceph pg query": the up and acting sets are equal, and both contain the 2147483647 value (which means "none", as far as I understood).

I've also tried to "ceph osd out" all the OSDs from one host: in that case, the 3+2 EC PGs behave as previously, but the 14+6 EC PGs seem happy, despite the fact that they still report the out OSDs as up and acting.

Is my crush rule that wrong? Is it possible to do what I want?

Thanks for any hints...

Regards,
Frederic
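PS: for completeness, the pool itself was created on top of that profile and rule with something like this (the pool name and PG counts are just what I picked for testing):

   ceph osd pool create ecpool_14_6 1024 1024 erasure erasurep14_6_osd 4hosts_20shards

And as a side note on the odd number in the acting sets above: 2147483647 is 2^31 - 1, the placeholder CRUSH uses when it could not map a shard to any OSD.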