Hi,

I have 5 data nodes (bluestore, Kraken), each with 24 OSDs, and I have enabled the optimal crush tunables. I'd like to "really" use EC pools, but until now I've faced cluster lockups when using 3+2 EC pools with a host failure domain, e.g. when a host was down ;) Since I'd like erasure coding to be more than a "nice to have feature with 12+ Ceph data nodes", I wanted to try this:
- Use a 14+6 EC rule
- And for each data chunk:
  o select 4 hosts
  o on these hosts, select 5 OSDs

In order to do that, I created this rule in the crush map:

   rule 4hosts_20shards {
           ruleset 3
           type erasure
           min_size 20
           max_size 20
           step set_chooseleaf_tries 5
           step set_choose_tries 100
           step take default
           step choose indep 4 type host
           step chooseleaf indep 5 type osd
           step emit
   }
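For what it's worth, the mappings produced by such a rule can be sanity-checked offline with crushtool before injecting the map (the /tmp/crushmap path is just an example; 3 is the ruleset number from the rule above):

   crushtool -i /tmp/crushmap --test --rule 3 --num-rep 20 --show-mappings

Each output line lists the 20 OSDs chosen for one input value, which makes it easy to verify that they span exactly 4 hosts with 5 OSDs each.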
I then created an EC pool with this erasure profile:

   ceph osd erasure-code-profile set erasurep14_6_osd ruleset-failure-domain=osd k=14 m=6

I hoped this would allow for losing one host completely without locking the cluster, and I have the impression this is working... but there's always a but ;)

I tried taking all the OSDs of one node down by stopping the ceph-osd daemons, and according to Ceph the cluster is unhealthy. "ceph health detail" gives me, for instance, this (for the 3+2 and 14+6 pools):

   pg 5.18b is active+undersized+degraded, acting [57,47,2147483647,23,133]
   pg 9.186 is active+undersized+degraded, acting [2147483647,2147483647,2147483647,2147483647,2147483647,133,142,125,131,137,50,48,55,65,52,16,13,18,22,3]

My question therefore is: why aren't the down PGs remapped onto my 5th data node, since I made sure the 20 EC shards were spread onto 4 hosts only? I thought/hoped that because OSDs were down, the data would be rebuilt onto another OSD/host. I can understand that the 3+2 EC pool cannot allocate OSDs on another host, because 3+2 = 5 hosts already, but I don't understand why the 14+6 EC pool/PGs do not rebuild somewhere else.

I do not find anything useful in a "ceph pg query": the up and acting sets are equal, and both contain the 2147483647 value (which means "none", as far as I understood).

I've also tried to "ceph osd out" all the OSDs from one host: in that case, the 3+2 EC PGs behave as previously, but the 14+6 EC PGs seem happy, despite the fact that they still report the out OSDs as up and acting.

Is my crush rule that wrong? Is it possible to do what I want?

Thanks for any hints...

Regards,
Frederic
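PS: for completeness, the pool itself was created on top of that profile and rule with something like this (the pool name and PG counts are just what I picked for testing):

   ceph osd pool create ecpool_14_6 1024 1024 erasure erasurep14_6_osd 4hosts_20shards

And as a side note on the odd number in the acting sets above: 2147483647 is 2^31 - 1, the placeholder CRUSH uses when it could not map a shard to any OSD.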