Hi all,
I'm adding a new OSD node with 36 OSDs to my cluster and have run into some problems. Here are the relevant details of the cluster:
1 OSD node with 80 OSDs
1 EC pool with k=10, m=3
pg_num 1024
osd failure domain
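For completeness, the pool was created roughly like this (from memory; the profile name is just an example):

    ceph osd erasure-code-profile set main-k10m3 k=10 m=3 crush-failure-domain=osd
    ceph osd pool create main-storage 1024 1024 erasure main-k10m3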
I added a second OSD node and started creating OSDs on it with ceph-deploy, one by one. The first two went in fine, but each subsequent OSD left more and more PGs stuck activating. I've added 14 new OSDs in total, but had to set 12 of them to a CRUSH weight of 0 to keep the cluster healthy and usable until I get this fixed.
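The per-OSD steps looked roughly like this (from memory; the hostname, device path and OSD id are just examples, and the exact ceph-deploy syntax depends on the release):

    # create one OSD on the new node
    ceph-deploy osd create --data /dev/sdb node2
    # later, zero the CRUSH weight of a problematic OSD to get the cluster healthy again
    ceph osd crush reweight osd.82 0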
I have read about similar behavior caused by PG overdose protection, but I don't think that's the case here because the failure domain is set to osd. Instead, I suspect my CRUSH rule needs some attention:
rule main-storage {
        id 1
        type erasure
        min_size 3
        max_size 13
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type osd
        step emit
}
I don't believe I have modified anything from the automatically generated rule except for the addition of the hdd class.
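In case it helps, this is roughly how I intend to sanity-check both theories, per-OSD PG counts against the overdose limit on one hand and the rule's mappings on the other (commands from memory):

    # PGS column shows how many PGs map to each OSD; the default
    # mon_max_pg_per_osd limit is 200 on Luminous, as far as I know
    ceph osd df tree

    # check whether rule 1 can find 13 OSDs (k+m) for every PG
    ceph osd getcrushmap -o crush.bin
    crushtool -i crush.bin --test --rule 1 --num-rep 13 --show-bad-mappings

My understanding is that if crushtool reports bad mappings, that would point at the rule (e.g. choose_tries set too low) rather than at overdose protection.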
I have been reading the documentation on CRUSH rules, but am having trouble figuring out whether the rule is set up properly. Once a few more nodes are added I do want to change the failure domain to host, but osd is sufficient for now.
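For reference, my understanding is that switching to a host failure domain later would only mean changing the choose step to something like

    step chooseleaf indep 0 type host

instead of the current "step choose indep 0 type osd" (untested, and it would of course need at least 13 hosts for k=10, m=3).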
Can anyone help me figure out whether the rule is causing these problems, or whether I should be looking at something else?