Hi all,
I'm adding a new OSD node with 36 OSDs to my cluster and have run into some problems. Here are the relevant details of the cluster:
1 OSD node with 80 OSDs
1 EC pool with k=10, m=3
pg_num 1024
osd failure domain
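For completeness, the pool was created roughly like this (from memory; the profile name is just an example):

    ceph osd erasure-code-profile set main-k10m3 k=10 m=3 crush-failure-domain=osd
    ceph osd pool create main-storage 1024 1024 erasure main-k10m3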
I added a second OSD node and started creating OSDs on it with ceph-deploy, one by one. The first two went in fine, but each subsequent OSD left more and more PGs stuck activating. I've added 14 new OSDs in total, but had to set 12 of them to a CRUSH weight of 0 to keep the cluster healthy and usable until I get this fixed.
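The per-OSD steps looked roughly like this (from memory; the hostname, device path and OSD id are just examples, and the exact ceph-deploy syntax depends on the release):

    # create one OSD on the new node
    ceph-deploy osd create --data /dev/sdb node2
    # later, zero the CRUSH weight of a problematic OSD to get the cluster healthy again
    ceph osd crush reweight osd.82 0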
I have read about similar behavior caused by PG overdose protection, but I don't think that's the case here because the failure domain is set to osd. Instead, I suspect my CRUSH rule needs some attention:
rule main-storage {
        id 1
        type erasure
        min_size 3
        max_size 13
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type osd
        step emit
}
I don't believe I have modified anything from the automatically generated rule except for the addition of the hdd class.
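In case it helps, this is roughly how I intend to sanity-check both theories, per-OSD PG counts against the overdose limit on one hand and the rule's mappings on the other (commands from memory):

    # PGS column shows how many PGs map to each OSD; the default
    # mon_max_pg_per_osd limit is 200 on Luminous, as far as I know
    ceph osd df tree

    # check whether rule 1 can find 13 OSDs (k+m) for every PG
    ceph osd getcrushmap -o crush.bin
    crushtool -i crush.bin --test --rule 1 --num-rep 13 --show-bad-mappings

My understanding is that if crushtool reports bad mappings, that would point at the rule (e.g. choose_tries set too low) rather than at overdose protection.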
I have been reading the documentation on CRUSH rules, but am having trouble figuring out whether the rule is set up properly. Once a few more nodes are added I do want to change the failure domain to host, but osd is sufficient for now.
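For reference, my understanding is that switching to a host failure domain later would only mean changing the choose step to something like

    step chooseleaf indep 0 type host

instead of the current "step choose indep 0 type osd" (untested, and it would of course need at least 13 hosts for k=10, m=3).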
Can anyone help me figure out whether the rule is causing these problems, or whether I should be looking at something else?