Hi,

On 28.09.2018 18:04, Vladimir Brik wrote:
> Hello,
>
> I've attempted to increase the number of placement groups of the pools in our test cluster and now ceph status (below) is reporting problems. I am not sure what is going on or how to fix this. Troubleshooting scenarios in the docs don't seem to quite match what I am seeing.
>
> I have no idea how to begin to debug this. I see OSDs listed in "blocked_by" of pg dump, but don't know how to interpret that. Could somebody assist please? I attached output of "ceph pg dump_stuck -f json-pretty" just in case.
>
> The cluster consists of 5 hosts, each with 16 HDDs and 4 SSDs. I am running 13.2.2.
>
> This is the affected pool:
>
> pool 6 'fs-data-ec-ssd' erasure size 5 min_size 4 crush_rule 6 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 2493 lfor 0/2491 flags hashpspool,ec_overwrites stripe_width 12288 application cephfs
Just a guess: are you running into the PGs-per-OSD limit? Since Luminous, an OSD will stop accepting new PGs once a certain number of PGs on that OSD (default afaik 200) is reached. The affected PGs then stay in the activating state, which looks similar to your output.
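To see which PGs are stuck and which OSDs they map to, something like this should work (just a sketch; the pgs_brief column layout may differ slightly between releases):

    # print pgid, state, up set and acting set of every PG whose state contains "activating"
    ceph pg dump pgs_brief 2>/dev/null | awk '$2 ~ /activating/ {print $1, $2, $3, $5}'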
The pool you mention has 2048 PGs with size=5, i.e. about 10,000 PG instances; spread over 100 OSDs that is roughly 100 PGs per OSD from that pool alone. Your output mentions an overall PG count of 5120, so there are probably other pools, too.
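Rough back-of-envelope (size=3 for the other pools is just my assumption, I don't know their actual settings):

    2048 PGs * 5 shards            = 10240 PG instances
    10240 instances / 100 OSDs    ~=   102 PGs per OSD from this pool
    (5120 - 2048) PGs * 3 replicas =  9216 further instances
    9216 instances / 100 OSDs     ~=    92 additional PGs per OSD

That already puts the average near 200, and with an uneven distribution some OSDs can easily exceed it.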
You can check this by running 'ceph osd df'; the last column is the number of PGs on the OSD. If this number is >= 200, the OSD will not accept new PGs.
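For example, to list the OSDs with the highest PG counts (assuming the PGS column is the last one, as it is on your release):

    # print PG count and OSD id, highest first; skip the header and summary lines
    ceph osd df | awk 'NR > 1 && $1 ~ /^[0-9]+$/ {print $NF, "osd." $1}' | sort -rn | head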
You can adjust the limits with the mon_max_pg_per_osd and osd_max_pg_per_osd_hard_ratio settings. See the Ceph documentation for more details.
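A minimal sketch for raising the soft limit (300 is just an example value; the hard limit at which OSDs refuse new PGs is mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio):

    # in ceph.conf on the mons (and mgrs), then restart them:
    [global]
    mon_max_pg_per_osd = 300

    # or, on Mimic, via the central config database:
    # ceph config set global mon_max_pg_per_osd 300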
Regards,
Burkhard