Hi,

On 28.09.2018 18:04, Vladimir Brik wrote:
> Hello,
>
> I've attempted to increase the number of placement groups of the pools in our test cluster and now ceph status (below) is reporting problems. I am not sure what is going on or how to fix this. Troubleshooting scenarios in the docs don't seem to quite match what I am seeing.
>
> I have no idea how to begin to debug this. I see OSDs listed in "blocked_by" of pg dump, but don't know how to interpret that. Could somebody assist please? I attached output of "ceph pg dump_stuck -f json-pretty" just in case.
>
> The cluster consists of 5 hosts, each with 16 HDDs and 4 SSDs. I am running 13.2.2.
>
> This is the affected pool:
>
> pool 6 'fs-data-ec-ssd' erasure size 5 min_size 4 crush_rule 6 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 2493 lfor 0/2491 flags hashpspool,ec_overwrites stripe_width 12288 application cephfs
Just a guess: are you running into the PGs-per-OSD limit? Since Luminous, an OSD will stop accepting new PGs once a certain number of PGs on that OSD (default afaik 200) is reached. The affected PGs then stay in the activating state, which looks similar to your output.
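To see which PGs are stuck and which OSDs they map to, something like this should work (just a sketch; the pgs_brief column layout may differ slightly between releases):

    # print pgid, state, up set and acting set of every PG whose state contains "activating"
    ceph pg dump pgs_brief 2>/dev/null | awk '$2 ~ /activating/ {print $1, $2, $3, $5}'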
The pool you mention has 2048 PGs with size=5, i.e. about 10,000 PG instances; spread over 100 OSDs that is roughly 100 PGs per OSD from that pool alone. Your output mentions an overall PG count of 5120, so there are probably other pools, too.
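Rough back-of-envelope (size=3 for the other pools is just my assumption, I don't know their actual settings):

    2048 PGs * 5 shards            = 10240 PG instances
    10240 instances / 100 OSDs    ~=   102 PGs per OSD from this pool
    (5120 - 2048) PGs * 3 replicas =  9216 further instances
    9216 instances / 100 OSDs     ~=    92 additional PGs per OSD

That already puts the average near 200, and with an uneven distribution some OSDs can easily exceed it.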
You can check this by running 'ceph osd df'; the last column is the number of PGs on the OSD. If this number is >= 200, the OSD will not accept new PGs.
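For example, to list the OSDs with the highest PG counts (assuming the PGS column is the last one, as it is on your release):

    # print PG count and OSD id, highest first; skip the header and summary lines
    ceph osd df | awk 'NR > 1 && $1 ~ /^[0-9]+$/ {print $NF, "osd." $1}' | sort -rn | head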
You can adjust the limits with the mon_max_pg_per_osd and osd_max_pg_per_osd_hard_ratio settings. See the Ceph documentation for more details.
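A minimal sketch for raising the soft limit (300 is just an example value; the hard limit at which OSDs refuse new PGs is mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio):

    # in ceph.conf on the mons (and mgrs), then restart them:
    [global]
    mon_max_pg_per_osd = 300

    # or, on Mimic, via the central config database:
    # ceph config set global mon_max_pg_per_osd 300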
Regards,
Burkhard