Thanks to everybody who responded. The problem was, indeed, that I hit the limit on the number of PGs per SSD OSD when I increased the number of PGs in a pool.

One question, though: should I have received a warning that some OSDs were close to their maximum PG limit? A while back, in a Luminous test pool, I remember seeing something like "too many PGs per OSD", but not this time (perhaps because this time I hit the limit during the resizing operation). Where might such a warning be recorded if not in "ceph status"?

Thanks,

Vlad

On 09/28/2018 01:04 PM, Paul Emmerich wrote:
> I guess the pool is mapped to SSDs only, judging from the name, and you
> only have 20 SSDs, so you should have roughly 2000 effective PGs, taking
> replication into account.
>
> Your pool has ~10k effective PGs with k+m=5, and you seem to have 5
> more pools...
>
> Check "ceph osd df tree" to see how many PGs per OSD you have.
>
> Try increasing these two options to "fix" it:
>
> mon max pg per osd
> osd max pg per osd hard ratio
>
>
> Paul
>
> On Fri, Sep 28, 2018 at 18:05, Vladimir Brik
> <vladimir.brik@xxxxxxxxxxxxxxxx> wrote:
>>
>> Hello
>>
>> I've attempted to increase the number of placement groups of the pools
>> in our test cluster, and now ceph status (below) is reporting problems.
>> I am not sure what is going on or how to fix this. The troubleshooting
>> scenarios in the docs don't seem to quite match what I am seeing.
>>
>> I have no idea how to begin to debug this. I see OSDs listed in
>> "blocked_by" of pg dump, but I don't know how to interpret that. Could
>> somebody assist, please?
>>
>> I attached the output of "ceph pg dump_stuck -f json-pretty" just in case.
>>
>> The cluster consists of 5 hosts, each with 16 HDDs and 4 SSDs. I am
>> running 13.2.2.
>>
>> This is the affected pool:
>> pool 6 'fs-data-ec-ssd' erasure size 5 min_size 4 crush_rule 6
>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 2493 lfor
>> 0/2491 flags hashpspool,ec_overwrites stripe_width 12288 application cephfs
>>
>>
>> Thanks,
>>
>> Vlad
>>
>>
>> ceph status
>>
>>   cluster:
>>     id:     47caa1df-42be-444d-b603-02cad2a7fdd3
>>     health: HEALTH_WARN
>>             Reduced data availability: 155 pgs inactive, 47 pgs peering,
>>             64 pgs stale
>>             Degraded data redundancy: 321039/114913606 objects degraded
>>             (0.279%), 108 pgs degraded, 108 pgs undersized
>>
>>   services:
>>     mon: 5 daemons, quorum ceph-1,ceph-2,ceph-3,ceph-4,ceph-5
>>     mgr: ceph-3(active), standbys: ceph-2, ceph-5, ceph-1, ceph-4
>>     mds: cephfs-1/1/1 up {0=ceph-5=up:active}, 4 up:standby
>>     osd: 100 osds: 100 up, 100 in; 165 remapped pgs
>>
>>   data:
>>     pools:   6 pools, 5120 pgs
>>     objects: 22.98 M objects, 88 TiB
>>     usage:   154 TiB used, 574 TiB / 727 TiB avail
>>     pgs:     3.027% pgs not active
>>              321039/114913606 objects degraded (0.279%)
>>              4903 active+clean
>>              105  activating+undersized+degraded+remapped
>>              61   stale+active+clean
>>              47   remapped+peering
>>              3    stale+activating+undersized+degraded+remapped
>>              1    active+clean+scrubbing+deep
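
For the archive, here is how the numbers work out as far as I can tell, assuming fs-data-ec-ssd maps only to the 20 SSD OSDs (the actual per-OSD counts are what "ceph osd df tree" reports in its PGS column):

    # Rough per-OSD load from this one pool alone:
    # 2048 PGs x 5 shards (k+m=5), spread over 20 SSD OSDs
    echo $((2048 * 5 / 20))     # => 512 PG shards per SSD OSD, before the other pools

    # Actual per-OSD counts (PGS column); filtering on the ssd device class
    # keeps the output short
    ceph osd df tree | grep -E 'PGS|ssd'

512 from this pool alone is already above the default mon_max_pg_per_osd (250, if I am reading the defaults right), and with the other pools on top it can cross the hard limit (mon_max_pg_per_osd times osd_max_pg_per_osd_hard_ratio, a ratio of 2 or 3 by default depending on the release), at which point OSDs stop accepting new PGs and they sit in "activating", which matches the status output above.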
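
About the missing warning: if I understand it correctly, the "too many PGs per OSD" check (TOO_MANY_PGS) does show up in "ceph status" and "ceph health detail", but it is computed from the cluster-wide average of PGs per OSD rather than the per-OSD maximum, so a cluster where only the 20 SSD OSDs are overloaded while the 80 HDD OSDs are nearly empty can stay under the threshold and never warn. These are just the places I would look; mon.ceph-1 stands in for whichever mon you run the admin socket command against, on that mon's host:

    # Structured health checks (TOO_MANY_PGS would appear here if the soft limit trips)
    ceph health detail

    # The soft limit the mon is actually using
    ceph daemon mon.ceph-1 config get mon_max_pg_per_osd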
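
And for the two options Paul mentioned, here is roughly how they map to commands. The values are only illustrative, not a recommendation; on 13.2 the centralized config store ("ceph config set") should work, and injectargs is the fallback on older releases (the hard ratio may still want an OSD restart to fully take effect):

    # Raise the soft limit and the hard-limit ratio (illustrative values)
    ceph config set global mon_max_pg_per_osd 400
    ceph config set global osd_max_pg_per_osd_hard_ratio 4

    # Runtime injection on clusters without the config store
    ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd=400'
    ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio=4'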