Re: PG bottlenecks

root@ceph-ceph1:/var/log/ceph# ceph osd blocked-by
osd num_blocked
 32           1
 12           1
 14           1
 37           1
 23           1
 51           2
 46           1
 19           1
  9           1
 33           1
 24           2
 44           2
 53           7
 50           2
 35           1
 48           4
 45           2
 49           2
  5        2803
  3        2781
  2        2785
  1        2694
  4        2753


That's interesting... I checked the logs of osd.5, but they look normal.
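
For anyone following along, the next thing I plan to look at is what osd.5 is actually sitting on, not just its log. Roughly (a sketch only; the admin socket commands below are what I'd expect to be available on 12.2.x, and being on osd.5's host is assumed):

# from any mon/client node: which PGs are stuck and what health reports as blocked/slow
ceph health detail | grep -i -E 'blocked|slow'
ceph pg dump_stuck inactive

# on the node hosting osd.5, via the admin socket
ceph daemon osd.5 ops                 # ops currently in flight
ceph daemon osd.5 dump_blocked_ops    # ops blocked the longest
ceph daemon osd.5 dump_historic_ops   # recently completed slow ops
ceph daemon osd.5 status              # osdmap epoch the daemon has reached

# then for one of the activating/peering PGs that maps to osd.5
ceph pg <pgid> query

If the blocked ops all point at the same peer, or at the OSD lagging behind on osdmaps, that should narrow it down.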


Best Regards,

Rafał Wądołowski

On 25.03.2019 11:11, Sage Weil wrote:
> What does 'ceph osd blocked-by' show?
>
> On Mon, 25 Mar 2019, Rafał Wądołowski wrote:
>
>> This issue happened a week ago, so I don't have the output from pg query.
>>
>> Now on the test cluster I am observing similar problems. Output from the
>> query is attached.
>>
>>   data:
>>     pools:   5 pools, 32800 pgs
>>     objects: 11.53E objects, 62.2GiB
>>     usage:   176GiB used, 1.05TiB / 1.23TiB avail
>>     pgs:     30.899% pgs not active
>>              20193 active+clean
>>              7573  activating+degraded
>>              2525  activating
>>              2460  active+recovery_wait+degraded
>>              14    remapped+peering
>>              11    down
>>              5     activating+degraded+remapped
>>              4     activating+remapped
>>              3     active+recovery_wait+degraded+remapped
>>              2     stale+active+clean
>>              2     peering
>>              2     active+clean+remapped
>>              1     active+undersized+degraded
>>              1     activating+undersized+degraded
>>              1     active+recovery_wait+undersized+degraded
>>              1     active+recovery_wait
>>              1     active+recovering
>>              1     active+recovering+degraded
>>
>> pool 5 'test' erasure size 6 min_size 4 crush_rule 1 object_hash
>> rjenkins pg_num 32768 pgp_num 32768 last_change 153 lfor 0/150 flags
>> hashpspool stripe_width 16384 application rbd
>>
>> It looks like the cluster is blocked by something... This cluster is on 12.2.11
>>
>>
>> Best Regards,
>>
>> Rafał Wądołowski
>>
>> On 25.03.2019 10:56, Sage Weil wrote:
>>> On Mon, 25 Mar 2019, Rafał Wądołowski wrote:
>>>> Hi,
>>>>
>>>> On one of our clusters (3400 OSDs, ~25 PB, 12.2.4), we increased pg_num &
>>>> pgp_num on one pool (EC 4+2) from 32k to 64k. After that, the cluster was
>>>> unstable for about an hour; PGs were inactive (some activating, some
>>>> peering).
>>>>
>>>> Any idea what bottleneck we hit? Any ideas on what I should change in the
>>>> configuration of Ceph or the OS?
>>> Could be lots of things. 
>>>
>>> What does 'ceph tell <pgid> query' show for one of the activating or 
>>> peering pgs?
>>>
>>> Note that you're moving ~half of the data around in your cluster with that
>>> change, so you will see each of those PGs cycle through backfill ->
>>> peering -> activating -> active in the course of it moving.
>>>
>>> sage
_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com