Hi,
On 11/4/22 09:45, Adrian Nicolae wrote:
Hi,
We have a Pacific cluster (16.2.4) with 30 servers and 30 osds. We
started increasing the pg_num for the data bucket more than a month
ago. I usually added 64 pgs in every step and didn't have any issues.
The cluster was healthy before increasing the pgs.
Today I added 128 pgs and the cluster is stuck with some pgs unknown
and some others in the peering state. I've restarted a few osds with
slow_ops and even a few hosts, but it didn't change anything. We don't
have any networking issues. Do you have any suggestions? Our service
is completely down ...
*snipsnap*
Do some of the OSDs exceed the PGs-per-OSD limit? If this is the case,
the affected OSDs will not allow peering, and I/O to those OSDs will be
stuck.
You can check the number of PGs per OSD in the 'ceph osd df tree'
output. To solve this problem you can increase the limit, e.g. by
setting 'mon_max_pg_per_osd' via 'ceph config'. The default limit is
200 AFAIK.
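For reference, roughly the commands I mean (the target value of 400 is
just an example, pick something appropriate for your cluster, and
verify the current default on your release):

```shell
# Show per-OSD utilization; the PGS column on the right gives the
# number of PGs placed on each OSD.
ceph osd df tree

# Check the currently effective limit on the monitors.
ceph config get mon mon_max_pg_per_osd

# Raise the limit cluster-wide; 400 here is an arbitrary example value.
ceph config set global mon_max_pg_per_osd 400
```

Once the limit is above the actual per-OSD PG count, the stuck PGs
should be able to peer again.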
Regards,
Burkhard
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx