Re: Troubleshooting hanging storage backend whenever there is any cluster change

Hi, in our `ceph.conf` we have:

  mon_max_pg_per_osd = 300

While the host is offline (9 OSDs down):

  4352 PGs * 3 / 62 OSDs ~ 210 PGs per OSD

If all OSDs are online:

  4352 PGs * 3 / 71 OSDs ~ 183 PGs per OSD

... so this doesn't seem to be the issue.
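For anyone reading along, those numbers can also be cross-checked against
the live cluster, roughly like this (a sketch; `mon.a` is a placeholder
for the actual mon name):

  # Effective limit as a monitor sees it, via its admin socket:
  ceph daemon mon.a config get mon_max_pg_per_osd

  # Actual placement group count per OSD (the PGS column):
  ceph osd df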

If I understood you correctly, that's what you meant. If I got you wrong,
would you mind pointing me to one of those threads you mentioned?

Thanks :)

On 10/12/2018 02:03 PM, Burkhard Linke wrote:
> Hi,
> 
> 
> On 10/12/2018 01:55 PM, Nils Fahldieck - Profihost AG wrote:
>> I rebooted a Ceph host and logged `ceph status` & `ceph health detail`
>> every 5 seconds. During this I encountered 'PG_AVAILABILITY Reduced data
>> availability: pgs peering'. At the same time some VMs hung as described
>> before.
> 
> Just a wild guess... you have 71 OSDs and about 4500 PGs with size=3,
> i.e. roughly 13500 PG instances overall, resulting in ~190 PGs per OSD
> under normal circumstances.
> 
> If one host is down and the PGs have to re-peer, you might reach the
> limit of 200 PGs per OSD on some of the OSDs, resulting in stuck peering.
> 
> You can try to raise this limit. There are several threads on the
> mailing list about this.
> 
> Regards,
> Burkhard
> 
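For the archives, in case someone does hit that limit: raising it would
look roughly like this (a sketch, untested here; 400 is just an example
value):

  # Permanently, via ceph.conf on the mon hosts (restart required):
  #   mon_max_pg_per_osd = 400

  # Or injected at runtime, without a restart:
  ceph tell mon.* injectargs '--mon_max_pg_per_osd=400'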
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com