Re: Troubleshooting hanging storage backend whenever there is any cluster change

David Turner <drakonstein@xxxxxxxxx> · Fri, 12 Oct 2018 09:59:59 -0400

The number of PGs per OSD does not change unless the OSDs are marked out.  You have noout set, so it doesn't change at all during this test.  All of your PGs peered quickly at the beginning and were then active+undersized for the rest of the time, you never had any blocked requests, and you always had 100MB/s+ of client IO.  I didn't see anything wrong with your cluster to indicate that your clients had any problems whatsoever accessing data.
Can you confirm that you saw the same problems while you were running those commands?  The next possibility is that a client isn't getting an updated OSD map showing that the host and its OSDs are down, so it's stuck trying to communicate with host7.  That would point to the client being unable to communicate with the Mons.  Have you completely ruled out network problems between all nodes and all of the IPs in the cluster?  What does your client log show during these times?
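
One rough way to check basic reachability from a client to the Mons is a plain TCP connect test. This is only a sketch: the addresses below are placeholders (not taken from this thread), the helper name `can_connect` is made up for illustration, and 6789 is the default monitor port, so adjust it if your Mons listen elsewhere. A successful connect only shows that the port is reachable, not that the client is receiving OSD map updates.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Placeholder monitor addresses (assumption): replace them with the
# mon_host entries from the ceph.conf used by the affected client.
MON_ADDRS = [("192.0.2.11", 6789), ("192.0.2.12", 6789), ("192.0.2.13", 6789)]

if __name__ == "__main__":
    for host, port in MON_ADDRS:
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} {state}")
```

Running it in a loop on a hanging client while the host reboots would show whether the Mons drop out from that client's point of view.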

On Fri, Oct 12, 2018 at 8:35 AM Nils Fahldieck - Profihost AG <n.fahldieck@xxxxxxxxxxxx> wrote:
Hi, in our `ceph.conf` we have:

  mon_max_pg_per_osd = 300

While the host is offline (9 OSDs down):

  4352 PGs * 3 / 62 OSDs ~ 210 PGs per OSD

If all OSDs are online:

  4352 PGs * 3 / 71 OSDs ~ 183 PGs per OSD

... so this doesn't seem to be the issue.
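
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that assumes PG replicas are spread evenly over the OSDs that are up; the function name is made up for illustration, and the result is only an average, so individual OSDs can sit above it.

```python
def pgs_per_osd(pg_count: int, replica_size: int, osds_up: int) -> float:
    """Average number of PG replicas per OSD, assuming an even distribution."""
    return pg_count * replica_size / osds_up


PG_COUNT = 4352  # total PGs in the cluster
SIZE = 3         # replica count (size=3)
LIMIT = 300      # mon_max_pg_per_osd from ceph.conf

for label, osds in (("host offline", 62), ("all OSDs online", 71)):
    avg = pgs_per_osd(PG_COUNT, SIZE, osds)
    print(f"{label}: {avg:.1f} PGs per OSD, {LIMIT - avg:.1f} below the limit")
```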

If I understood you right, that's what you meant. If I got you wrong,
would you mind pointing me to one of those threads you mentioned?

Thanks :)

On 12.10.2018 at 14:03, Burkhard Linke wrote:

> Hi,
>
> On 10/12/2018 01:55 PM, Nils Fahldieck - Profihost AG wrote:
>> I rebooted a Ceph host and logged `ceph status` & `ceph health detail`
>> every 5 seconds. During this I encountered 'PG_AVAILABILITY Reduced data
>> availability: pgs peering'. At the same time some VMs hung as described
>> before.
>
> Just a wild guess... you have 71 OSDs and about 4500 PGs with size=3.
> 13500 PG instances overall, resulting in ~190 PGs per OSD under normal
> circumstances.
>
> If one host is down and the PGs have to re-peer, you might reach the
> limit of 200 PGs per OSD on some of the OSDs, resulting in stuck peering.
>
> You can try to raise this limit. There are several threads on the
> mailing list about this.
>
> Regards,
> Burkhard

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
