Ceph cluster stopped client I/O when OSD host hangs

Hello Everyone,


We have run into an issue where an OS hang on an OSD host causes the cluster
to stop ingesting data.



Below are the Ceph cluster details:

Ceph Object Storage v14.2.22
No. of monitor nodes: 5
No. of RGW nodes: 5
No. of OSDs: 252 (all NVMe)
OS: CentOS 7.9
Kernel: 3.10.0-1160.45.1.el7.x86_64
Network: one 25G NIC port (used as both public and cluster network)

Data node hardware spec:
   FW: EDB8007Q
   MODEL: SAMSUNG MZ4LB15THMLA-00003



Our architecture has two OSDs per host, and disk01 is shared with the OS/kernel:



                ┌──────┐
         ┌──────┤disk01│
┌────────┤      └──────┘
│osd-host│
└────────┤
         │      ┌──────┐
         └──────┤disk02│
                └──────┘



disk01 was detected as failed and is out of the cluster, but disk02 is not.
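In case it helps, here is a minimal sketch (untested as written, assuming the `ceph` CLI and an admin keyring are available on an admin node) of listing which OSDs share each host and their up/down status from the CRUSH tree, which is how the disk01/disk02 OSDs can be mapped back to their host:

#!/usr/bin/env python
# Sketch: list the OSDs under each CRUSH host bucket with their status.
# Uses `ceph osd tree -f json`; OSD entries have non-negative ids.
import json
import subprocess

out = subprocess.check_output(["ceph", "osd", "tree", "-f", "json"])
tree = json.loads(out)
nodes = {n["id"]: n for n in tree["nodes"]}

for node in tree["nodes"]:
    if node["type"] != "host":
        continue
    osds = [nodes[c] for c in node.get("children", []) if c >= 0]
    states = ", ".join("osd.%d(%s)" % (o["id"], o["status"]) for o in osds)
    print("%s: %s" % (node["name"], states))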



The OSD host's kernel suddenly hangs (we cannot ssh to it). This happens from
time to time, and normally the OSDs residing on that host are always flagged as down.

This time, when the host hung, disk01's OSD was flagged as down but disk02's
was not. At the same time all RGW services were alerting as down, although
when we checked, the services appeared to be up. This caused the cluster to
stop ingesting data, and only after the hung OSD host was shut down did
cluster ingestion resume. No other OSDs were down or out, and the cluster was
in HEALTH_OK before this incident.





*from logs*

--------

2022-01-25 15:42:10.994227 mon.cluster01-mon-001 (mon.0) 25030 : cluster
[DBG] osd.57 reported failed by osd.118

2022-01-25 15:42:11.340162 mon.cluster01-mon-001 (mon.0) 25031 : cluster
[DBG] osd.57 reported failed by osd.248

2022-01-25 15:42:11.340435 mon.cluster01-mon-001 (mon.0) 25032 : cluster
[INF] osd.57 failed (root=default,rack=rack004,host=cluster01-osd-1029) (2
reporters from different host after 23.000120 >= grace 20.000000)

2022-01-25 15:42:11.395065 mon.cluster01-mon-001 (mon.0) 25033 : cluster
[WRN] Health check failed: 1 osds down (OSD_DOWN)

 --------
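For reference, these are the options behind the "2 reporters from different host after 23.000120 >= grace 20.000000" message above. A minimal sketch (assuming the `ceph` CLI on an admin node) to print them; note that on 14.2.x `ceph config get` reads the mon's central config database, so values set only in ceph.conf would need to be checked with `ceph daemon <id> config get` on the daemon's host instead:

#!/usr/bin/env python
# Sketch: print the heartbeat / failure-reporting settings that drive the
# "reported failed ... >= grace" decision shown in the log excerpt.
import subprocess

OPTIONS = [
    ("osd", "osd_heartbeat_grace"),            # the 20s grace in the log line
    ("mon", "mon_osd_min_down_reporters"),     # the "2 reporters" in the log line
    ("mon", "mon_osd_reporter_subtree_level"), # "from different host"
    ("mon", "mon_osd_down_out_interval"),      # how long until down becomes out
]

for who, opt in OPTIONS:
    val = subprocess.check_output(["ceph", "config", "get", who, opt])
    print("%-35s %s" % (opt, val.decode().strip()))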


We have encountered OS hangs before, but every time the OSDs were marked down
by the cluster after the grace period and client I/O was never impacted. This
time, due to the issue above, the RGW nodes were not able to take any requests.

Has anyone encountered a similar case where a kernel hang does not result in
the OSD being flagged as down?

If yes, any proactive measures that we can take to avoid such an incident?
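For example, one possible measure is an external watchdog that explicitly marks the OSDs of an unreachable host down with `ceph osd down`, instead of waiting for the failure-reporting path. Below is a rough, untested sketch; the ping probe and the assumption that CRUSH host bucket names are resolvable hostnames are ours, and a partially hung kernel may still answer ping, so an SSH or admin-socket check may be a better liveness probe:

#!/usr/bin/env python
# Sketch: if an OSD host stops answering ping, mark its OSDs down so client
# I/O does not keep waiting on a half-dead kernel. `ceph osd down` only marks
# the OSD down; it will rejoin on its own if the host recovers, and marking
# it "out" is left as a separate manual decision.
import json
import os
import subprocess

def host_osds():
    """Return {host_name: [osd_id, ...]} from the CRUSH tree."""
    tree = json.loads(subprocess.check_output(["ceph", "osd", "tree", "-f", "json"]))
    return {n["name"]: [c for c in n.get("children", []) if c >= 0]
            for n in tree["nodes"] if n["type"] == "host"}

def host_alive(host):
    """Single ping with a short timeout (assumes host name is resolvable)."""
    with open(os.devnull, "w") as devnull:
        return subprocess.call(["ping", "-c", "1", "-W", "2", host],
                               stdout=devnull, stderr=devnull) == 0

for host, osds in host_osds().items():
    if not host_alive(host):
        for osd_id in osds:
            subprocess.call(["ceph", "osd", "down", str(osd_id)])
        print("marked OSDs %s on unreachable host %s down" % (osds, host))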

Also, if any more information is needed, please do let me know.
Any help would be appreciated.

Regards
Prayank Saxena
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



