Hello everyone,

We ran into an issue where an OS hang on an OSD host caused the cluster to stop ingesting data.

Below are the Ceph cluster details:

Ceph Object Storage v14.2.22
No. of monitor nodes: 5
No. of RGW nodes: 5
No. of OSDs: 252 (all NVMe)
OS: CentOS 7.9
Kernel: 3.10.0-1160.45.1.el7.x86_64
Network: one 25G NIC port (used as both public and cluster network)
Data node hardware spec:
  FW: EDB8007Q
  MODEL: SAMSUNG MZ4LB15THMLA-00003

Our architecture has two OSDs per host; disk01 is shared with the kernel.

                ┌──────┐
         ┌──────┤disk01│
┌────────┤      └──────┘
│osd-host│
└────────┤
         │      ┌──────┐
         └──────┤disk02│
                └──────┘

disk01 was detected as failed and marked out of the cluster, but disk02 was not.

The kernel on the OSD host suddenly hung (we could not ssh in). This happens occasionally, but normally the OSDs residing on that host are always flagged as down. This time, when the OSD host hung, disk01 was flagged as down and disk02 was not. At the same time, all RGW services were alerting as down, although when we checked them the services appeared to be up. This caused the cluster to stop ingesting data; ingestion resumed only after the OSD host was shut down.

No other OSDs were down or out, and the cluster was in an OK state before this incident.

*from logs*
--------
2022-01-25 15:42:10.994227 mon.cluster01-mon-001 (mon.0) 25030 : cluster [DBG] osd.57 reported failed by osd.118
2022-01-25 15:42:11.340162 mon.cluster01-mon-001 (mon.0) 25031 : cluster [DBG] osd.57 reported failed by osd.248
2022-01-25 15:42:11.340435 mon.cluster01-mon-001 (mon.0) 25032 : cluster [INF] osd.57 failed (root=default,rack=rack004,host=cluster01-osd-1029) (2 reporters from different host after 23.000120 >= grace 20.000000)
2022-01-25 15:42:11.395065 mon.cluster01-mon-001 (mon.0) 25033 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
--------

We have encountered OS hangs before, but every time the OSDs were marked down by the cluster after the grace period, and client I/O was never impacted.
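
One way to check whether the monitors received any failure reports for the second OSD on the host is to tally the "reported failed by" lines per OSD in the cluster log. The following is a minimal sketch, assuming the log lines have the same shape as the excerpt above. The function name `summarize_failure_reports` and the two regular expressions are illustrative and not part of Ceph, and the sample data is just the four lines quoted above.

```python
import re
from collections import defaultdict

# Hypothetical patterns, written against the two message shapes in the excerpt.
REPORT_RE = re.compile(r"\bosd\.(\d+) reported failed by osd\.(\d+)")
FAILED_RE = re.compile(r"\[INF\] osd\.(\d+) failed \(")


def summarize_failure_reports(lines):
    """Tally failure reports from an iterable of cluster log lines.

    Returns (reporters, marked_failed):
      reporters      maps a target OSD id to the set of OSD ids that reported it
      marked_failed  is the set of OSD ids the monitor declared failed
    """
    reporters = defaultdict(set)
    marked_failed = set()
    for line in lines:
        match = REPORT_RE.search(line)
        if match:
            reporters[int(match.group(1))].add(int(match.group(2)))
            continue
        match = FAILED_RE.search(line)
        if match:
            marked_failed.add(int(match.group(1)))
    return dict(reporters), marked_failed


# The four log lines quoted in this post.
SAMPLE_LOG = """\
2022-01-25 15:42:10.994227 mon.cluster01-mon-001 (mon.0) 25030 : cluster [DBG] osd.57 reported failed by osd.118
2022-01-25 15:42:11.340162 mon.cluster01-mon-001 (mon.0) 25031 : cluster [DBG] osd.57 reported failed by osd.248
2022-01-25 15:42:11.340435 mon.cluster01-mon-001 (mon.0) 25032 : cluster [INF] osd.57 failed (root=default,rack=rack004,host=cluster01-osd-1029) (2 reporters from different host after 23.000120 >= grace 20.000000)
2022-01-25 15:42:11.395065 mon.cluster01-mon-001 (mon.0) 25033 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
"""

if __name__ == "__main__":
    reporters, marked_failed = summarize_failure_reports(SAMPLE_LOG.splitlines())
    for osd, reported_by in sorted(reporters.items()):
        state = "marked failed" if osd in marked_failed else "NOT marked failed"
        names = ", ".join(f"osd.{r}" for r in sorted(reported_by))
        print(f"osd.{osd}: reported by {names} -> {state}")
```

To run it against a real log, pass an open file handle for the monitor's cluster log instead of `SAMPLE_LOG.splitlines()`. An OSD that never appears as a key in `reporters` had no failure reports logged at all, which would be the interesting case for disk02.
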
This time, however, the RGW nodes were not able to serve any requests because of the issue described above.

Has anyone encountered a similar case, where a kernel hang does not cause the OSD to be flagged as down? If so, are there any proactive measures we can take to avoid such an incident?

If any more information is needed, please let me know. Any help would be appreciated.

Regards,
Prayank Saxena