Hello Abdelillah,

Thanks for replying. Since the kernel was in a hung state, we were not able
to log in to the node until we rebooted it.

On previous occasions we saw the logs below whenever an OSD was marked down
by the monitor nodes:

blk_update_request: I/O error, dev nvme0n1, sector 745934488
blk_update_request: I/O error, dev nvme0n1, sector 745934136
blk_update_request: I/O error, dev nvme0n1, sector 745934168
blk_update_request: I/O error, dev nvme0n1, sector 745933912
blk_update_request: I/O error, dev nvme0n1, sector 745933656
blk_update_request: I/O error, dev nvme0n1, sector 745934216
blk_update_request: I/O error, dev nvme0n1, sector 745934088
blk_update_request: I/O error, dev nvme0n1, sector 745933640
blk_update_request: I/O error, dev nvme0n1, sector 745934104
blk_update_request: I/O error, dev nvme0n1, sector 745933528
Buffer I/O error on dev dm-0, logical block 1802612857, lost async page write
Buffer I/O error on dev dm-0, logical block 1802612857, lost async page write
Buffer I/O error on dev dm-0, logical block 3750729712, async page read
Buffer I/O error on dev dm-0, logical block 0, async page read
Buffer I/O error on dev dm-0, logical block 0, async page read
Buffer I/O error on dev dm-0, logical block 0, async page read

We are not logging to /var/log/messages due to storage constraints, and
because the node was hung and then rebooted, nothing got recorded this time
in any of the logs, be it the OSD logs or dmesg.

Regards
Prayank Saxena

On Thu, 3 Feb 2022 at 03:43, Abdelillah Asraoui <aasraoui@xxxxxxxxx> wrote:

> What kind of disk failure are you seeing on disk01?
>
> On Wed, Feb 2, 2022 at 10:27 AM Prayank Saxena <pr31189@xxxxxxxxx> wrote:
>
>> Hello Everyone,
>>
>> We have run into an issue where the OS hanging on an OSD host causes the
>> cluster to stop ingesting data.
>>
>> Ceph cluster details:
>>
>> Ceph Object Storage v14.2.22
>> No. of monitor nodes: 5
>> No. of RGW nodes: 5
>> No. of OSDs: 252 (all NVMe)
>> OS: CentOS 7.9
>> Kernel: 3.10.0-1160.45.1.el7.x86_64
>> Network: one 25G NIC port (used as both public and cluster network)
>>
>> Data node hardware spec:
>> FW: EDB8007Q
>> Model: SAMSUNG MZ4LB15THMLA-00003
>>
>> Our architecture has two OSDs per host, and disk01 is shared with the
>> kernel (OS):
>>
>>                 ┌──────┐
>>          ┌──────┤disk01│
>> ┌────────┤      └──────┘
>> │osd-host│
>> └────────┤
>>          │      ┌──────┐
>>          └──────┤disk02│
>>                 └──────┘
>>
>> disk01 is detected as failed and is out of the cluster, but disk02 is not.
>>
>> The OSD host's kernel suddenly hangs (we cannot ssh in). This happens from
>> time to time, and the OSDs residing on that host have always been flagged
>> as down.
>>
>> This time, when the host hung, disk01 was flagged as down but disk02 was
>> not. At the same time all RGW services alerted as down, yet on checking,
>> the services appeared to be up. This caused the cluster to stop ingesting
>> data; only after the OSD host was shut down did we regain cluster
>> ingestion. No other OSDs were down and out, and the cluster was in OK
>> state before we had this incident.
>>
>> *From the logs:*
>> --------
>> 2022-01-25 15:42:10.994227 mon.cluster01-mon-001 (mon.0) 25030 : cluster [DBG] osd.57 reported failed by osd.118
>> 2022-01-25 15:42:11.340162 mon.cluster01-mon-001 (mon.0) 25031 : cluster [DBG] osd.57 reported failed by osd.248
>> 2022-01-25 15:42:11.340435 mon.cluster01-mon-001 (mon.0) 25032 : cluster [INF] osd.57 failed (root=default,rack=rack004,host=cluster01-osd-1029) (2 reporters from different host after 23.000120 >= grace 20.000000)
>> 2022-01-25 15:42:11.395065 mon.cluster01-mon-001 (mon.0) 25033 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
>> --------
>>
>> We have encountered OS hang situations before, but every time the OSDs
>> were marked down by the cluster after the grace period, and client I/O was
>> never impacted. This time, because of the issue above, the RGW nodes were
>> not able to take any requests.
>>
>> Has anyone encountered a similar case where a kernel hang does not flag
>> the OSD as down? If yes, are there any proactive measures we can take to
>> avoid such an incident? (see the sketches after this message)
>>
>> Also, if any more information is needed, please do let me know. Any help
>> would be appreciated.
>>
>> Regards
>> Prayank Saxena
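
A minimal health-check sketch for the blk_update_request / Buffer I/O errors
quoted at the top of this thread, assuming nvme-cli and smartmontools are
installed on the OSD host and that /dev/nvme0n1 (taken from the log lines) is
the affected device:

# Controller-level SMART data: critical_warning, media_errors and
# percentage_used are the first fields to watch for failing media.
nvme smart-log /dev/nvme0n1 | egrep 'critical_warning|media_errors|percentage_used'

# Full SMART/health output plus the drive's error information log.
smartctl -a /dev/nvme0n1
nvme error-log /dev/nvme0n1 | head -40

# Scan the kernel ring buffer for the same I/O errors without relying on
# /var/log/messages.
dmesg -T | egrep -i 'blk_update_request|buffer i/o error|nvme'

Nautilus also has cluster-wide device health tracking (ceph device ls,
ceph device get-health-metrics <devid>), which should expose the same SMART
counters from the monitors, assuming the devicehealth mgr module is enabled.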
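
On the proactive-measures question: a sketch of the knobs involved, assuming
the defaults visible in the osd.57 log line (grace 20 s, two reporters from
different hosts). Since the hung host apparently kept answering heartbeats
for disk02's OSD, making the node panic and reboot on a hung task may matter
more than the grace value itself; the sysctl numbers below are illustrative,
not recommendations, and osd.57 / mon.cluster01-mon-001 are used only because
they appear in the log excerpt above.

# Failure-detection settings actually in effect (ceph daemon talks to the
# admin socket, so run each command on the host of that daemon).
ceph daemon osd.57 config get osd_heartbeat_grace
ceph daemon mon.cluster01-mon-001 config get mon_osd_min_down_reporters
ceph daemon mon.cluster01-mon-001 config get mon_osd_reporter_subtree_level

# The centralized config database can adjust them cluster-wide, e.g.:
# ceph config set osd osd_heartbeat_grace 20

# Let a hung node fail fast instead of lingering half-alive, so its peers
# and the monitors see it as down (example values; test before rolling out):
sysctl -w kernel.hung_task_timeout_secs=120
sysctl -w kernel.hung_task_panic=1
sysctl -w kernel.panic=10    # auto-reboot 10 s after a panic

If a hung node reliably panics and reboots, the peer heartbeat failures and
the two-reporter rule should mark its OSDs down within the grace period,
which is what happened in the earlier incidents described above.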