Hi Peter,

From my experience, I would recommend replacing the Samsung Evo SSDs with datacenter SSDs.

Regards,
Joachim

________________________________________
Clyso GmbH - Ceph Foundation Member

<petersun@xxxxxxxxxxxx> wrote on Thu, 30 March 2023, 16:37:

> We encountered a Ceph failure where the system became unresponsive, with no
> IOPS or throughput, after a node failed. Upon investigation, it appears that
> the OSD processes on one of the Ceph storage nodes were stuck while the node
> still responded to ping. During the failure, Ceph was unable to recognize the
> problematic node, which resulted in all other OSDs in the cluster
> experiencing slow operations and no IOPS in the cluster at all.
>
> Here is the timeline of the incident:
>
> - At 10:40, an alert is triggered, indicating a problem with the OSD.
> - After the alert, Ceph becomes unresponsive with no IOPS or throughput.
> - At 11:26, an engineer discovers a gradual OSD failure, with 6 out of 12
>   OSDs on the node down.
> - At 11:46, the Ceph engineer is unable to SSH into the faulty node and
>   attempts a soft restart, but the "smartmontools" process hangs while the
>   server is shutting down. Ping still works during this time.
> - After waiting about one or two minutes, a hard restart of the server is
>   attempted.
> - At 11:57, after the Ceph node starts normally, service resumes as usual,
>   indicating that the issue has been resolved.
>
> Here is some basic information about our services:
>
> - `Mon: 5 daemons, quorum host001, host002, host003, host004, host005 (age 4w)`
> - `Mgr: host005 (active, since 4w), standbys: host001, host002, host003, host004`
> - `Osd: 218 osds: 218 up (since 22h), 218 in (since 22h)`
>
> We have a cluster of 19 nodes (15 SSD nodes and 4 HDD nodes) with 218 OSDs
> in total. Each SSD node has 11 OSDs on Samsung EVO 870 SSDs, and each drive
> has its DB/WAL on a 1.6T NVMe drive. We are running Ceph version 15.2.17
> (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable).
>
> Here is the health check detail:
>
> [root@node21 ~]# ceph health detail
> HEALTH_WARN 1 osds down; Reduced data availability: 12 pgs inactive, 12
> pgs peering; Degraded data redundancy: 272273/43967625 objects degraded
> (0.619%), 88 pgs degraded, 5 pgs undersized; 18192 slow ops, oldest one
> blocked for 3730 sec, daemons
> [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]...
> have slow ops.
> [WRN] OSD_DOWN: 1 osds down
>     osd.174 (root=default,host=hkhost031) is down
> [WRN] PG_AVAILABILITY: Reduced data availability: 12 pgs inactive, 12 pgs peering
>     pg 2.dc is stuck peering for 49m, current state peering, last acting [87,95,172]
>     pg 2.e2 is stuck peering for 15m, current state peering, last acting [51,177,97]
>     ......
>     pg 2.f7e is active+undersized+degraded, acting [10,214]
>     pg 2.f84 is active+undersized+degraded, acting [91,52]
> [WRN] SLOW_OPS: 18192 slow ops, oldest one blocked for 3730 sec, daemons
> [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]...
> have slow ops.
>
> I have the following questions:
>
> 1. Why couldn't Ceph detect the faulty node and automatically abandon its
>    resources? Can anyone provide more troubleshooting guidance for this case?
> 2. What is Ceph's detection mechanism, and where can I find related
>    information? All of our production cloud machines were affected and
>    suspended. If RBD is unstable, we cannot continue to use Ceph for our RBD
>    storage. (See the configuration sketch below for the settings involved.)
> 3. Did we miss any patches or bug fixes?
> 4. Is there anyone who can suggest improvements, and how can we quickly
>    detect and avoid similar issues in the future?
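For questions 2 and 4 above: OSD failure detection in Ceph is driven by peer-to-peer heartbeats between OSDs plus periodic OSD beacons to the monitors, and a handful of options control how quickly a silent or wedged OSD is marked down (and later out). As a minimal sketch, not specific to this cluster, these are the stock upstream settings and one way to inspect them with the `ceph config` interface available in Octopus (the defaults noted in the comments are the upstream ones):

ceph config get osd osd_heartbeat_interval      # how often an OSD pings its peers (default 6s)
ceph config get osd osd_heartbeat_grace         # missed-heartbeat grace before a peer is reported down (default 20s)
ceph config get mon mon_osd_min_down_reporters  # distinct reporters required before the monitors mark an OSD down (default 2)
ceph config get mon mon_osd_report_timeout      # mark an OSD down if it sends no beacon for this long (default 900s)
ceph config get mon mon_osd_down_out_interval   # how long a down OSD stays "in" before it is marked out and recovery starts (default 600s)
ceph config get osd osd_op_complaint_time       # age at which an operation is reported as a slow op (default 30s)

The "Configuring Monitor/OSD Interaction" section of the Ceph documentation describes this mechanism in more detail.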