Re: Ceph Failure and OSD Node Stuck Incident

"Fox, Kevin M" <Kevin.Fox@xxxxxxxx> · Thu, 30 Mar 2023 16:00:57 +0000

I've seen this in production on two separate occasions as well. One OSD gets stuck and a bunch of PGs go into the laggy state.

ceph pg dump | grep laggy

shows that all the laggy PGs share the same OSD.

Restarting the affected osd restored full service.
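
To make the "same OSD" check less manual, here is a minimal sketch that tallies which OSD ids appear in the up/acting sets of laggy PGs. It assumes the dump text prints those sets as bracketed lists such as [12,40,7] and the state as a whitespace-separated field such as active+clean+laggy. The function name and the sample rows are made up for illustration and are not from a real cluster.

```python
import re
from collections import Counter

def osds_shared_by_laggy_pgs(pg_dump_text):
    """Return a Counter mapping OSD id -> number of laggy PGs that list it."""
    counts = Counter()
    for line in pg_dump_text.splitlines():
        fields = line.split()
        # Assumption: the PG state is one field made of '+'-joined flags.
        if not any("laggy" in field.split("+") for field in fields):
            continue
        osds = set()
        # Assumption: up/acting sets are printed as [id,id,...].
        for group in re.findall(r"\[([\d,]+)\]", line):
            osds.update(int(x) for x in group.split(",") if x)
        # Count each OSD once per PG, even if it is in both up and acting.
        counts.update(osds)
    return counts

# Invented sample rows, only to show the expected input shape.
SAMPLE = """\
2.1a active+clean+laggy [12,40,7] 12 [12,40,7] 12
2.2b active+clean [3,9,21] 3 [3,9,21] 3
2.3c active+clean+laggy [40,5,18] 40 [40,5,18] 40
"""

for osd, pgs in osds_shared_by_laggy_pgs(SAMPLE).most_common(3):
    print(f"osd.{osd}: {pgs} laggy PGs")
```

On a live cluster the text would come from the pg dump output instead of SAMPLE. An OSD whose count equals the total number of laggy PGs is the first candidate for a restart.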

________________________________________
From: Ramin Najjarbashi <ramin.najarbashi@xxxxxxxxx>
Sent: Thursday, March 30, 2023 7:47 AM
To: petersun@xxxxxxxxxxxx
Cc: ceph-users@xxxxxxx
Subject:  Re: Ceph Failure and OSD Node Stuck Incident

On Thu, Mar 30, 2023 at 6:08 PM <petersun@xxxxxxxxxxxx> wrote:

> We encountered a Ceph failure where the system became unresponsive with no
> IOPS or throughput after encountering a failed node. Upon investigation, it
> appears that the OSD process on one of the Ceph storage nodes is stuck, but
> ping is still responsive. However, during the failure, Ceph was unable to
> recognize the problematic node, which resulted in all other OSDs in the
> cluster experiencing slow operations and no IOPS in the cluster at all.
>
> Here's the timeline of the incident:
>
> - At 10:40, an alert is triggered, indicating a problem with the OSD.
> - After the alert, Ceph becomes unresponsive with no IOPS or throughput.
> - At 11:26, an engineer discovers that there is a gradual OSD failure,
> with 6 out of 12 OSDs on the node being down.
> - At 11:46, the Ceph engineer is unable to SSH into the faulty node and
> attempts a soft restart, but the "smartmontools" process is stuck while
> shutting down the server. Ping works during this time.
> - After waiting for about one or two minutes, a hard restart is attempted
> for the server.
> - At 11:57, after the Ceph node starts normally, service resumes as usual,
> indicating that the issue has been resolved.
>
> Here is some basic information about our services:
>
> - `Mon: 5 daemons, quorum host001, host002, host003, host004, host005 (age
> 4w)`
> - `Mgr: host005 (active, since 4w), standbys: host001, host002, host003,
> host004`
> - `Osd: 218 osds: 218 up (since 22h), 218 in (since 22h)`
>
> We have a cluster with 19 nodes, including 15 SSD nodes and 4 HDD nodes.
> In total, there are 218 OSDs. The SSD nodes have 11 OSDs with Samsung EVO
> 870 SSDs, and each drive's DB/WAL is on a 1.6T NVMe drive. We are using Ceph version
> 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable).
>
> Here is the health check detail:
> [root@node21 ~]#  ceph health detail
> HEALTH_WARN 1 osds down; Reduced data availability: 12 pgs inactive, 12
> pgs peering; Degraded data redundancy: 272273/43967625 objects degraded
> (0.619%), 88 pgs degraded, 5 pgs undersized; 18192 slow ops, oldest one
> blocked for 3730 sec, daemons
> [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]...
> have slow ops.
> [WRN] OSD_DOWN: 1 osds down
>         osd.174 (root=default,host=hkhost031) is down
> [WRN] PG_AVAILABILITY: Reduced data availability: 12 pgs inactive, 12 pgs
> peering
>         pg 2.dc is stuck peering for 49m, current state peering, last
> acting [87,95,172]
>         pg 2.e2 is stuck peering for 15m, current state peering, last
> acting [51,177,97]
>
> ......
>         pg 2.f7e is active+undersized+degraded, acting [10,214]
>         pg 2.f84 is active+undersized+degraded, acting [91,52]
> [WRN] SLOW_OPS: 18192 slow ops, oldest one blocked for 3730 sec, daemons
> [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]...
> have slow ops.
>
> I have the following questions:
>
> 1. Why couldn't Ceph detect the faulty node and automatically abandon its
> resources? Can anyone provide more troubleshooting guidance for this case?
>

Ceph is designed to detect and respond to OSD and node failures in the
cluster. One possible explanation is that the OSD process on the node was
stuck on I/O while still answering heartbeats, so neither its peers nor the
monitors had grounds to mark it down. To troubleshoot this, start by
checking the OSD logs on the failed node for error messages related to the
OSD process or any other relevant issues.

> 2. What is Ceph's detection mechanism and where can I find related
> information? All of our production cloud machines were affected and
> suspended. If RBD is unstable, we cannot continue to use Ceph technology
> for our RBD source.
>

Failure detection in Ceph is coordinated by the Ceph Monitors, but the
monitors do not probe the OSDs themselves. Each OSD exchanges heartbeats
with its peer OSDs and reports a peer that stops answering (see
osd_heartbeat_grace). Once enough reporters agree
(mon_osd_min_down_reporters), the monitors mark that OSD down and its PGs
re-peer onto the remaining OSDs. An OSD that stays down long enough
(mon_osd_down_out_interval) is marked out and its data is rebalanced to
other OSDs. The "Configuring Monitor/OSD Interaction" page in the Ceph
documentation describes this in detail.
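
As a rough illustration of why a stuck OSD can go unnoticed, here is a simplified sketch of a down-marking rule driven by peer heartbeat reports. It is not the monitor's actual code. The class and function names are hypothetical, the real logic has more inputs, and the default values below are what I recall for osd_heartbeat_grace and mon_osd_min_down_reporters, so please verify them on your cluster with "ceph config get".

```python
from dataclasses import dataclass

@dataclass
class FailureReport:
    reporter_host: str   # host of the peer OSD that filed the report
    silent_for: float    # seconds since the target last answered a heartbeat

def should_mark_down(reports, grace=20.0, min_reporters=2):
    """Simplified rule: enough distinct hosts saw the target silent past grace.

    Assumed defaults: grace ~ osd_heartbeat_grace, min_reporters ~
    mon_osd_min_down_reporters. Reporters are counted per host here.
    """
    hosts = {r.reporter_host for r in reports if r.silent_for >= grace}
    return len(hosts) >= min_reporters

# An OSD that is truly dead: peers on two different hosts report it.
dead = [FailureReport("host002", 25.0), FailureReport("host003", 31.0)]
print(should_mark_down(dead))

# An OSD that is stuck on I/O but still answers heartbeats: nobody files
# a report, so it stays "up" no matter how many slow ops pile up behind it.
print(should_mark_down([]))
```

The point of the sketch is that slow ops are not an input to this rule. Only missed heartbeats are, which would match what you saw if the stuck OSDs kept answering their peers.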

> 3. Did we miss any patches or bug fixes?
>

> 4. Is there anyone who can suggest improvements and how we can quickly
> detect and avoid similar issues in the future?

> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx