Re: Ceph Failure and OSD Node Stuck Incident

I have also seen unresponsive OSDs not being detected as "down". My best guess is that the MON ping probes are handled by a different thread, or in a queue separate from the disk IO queue, which means that "OSD responds to ping probes" does not imply "OSD is making progress". The latter is what one is, and should be, interested in. Replies to ping probes should be scheduled within the IO queue, and the OSD should only answer the MON if queued IO is actually making progress. This does not seem to be the case; there is an apparent inconsistency between ping responsiveness and IO (the OSD actually doing work).
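If you catch an OSD in this state again, a quick way to check whether it is actually processing IO, rather than merely answering heartbeats, is to query its admin socket on the OSD host. A sketch (osd.<id> is a placeholder for the OSD you suspect):

  ceph daemon osd.<id> dump_ops_in_flight   # operations currently queued or in flight
  ceph daemon osd.<id> dump_historic_ops    # recently completed operations

If ops keep piling up in dump_ops_in_flight while dump_historic_ops shows nothing recent, the OSD is stuck even though the MONs may still consider it up.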

This inconsistency leads to false-positive "OSD up" detection by the MONs. I have seen this during our upgrade from mimic to octopus, where it was just a side effect of some other trouble. In our case, all logical volumes on a disk with multiple OSDs per disk became unresponsive due to something one of the OSDs did. The MONs, however, did not mark all OSDs on this disk down, despite the fact that none of them was responding to IO requests. Back then I had other problems and didn't extract the log info for that.

I believe I have also seen sufficiently many down-reporters report an OSD as down while the MONs ignored that for some reason. This happened in a situation where a host went stale. I tried to diagnose the host and found that the disk controller was acting up. At some point I just rebooted it to get things working again. The symptoms were as you describe: some OSDs marked down, some not, none doing any IO. It's a very rare event, and with our SLAs it doesn't matter.

Instead of abandoning ceph, turning to something else and migrating all storage, it might be easier to collect logs, investigate which of the above scenarios actually happened, and get it fixed. You should, however, rule out configuration problems first.

For example, if you have changed any of the settings mon_osd_reporter_subtree_level, mon_osd_down_out_subtree_limit or mon_osd_min_down_reporters to non-standard values, you might run into bugs. I tested these settings and they do not do what the documentation says. The best way I found to confirm that OSD/host down detection works is to log in to a host and shut down networking. It is really important not to use anything that can lead to a clean shutdown of an OSD or another ceph daemon. Rip out the network (cables) and see what happens. With custom values for the settings above, I observe host down detection failing exactly in the way you describe. I haven't had time to report this yet. I can also only "test" this on our production system, because it requires a large cluster to make sense.
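A quick sanity check is to compare the running values with the upstream defaults (host / rack / 2, if I remember correctly). A sketch, assuming the ceph CLI is available on an admin node:

  ceph config get mon mon_osd_reporter_subtree_level
  ceph config get mon mon_osd_down_out_subtree_limit
  ceph config get mon mon_osd_min_down_reporters
  ceph config dump | grep mon_osd   # only shows values set in the config database

Note that anything set in a local ceph.conf on the hosts will not show up in "ceph config dump", so check those files as well.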

If you have fresh logs of the incident, please copy them out and try to reconstruct the sequence of events from the OSDs' and MONs' perspective. The relevant messages are there at the default log level. You should be able to find the point where things went wrong and file a bug report. Please keep the logs; these events are practically impossible to reproduce, and the logs might be the only clue the devs have.
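As a rough sketch of where to start, assuming default log locations under /var/log/ceph (the exact message wording varies a bit between releases):

  # on a MON host: the cluster log records failure reports and OSD state changes
  grep -E 'reported failed|marked|boot' /var/log/ceph/ceph.log

  # on the OSD hosts: heartbeat failures as seen by the OSDs themselves
  grep 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.*.log

Correlating the timestamps of the "reported failed" messages with the heartbeat failures should show whether the MONs received the failure reports and ignored them, or never received them at all.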

I know I should have done the same, and the issue might have been fixed already ... I hope you find the time to do the right thing.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: petersun@xxxxxxxxxxxx <petersun@xxxxxxxxxxxx>
Sent: Thursday, March 23, 2023 8:09 PM
To: ceph-users@xxxxxxx
Subject:  Ceph Failure and OSD Node Stuck Incident

We encountered a Ceph failure in which the cluster became unresponsive, with no IOPS or throughput, after a node failed. Upon investigation, it appears that the OSD process on one of the Ceph storage nodes was stuck while ping remained responsive. During the failure, Ceph was unable to recognize the problematic node, which resulted in all other OSDs in the cluster experiencing slow operations and no IOPS in the cluster at all.

Here's the timeline of the incident:

- At 10:40, an alert is triggered, indicating a problem with the OSD.
- After the alert, Ceph becomes unresponsive with no IOPS or throughput.
- At 11:26, an engineer discovers a gradual OSD failure, with 6 of the 12 OSDs on the node down.
- At 11:46, the Ceph engineer is unable to SSH into the faulty node and attempts a soft restart, but the "smartmontools" process gets stuck while the server is shutting down. Ping still works during this time.
- After waiting about one or two minutes, a hard restart of the server is attempted.
- At 11:57, after the Ceph node starts normally, service resumes as usual, indicating that the issue has been resolved.

Here is some basic information about our services:

- `Mon: 5 daemons, quorum host001, host002, host003, host004, host005 (age 4w)`
- `Mgr: host005 (active, since 4w), standbys: host001, host002, host003, host004`
- `Osd: 218 osds: 218 up (since 22h), 218 in (since 22h)`

We have a cluster with 19 nodes, including 15 SSD nodes and 4 HDD nodes, for a total of 218 OSDs. Each SSD node has 11 OSDs on Samsung EVO 870 SSDs, with each drive's DB/WAL on a 1.6 TB NVMe drive. We are using Ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable).

Here is the health check detail:
[root@node21 ~]#  ceph health detail
HEALTH_WARN 1 osds down; Reduced data availability: 12 pgs inactive, 12 pgs peering; Degraded data redundancy: 272273/43967625 objects degraded (0.619%), 88 pgs degraded, 5 pgs undersized; 18192 slow ops, oldest one blocked for 3730 sec, daemons [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]... have slow ops.
[WRN] OSD_DOWN: 1 osds down
        osd.174 (root=default,host=hkhost031) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 12 pgs inactive, 12 pgs peering
        pg 2.dc is stuck peering for 49m, current state peering, last acting [87,95,172]
        pg 2.e2 is stuck peering for 15m, current state peering, last acting [51,177,97]

......
        pg 2.f7e is active+undersized+degraded, acting [10,214]
        pg 2.f84 is active+undersized+degraded, acting [91,52]
[WRN] SLOW_OPS: 18192 slow ops, oldest one blocked for 3730 sec, daemons [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]... have slow ops.

I have the following questions:

1. Why couldn't Ceph detect the faulty node and automatically abandon its resources? Can anyone provide more troubleshooting guidance for this case?
2. What is Ceph's failure-detection mechanism, and where can I find related information? All of our production cloud machines were affected and suspended. If RBD is this unstable, we cannot continue to use Ceph for our RBD storage.
3. Did we miss any patches or bug fixes?
4. Can anyone suggest improvements, and ways to detect and avoid similar issues more quickly in the future?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



