Hi Peter,

From my experience, I would recommend replacing the Samsung Evo SSDs with datacenter SSDs.

Regards,
Joachim

________________________________________
Clyso GmbH - Ceph Foundation Member

<petersun@xxxxxxxxxxxx> wrote on Thu, 30 March 2023, 16:37:

> We encountered a Ceph failure where the system became unresponsive, with no
> IOPS or throughput, after a node failed. Upon investigation, it appears that
> the OSD processes on one of the Ceph storage nodes were stuck while the node
> still responded to ping. During the failure, Ceph was unable to recognize the
> problematic node, which resulted in all other OSDs in the cluster
> experiencing slow operations and no IOPS in the cluster at all.
>
> Here is the timeline of the incident:
>
> - At 10:40, an alert is triggered, indicating a problem with the OSD.
> - After the alert, Ceph becomes unresponsive with no IOPS or throughput.
> - At 11:26, an engineer discovers a gradual OSD failure, with 6 out of 12
>   OSDs on the node down.
> - At 11:46, the Ceph engineer is unable to SSH into the faulty node and
>   attempts a soft restart, but the "smartmontools" process hangs while the
>   server is shutting down. Ping still works during this time.
> - After waiting about one or two minutes, a hard restart of the server is
>   attempted.
> - At 11:57, after the Ceph node starts normally, service resumes as usual,
>   indicating that the issue has been resolved.
>
> Here is some basic information about our services:
>
> - `Mon: 5 daemons, quorum host001, host002, host003, host004, host005 (age 4w)`
> - `Mgr: host005 (active, since 4w), standbys: host001, host002, host003, host004`
> - `Osd: 218 osds: 218 up (since 22h), 218 in (since 22h)`
>
> We have a cluster of 19 nodes (15 SSD nodes and 4 HDD nodes) with 218 OSDs
> in total. Each SSD node has 11 OSDs on Samsung EVO 870 SSDs, and each drive
> has its DB/WAL on a 1.6T NVMe drive. We are running Ceph version 15.2.17
> (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable).
>
> Here is the health check detail:
>
> [root@node21 ~]# ceph health detail
> HEALTH_WARN 1 osds down; Reduced data availability: 12 pgs inactive, 12
> pgs peering; Degraded data redundancy: 272273/43967625 objects degraded
> (0.619%), 88 pgs degraded, 5 pgs undersized; 18192 slow ops, oldest one
> blocked for 3730 sec, daemons
> [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]...
> have slow ops.
> [WRN] OSD_DOWN: 1 osds down
>     osd.174 (root=default,host=hkhost031) is down
> [WRN] PG_AVAILABILITY: Reduced data availability: 12 pgs inactive, 12 pgs peering
>     pg 2.dc is stuck peering for 49m, current state peering, last acting [87,95,172]
>     pg 2.e2 is stuck peering for 15m, current state peering, last acting [51,177,97]
>     ......
>     pg 2.f7e is active+undersized+degraded, acting [10,214]
>     pg 2.f84 is active+undersized+degraded, acting [91,52]
> [WRN] SLOW_OPS: 18192 slow ops, oldest one blocked for 3730 sec, daemons
> [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]...
> have slow ops.
>
> I have the following questions:
>
> 1. Why couldn't Ceph detect the faulty node and automatically abandon its
>    resources? Can anyone provide more troubleshooting guidance for this case?
> 2. What is Ceph's detection mechanism, and where can I find related
>    information? All of our production cloud machines were affected and
>    suspended. If RBD is unstable, we cannot continue to use Ceph for our RBD
>    storage. (See the configuration sketch below for the settings involved.)
> 3. Did we miss any patches or bug fixes?
> 4. Is there anyone who can suggest improvements, and how can we quickly
>    detect and avoid similar issues in the future?
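For questions 2 and 4 above: OSD failure detection in Ceph is driven by peer-to-peer heartbeats between OSDs plus periodic OSD beacons to the monitors, and a handful of options control how quickly a silent or wedged OSD is marked down (and later out). As a minimal sketch, not specific to this cluster, these are the stock upstream settings and one way to inspect them with the `ceph config` interface available in Octopus (the defaults noted in the comments are the upstream ones):

ceph config get osd osd_heartbeat_interval      # how often an OSD pings its peers (default 6s)
ceph config get osd osd_heartbeat_grace         # missed-heartbeat grace before a peer is reported down (default 20s)
ceph config get mon mon_osd_min_down_reporters  # distinct reporters required before the monitors mark an OSD down (default 2)
ceph config get mon mon_osd_report_timeout      # mark an OSD down if it sends no beacon for this long (default 900s)
ceph config get mon mon_osd_down_out_interval   # how long a down OSD stays "in" before it is marked out and recovery starts (default 600s)
ceph config get osd osd_op_complaint_time       # age at which an operation is reported as a slow op (default 30s)

The "Configuring Monitor/OSD Interaction" section of the Ceph documentation describes this mechanism in more detail.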