OSD down causes slow ops on all OSDs

petersun@xxxxxxxxxxxx · Mon, 27 Mar 2023 22:32:23 -0000

We experienced a Ceph failure that left the cluster unresponsive, with no IOPS or throughput, because of a problematic OSD process on one node. Every other OSD in the cluster reported slow ops and served no I/O. The incident timeline is as follows:

1. An alert was triggered for an OSD problem.
2. 6 of the 12 OSDs on the node were down.
3. We attempted a soft restart, but the smartmontools process hung while the server was shutting down.
4. We performed a hard restart, and service resumed as usual.
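
For reference, the condition we saw (several OSDs down on a single node) could be checked with something like the sketch below. It is only an illustration: it assumes the input is a dict shaped like the JSON from `ceph osd tree --format json` (a "nodes" list whose host entries list their OSD ids in "children", and whose OSD entries carry a "status" of "up" or "down"). The function name and the sample data are hypothetical and are not taken from our cluster.

```python
def down_osds_by_host(osd_tree):
    """Return {host_name: (down_count, total_count)} for hosts with a down OSD.

    Assumes `osd_tree` resembles the output of `ceph osd tree --format json`.
    """
    nodes_by_id = {node["id"]: node for node in osd_tree["nodes"]}
    summary = {}
    for node in osd_tree["nodes"]:
        if node.get("type") != "host":
            continue
        # Resolve the host's child ids to the OSD entries we know about.
        osds = [
            nodes_by_id[child]
            for child in node.get("children", [])
            if child in nodes_by_id and nodes_by_id[child].get("type") == "osd"
        ]
        down = sum(1 for osd in osds if osd.get("status") == "down")
        if down:
            summary[node["name"]] = (down, len(osds))
    return summary


# Hypothetical sample data: one host with 2 of 4 OSDs down, one healthy host.
sample_tree = {
    "nodes": [
        {"id": -2, "name": "node-a", "type": "host", "children": [0, 1, 2, 3]},
        {"id": 0, "name": "osd.0", "type": "osd", "status": "down"},
        {"id": 1, "name": "osd.1", "type": "osd", "status": "down"},
        {"id": 2, "name": "osd.2", "type": "osd", "status": "up"},
        {"id": 3, "name": "osd.3", "type": "osd", "status": "up"},
        {"id": -3, "name": "node-b", "type": "host", "children": [4, 5]},
        {"id": 4, "name": "osd.4", "type": "osd", "status": "up"},
        {"id": 5, "name": "osd.5", "type": "osd", "status": "up"},
    ]
}

for host, (down, total) in down_osds_by_host(sample_tree).items():
    print(f"{host}: {down} of {total} OSDs down")
```

A check like this only reports OSDs that the cluster has already marked down. It would not by itself catch an OSD process that is hung but still considered up.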

Our Ceph cluster has 19 nodes and 218 OSDs, and runs version 15.2.17 octopus (stable).

Questions:
1. What is Ceph's failure detection mechanism? Why couldn't Ceph detect the faulty node and automatically stop using its resources?
2. Did we miss any patches or bug fixes?
3. Do you have suggestions for detecting and avoiding similar issues quickly in the future?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx