OSD down causes slow ops on all OSDs

We experienced a Ceph failure in which a problematic OSD process on one node made the cluster unresponsive: client IOPS and throughput dropped to zero and all other OSDs in the cluster reported slow ops. The incident timeline was as follows (a sketch of the kind of diagnostic commands involved follows the timeline):

An alert was triggered for an OSD problem.
6 of the 12 OSDs on the node were down.
A soft restart was attempted, but a stuck smartmontools process prevented the server from shutting down cleanly.
A hard restart was performed and service resumed as usual.
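For context, here is a minimal sketch of the kind of Ceph CLI checks that surface this state; osd.42 is a placeholder id, not one of our daemons:

  # Cluster health summary, including any SLOW_OPS warnings.
  ceph health detail

  # OSDs the monitors currently consider down, grouped by host.
  ceph osd tree down

  # Per-OSD commit/apply latency; a single abnormally slow OSD stands out here.
  ceph osd perf

  # In-flight operations on a suspect OSD, queried via its admin socket
  # on the node that hosts it.
  ceph daemon osd.42 dump_ops_in_flight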

Our Ceph cluster has 19 nodes and 218 OSDs and runs version 15.2.17 (octopus, stable).

Questions:
1. What is Ceph's detection mechanism in this situation? Why couldn't Ceph detect the faulty node and automatically take its OSDs out of service?
2. Did we miss any relevant patches or bug fixes?
3. Do you have suggestions for detecting and avoiding similar issues more quickly in the future, e.g. via the settings sketched below?
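For reference on questions 1 and 3: down detection relies on OSD-to-OSD heartbeats plus the monitors marking OSDs down and then out, while slow ops are reported once an op exceeds an age threshold. The sketch below lists the settings usually involved, with their documented Octopus defaults; the values are shown for illustration, not as changes we have applied or recommend:

  # Seconds without a heartbeat reply before peers report an OSD down
  # to the monitors.
  ceph config set osd osd_heartbeat_grace 20

  # Number of distinct reporters the monitors require before marking an
  # OSD down.
  ceph config set mon mon_osd_min_down_reporters 2

  # Seconds a down OSD stays "in" before the monitors mark it out and
  # its PGs are re-mapped to other OSDs.
  ceph config set mon mon_osd_down_out_interval 600

  # Age in seconds after which an in-flight op is counted as a slow op.
  ceph config set osd osd_op_complaint_time 30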
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


