Hi,
I have a small containerized Ceph cluster rolled out with ceph-ansible.
The WAL and DB for each OSD are on a separate NVMe drive, and the data is
on spinning SAS disks. The cluster is running 16.2.7.
Today a disk failed, but not quite catastrophically. The block device is
still present and the LVM metadata is intact, but reading certain blocks
logs 'Sense: Unrecovered read error' in the syslog (SMART indicates the
drive is failing). The OSD crashes when it reads from or writes to the disk.
The container kept restarting and crashing until I intervened manually.
Because of this restart loop the faulty OSD was flapping up and down, so
it was never marked out and the cluster never rebalanced.
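For illustration, by manual intervention I mean roughly the following
(osd.12 is just a placeholder id, and I'm assuming the ceph-osd@<id>
unit name that ceph-ansible creates):

    systemctl stop ceph-osd@12.service   # break the restart loop
    ceph osd out 12                      # allow the cluster to rebalance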
I could set StartLimitIntervalSec and StartLimitBurst in the OSD service
file, but they are not there by default and I'd like to keep everything as
standard as possible.
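For reference, what I have in mind is roughly this drop-in (the file name
and the values are just an example, not something I'm running):

    # /etc/systemd/system/ceph-osd@.service.d/limit-restarts.conf
    [Unit]
    # Give up restarting after 3 crashes within 30 minutes, so the OSD
    # stays down long enough to be marked out.
    StartLimitIntervalSec=30min
    StartLimitBurst=3

followed by a systemctl daemon-reload.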
What other options are there to prevent OSD containers from restarting
after a legitimate crash?
Regards,
Frank de Bot