Hi,
I have a small containerized Ceph cluster rolled out with ceph-ansible.
The WAL and DB for each OSD are on a separate NVMe drive, and the data is
on spinning SAS disks. The cluster is running 16.2.7.
Today a disk failed, but not quite catastrophically. The block device is
still present and the LVM metadata is intact, but reading certain blocks
logs 'Sense: Unrecovered read error' in the syslog (SMART indicates the
drive is failing). The OSD crashes when it reads from or writes to the disk.
The container kept restarting and crashing until I intervened manually.
Because of this restart loop the faulty OSD was flapping up and down, so
it was never marked out and the cluster never rebalanced.
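For illustration, by manual intervention I mean roughly the following
(osd.12 is just a placeholder id, and I'm assuming the ceph-osd@<id>
unit name that ceph-ansible creates):

    systemctl stop ceph-osd@12.service   # break the restart loop
    ceph osd out 12                      # allow the cluster to rebalance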
I could set StartLimitIntervalSec and StartLimitBurst in the OSD service
file, but they are not there by default and I'd like to keep everything as
standard as possible.
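For reference, what I have in mind is roughly this drop-in (the file name
and the values are just an example, not something I'm running):

    # /etc/systemd/system/ceph-osd@.service.d/limit-restarts.conf
    [Unit]
    # Give up restarting after 3 crashes within 30 minutes, so the OSD
    # stays down long enough to be marked out.
    StartLimitIntervalSec=30min
    StartLimitBurst=3

followed by a systemctl daemon-reload.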
What other options are there to prevent OSD containers from restarting
after a legitimate crash?
Regards,
Frank de Bot