OSD Container keeps restarting after drive crash

Hi,

I have a small containerized Ceph cluster rolled out with ceph-ansible, running 16.2.7. The WAL and DB for each drive are on a separate NVMe drive; the data is on spinning SAS disks. Today a disk failed, but not quite catastrophically: the block device is still present and the LVM metadata is good, but reading certain blocks logs 'Sense: Unrecovered read error' in the syslog (SMART indicates the drive is failing). The OSD crashes on reading/writing.

The container kept restarting and crashing until I intervened manually. Because of this, the faulty OSD was flapping up and down, so it was never marked out and the cluster never rebalanced. I could set StartLimitIntervalSec and StartLimitBurst in the OSD service file, but they are not there by default and I'd like to keep everything as standard as possible. What other options are there to prevent OSD containers from restarting after a valid crash?
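For reference, the rate-limit settings I mean could go in a systemd drop-in rather than the packaged unit file itself, so the shipped unit stays untouched. A minimal sketch, assuming the containerized OSD unit is named ceph-osd@<id>.service (the name and values here are my assumption, adjust to your deployment):

```
# /etc/systemd/system/ceph-osd@.service.d/override.conf
# (create via: systemctl edit ceph-osd@.service)
[Unit]
# Give up restarting if the OSD fails 3 times within 30 minutes,
# so a persistently failing drive is left down and marked out.
StartLimitIntervalSec=30min
StartLimitBurst=3
```

After creating the drop-in, run `systemctl daemon-reload` for it to take effect. But again, this is extra local configuration, which is what I was hoping to avoid.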

Regards,

Frank de Bot
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
