I like this. There are some similar ideas we can probably borrow from Cassandra's handling of disk failures:
# policy for data disk failures:

Regards
Stanley

On 19/09/17 9:16 PM, Manuel Lausch wrote:
> On Tue, 19 Sep 2017 08:24:48 +0000, Adrian Saul <Adrian.Saul@xxxxxxxxxxxxxxxxx> wrote:
>>> I understand what you mean and it's indeed dangerous, but see:
>>> https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service
>>>
>>> Looking at the systemd docs, it's difficult though:
>>> https://www.freedesktop.org/software/systemd/man/systemd.service.html
>>>
>>> If the OSD crashes due to another bug, you do want it to restart. But
>>> systemd cannot tell whether the crash was due to a disk I/O error, a bug
>>> in the OSD itself, the OOM killer, or something else.
>>
>> Perhaps use something like RestartPreventExitStatus and define a specific
>> exit code for the OSD to use when it exits due to an I/O error.
>
> Another idea: the OSD daemon keeps running in a defined error state and
> only stops its listeners for other OSDs and for clients.
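
To make the RestartPreventExitStatus suggestion concrete, here is a rough sketch in Python of the daemon side: map a disk I/O error to a dedicated exit code and everything else to a generic failure. This is only an illustration, not how ceph-osd behaves today. The exit code 74 (EX_IOERR in sysexits.h) and the names EXIT_DISK_IO_ERROR, exit_code_for, run and failing_disk are all assumptions made up for the example.

```python
import errno

# Hypothetical exit code reserved for "the data disk failed". 74 matches
# EX_IOERR from sysexits.h, but Ceph does not define such a code.
EXIT_DISK_IO_ERROR = 74


def exit_code_for(exc):
    """Map an exception from the daemon's main loop to a process exit code."""
    if isinstance(exc, OSError) and exc.errno == errno.EIO:
        # Disk I/O error: the service manager should not restart the daemon.
        return EXIT_DISK_IO_ERROR
    # Any other failure (e.g. a bug): a restart is still wanted.
    return 1


def run(main_loop):
    """Run main_loop and return the exit code the process should use."""
    try:
        main_loop()
    except Exception as exc:
        return exit_code_for(exc)
    return 0


def failing_disk():
    # Stand-in for a main loop that hits a dead disk (illustration only).
    raise OSError(errno.EIO, "Input/output error")
```

A daemon built this way would finish with sys.exit(run(main_loop)), and the unit file would list the same number in RestartPreventExitStatus= so that systemd keeps restarting after other crashes but leaves the service down after an I/O error.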
Stanley Zhang | Senior Operations Engineer |
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com