On Thu, Aug 16, 2018 at 12:54 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 16 Aug 2018, Aleksei Gutikov wrote:
>> Hi
>>
>> I know of two possible events that trigger OSD down:
>> - other OSDs report failed peering
>> - the MON does not receive a heartbeat or report
>>
>> All of these health checks seem to evaluate only the networking
>> capabilities of the OSD.
>> Is there any implemented way to trigger OSD down if the object store
>> gets stuck?
>> Is an OSD allowed to mark itself down?
>
> The OSD has various internal checks that will cause it to exit (and thus
> be marked down) if there are problems.  Those include checks for EIO and
> internal heartbeats that will trigger an OSD suicide if critical threads
> get stuck without making progress.
>
>   filestore_op_thread_suicide_timeout = 180
>   filestore_op_thread_timeout = 60
>   osd_command_thread_suicide_timeout = 900
>   osd_command_thread_timeout = 600
>   osd_op_thread_suicide_timeout = 150
>   osd_op_thread_timeout = 15

...and before the OSD kills itself, it will stop responding to network
pings from other OSDs if it gets into a bad enough state, so they can
mark it down even if it has not actually left the network.
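
To illustrate how each timeout / suicide-timeout pair in the list above interacts, here is a simplified Python model. It is a sketch only and not Ceph's actual code: the names ThreadHeartbeat, worst_state and should_answer_ping are hypothetical, and it assumes a thread counts as stalled once the elapsed time reaches the threshold. Only the two osd_op_thread values come from the list above.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ThreadHeartbeat:
    """Progress tracker for one worker thread (hypothetical model, not Ceph code)."""
    name: str
    timeout: float          # seconds without progress before "unhealthy"
    suicide_timeout: float  # seconds without progress before the daemon exits
    last_progress: float = field(default_factory=time.monotonic)

    def touch(self, now=None):
        """Record that the thread made progress."""
        self.last_progress = time.monotonic() if now is None else now

    def state(self, now=None):
        """Return 'healthy', 'unhealthy', or 'suicide' for this thread."""
        now = time.monotonic() if now is None else now
        stalled = now - self.last_progress
        if stalled >= self.suicide_timeout:
            return "suicide"
        if stalled >= self.timeout:
            return "unhealthy"
        return "healthy"


def worst_state(heartbeats, now=None):
    """Return the most severe state across all tracked threads."""
    order = {"healthy": 0, "unhealthy": 1, "suicide": 2}
    return max((hb.state(now) for hb in heartbeats),
               key=order.__getitem__, default="healthy")


def should_answer_ping(heartbeats, now=None):
    """Answer peer pings only while every tracked thread is healthy."""
    return worst_state(heartbeats, now) == "healthy"


if __name__ == "__main__":
    # Values taken from osd_op_thread_timeout / osd_op_thread_suicide_timeout.
    op = ThreadHeartbeat("osd_op", timeout=15, suicide_timeout=150,
                         last_progress=0.0)
    for t in (10, 20, 150):
        print(t, op.state(now=t), should_answer_ping([op], now=t))
```

In this model, a thread stalled for 20 seconds is already "unhealthy", so pings go unanswered and peers can report the OSD down long before the 150-second suicide threshold would make the daemon exit.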