Re: Who and why can mark OSD down?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 16 Aug 2018 13:03:43 -0400

On Thu, Aug 16, 2018 at 12:54 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 16 Aug 2018, Aleksei Gutikov wrote:
>> Hi
>>
>> I know of two events that can trigger an OSD being marked down:
>> - other OSDs report failed peering
>> - the MON does not receive a heartbeat or report
>>
>> All of these health checks seem to evaluate only the OSD's network connectivity.
>> Is there any implemented way to mark an OSD down if the object store gets stuck?
>> Is an OSD allowed to mark itself down?
>
> The OSD has various internal checks that will cause it to exit (and thus
> be marked down) if there are problems.  Those include checks for EIO and
> internal heartbeats that will trigger an OSD suicide if critical threads
> get stuck without making progress.
>
> filestore_op_thread_suicide_timeout = 180
> filestore_op_thread_timeout = 60
> osd_command_thread_suicide_timeout = 900
> osd_command_thread_timeout = 600
> osd_op_thread_suicide_timeout = 150
> osd_op_thread_timeout = 15

...and before the OSD kills itself, it will stop responding to network
pings from other OSDs if it gets into a bad enough state, so they can
mark it down even if it hasn't actually left the network.
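
To make the two-stage behaviour concrete, here is a rough Python model of how a
timeout / suicide_timeout pair could interact. This is only an illustrative
sketch, not the OSD's actual code: the Watchdog and ThreadHandle names, the
injectable clock, and the three state strings are all made up for the example,
and it assumes the option values above are in seconds.

```python
import time
from dataclasses import dataclass


@dataclass
class ThreadHandle:
    # Hypothetical per-thread record; the field names are illustrative only.
    name: str
    timeout: float          # stall time after which the thread counts as unhealthy
    suicide_timeout: float  # stall time after which the daemon would exit
    last_progress: float = 0.0


class Watchdog:
    """Simplified model of an internal heartbeat check (not Ceph's code)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._handles = []

    def add_thread(self, name, timeout, suicide_timeout):
        handle = ThreadHandle(name, timeout, suicide_timeout, self._clock())
        self._handles.append(handle)
        return handle

    def touch(self, handle):
        # A worker thread calls this every time it makes progress.
        handle.last_progress = self._clock()

    def check(self):
        """Return 'healthy', 'unhealthy', or 'suicide'."""
        now = self._clock()
        state = "healthy"
        for handle in self._handles:
            stalled = now - handle.last_progress
            if stalled >= handle.suicide_timeout:
                # In a real daemon this is where the process would abort.
                return "suicide"
            if stalled >= handle.timeout:
                state = "unhealthy"
        return state

    def should_reply_to_ping(self):
        # Model of "stop responding to pings while in a bad state".
        return self.check() == "healthy"
```

With osd_op_thread_timeout = 15 and osd_op_thread_suicide_timeout = 150, a
thread in this model that stalls for 20 seconds makes the watchdog report
"unhealthy" (so pings go unanswered and peers can report the OSD), and only
after 150 seconds without progress does it reach "suicide".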