Re: Who and why can mark OSD down?

Sage Weil <sage@xxxxxxxxxxxx> · Thu, 16 Aug 2018 16:54:39 +0000 (UTC)

On Thu, 16 Aug 2018, Aleksei Gutikov wrote:
> Hi
> 
> I know of two events that can trigger an OSD being marked down:
> - other OSDs report failed peering
> - the MON does not receive a heartbeat or report
> 
> All these health checks seem to evaluate only the OSD's networking
> capabilities. Is there any implemented way to trigger OSD down if the
> object store gets stuck? Is an OSD allowed to mark itself down?

The OSD has various internal checks that will cause it to exit (and thus 
be marked down) if there are problems.  Those include checks for EIO and 
internal heartbeats that will trigger an OSD suicide if critical threads 
get stuck without making progress.

filestore_op_thread_suicide_timeout = 180
filestore_op_thread_timeout = 60
osd_command_thread_suicide_timeout = 900
osd_command_thread_timeout = 600
osd_op_thread_suicide_timeout = 150
osd_op_thread_timeout = 15
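
To illustrate how each timeout/suicide_timeout pair above relates, here is a
minimal Python sketch of a two-threshold internal heartbeat. It is not the
OSD's actual code: the names ThreadHeartbeat, touch and check are
hypothetical, the values are assumed to be in seconds, and the exact
comparison at the boundary is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ThreadHeartbeat:
    """Hypothetical per-thread heartbeat record (illustration only)."""
    name: str
    timeout: float          # stall time after which the thread counts as unhealthy
    suicide_timeout: float  # stall time after which the process should exit
    last_progress: float = 0.0

    def touch(self, now: float) -> None:
        # Called by the worker thread whenever it makes progress.
        self.last_progress = now


def check(hb: ThreadHeartbeat, now: float) -> str:
    """Classify a thread by how long it has gone without progress."""
    stalled = now - hb.last_progress
    if stalled >= hb.suicide_timeout:
        return "suicide"    # the daemon would exit and thus be marked down
    if stalled >= hb.timeout:
        return "unhealthy"  # stuck, but not yet long enough to exit
    return "healthy"


# Values taken from osd_op_thread_timeout / osd_op_thread_suicide_timeout.
op_thread = ThreadHeartbeat("osd_op", timeout=15, suicide_timeout=150)
op_thread.touch(now=0.0)
for t in (10.0, 20.0, 150.0):
    print(t, check(op_thread, now=t))
```

The point of the sketch is only that the shorter value flags a stuck thread
and the longer one is what makes the daemon give up and exit.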

sage