Thanks for the reply, Paul.
> Yes, your understanding is correct. But the main mechanism by which
> OSDs are reported as down is that other OSDs report them as down with
> a much stricter timeout (20 seconds? 30 seconds? something like that).
Yes, I have seen the osd_heartbeat_grace of 20 seconds being hit from
time to time in setups with network configuration issues.
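
For anyone following along, these are the settings involved in the
peer-reporting path as I understand them (defaults taken from the
docs, shown in ceph.conf style; please correct me if I got any of
them wrong):

   [osd]
   # how often an OSD pings its heartbeat peers (default 6 seconds)
   osd heartbeat interval = 6
   # a peer that has not answered pings for this long is reported
   # down to the monitors (default 20 seconds)
   osd heartbeat grace = 20

   [mon]
   # number of distinct OSDs that must report a peer down before
   # the monitor marks it down (default 2)
   mon osd min down reporters = 2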
> It's quite rare to hit the "mon osd report timeout" (the usual
> scenario here is a network partition)
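
So, to summarize my understanding of the two paths with default
values (a sketch, assuming the peers can still reach the monitors):

   Peer-reporting path (the common case):
     t =   0s   OSD stops responding to peer heartbeats
     t ~  20s   peers report it down (osd heartbeat grace),
                the MON marks it down
     t ~ 620s   mon osd down out interval (600s) expires,
                the OSD is marked out and recovery starts

   mon osd report timeout (the rare case):
     only relevant when no peer reports arrive at all, e.g. during
     a network partition; the MON then marks the OSD down on its
     own after 900 seconds.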
Thanks for the confirmation.
Eugen
Quoting Paul Emmerich <paul.emmerich@xxxxxxxx>:
Yes, your understanding is correct. But the main mechanism by which
OSDs are reported as down is that other OSDs report them as down with
a much stricter timeout (20 seconds? 30 seconds? something like that).
It's quite rare to hit the "mon osd report timeout" (the usual
scenario here is a network partition)
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Mon, Jan 14, 2019 at 10:17 AM Eugen Block <eblock@xxxxxx> wrote:
Hello list,
I noticed my last post was displayed as a reply to a different thread,
so I re-send my question, please excuse the noise.
There are two config options of mon/osd interaction that I don't fully
understand. Maybe one of you could clarify it for me.
> mon osd report timeout
> - The grace period in seconds before declaring unresponsive Ceph OSD
> Daemons down. Default 900
> mon osd down out interval
> - The number of seconds Ceph waits before marking a Ceph OSD Daemon
> down and out if it doesn’t respond. Default 600
I've seen the mon_osd_down_out_interval being hit plenty of times,
e.g. if I manually take an OSD down, it will be marked out after 10
minutes. But I can't quite remember seeing the 900-second timeout
happen. When exactly will the mon_osd_report_timeout kick in? Does
this mean that if for some reason one OSD is unresponsive the MON will
mark it down after 15 minutes, then wait another 10 minutes until it
is marked out so the recovery can start?
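
For reference, this is how I checked the currently effective values
on one of our MONs (via the admin socket; the daemon name of course
depends on your setup):

   ceph daemon mon.<id> config get mon_osd_report_timeout
   ceph daemon mon.<id> config get mon_osd_down_out_interval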
I'd appreciate any insight!
Regards,
Eugen
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com