Re: osd suicide timeout

Gregory Farnum <greg@xxxxxxxxxxx> · Mon, 13 Jul 2015 11:07:04 +0100

On Fri, Jul 10, 2015 at 10:45 PM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
> I have an osd log file from an osd that hit a suicide timeout (with the previous 10000 events logged).
> (On this node I have also seen this suicide timeout happen once before, and also a sync_entry timeout.)
>
> I can see that 6 minutes or so before that osd died, other osds on the same node were logging
> messages such as
>     heartbeat_check: no reply from osd.8
> so it appears that osd8 stopped responding quite some time before it died.

The OSD stops replying to heartbeats deliberately when its disk
threads look like they might be stuck.

>
> I'm wondering if there is enough information in the osd8 log file to deduce why osd 8 stopped responding?
> I don't know enough to figure it out myself.
>
> Is there any expert who would be willing to take a look at the log file?

The logs will have a backtrace in them. If you put that, plus the
hundred or so lines before it, in a pastebin and email the link to
the list, several people can give you a pretty good idea of what's
going on.
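
If it helps, here is a rough sketch of one way to pull that excerpt
out of a large log. It is plain Python, not a Ceph tool. The function
name, the default marker string "suicide timeout", and the file name
in the usage note are all assumptions for illustration; adjust the
marker to whatever text actually precedes the backtrace in your log.

```python
from collections import deque


def tail_before_marker(lines, marker="suicide timeout", context=100):
    """Return up to `context` lines before the first line containing
    `marker`, followed by that line and everything after it.

    `marker` is an assumed search string, not a guaranteed log format.
    Returns an empty list if the marker never appears.
    """
    recent = deque(maxlen=context)  # rolling window of preceding lines
    it = iter(lines)
    for line in it:
        if marker in line:
            # Keep the window, the matching line, and the rest of the
            # input (the backtrace is expected to follow the marker).
            return list(recent) + [line] + list(it)
        recent.append(line)
    return []
```

Usage would be something like
`tail_before_marker(open("ceph-osd.8.log"))` (hypothetical file
name), writing the result to a file you can paste. If the marker
shows up more than once, only the first occurrence is used as the
cut point.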

In general though, it's just going to be that the disk can't keep up
with the load being applied to it. That could be because it's failing,
or because you're pushing too much work on to it in some fashion.
-Greg