Re: capturing crash dumps

Piotr Dałek <piotr.dalek@xxxxxxxxxxxx> · Tue, 20 Feb 2018 09:15:50 +0100

On 18-02-19 09:22 PM, Sage Weil wrote:
One annoying problem is that when a ceph daemon crashes it will usually
not be noticed. Yes, we spit out a bunch of info to the daemon's log file,
but systemd will restart it automatically and once things come back up
there isn't any persistent notification that anything went wrong.  Unless
the admin is scraping log files or capturing core files we don't know
anything happened.

How can we fix that?

What if we update the segv crash handler to, in addition to dumping the
recent log and stack trace to the log file, also

  - write the same information to a standalone file, e.g.
     /var/lib/ceph/crashes/$type.$id/$timestamp

+1 if opt-in.
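
A rough sketch of the opt-in standalone file, in Python for brevity (the real
handler would live in the C++ signal path and would have to be signal-safe).
The function name, the "enabled" switch and the JSON layout are assumptions
for illustration, not anything that exists in the tree:

```python
import json
import os
import time


def write_crash_report(base_dir, daemon_type, daemon_id, backtrace,
                       recent_log, enabled=False):
    """Write one crash report to base_dir/$type.$id/$timestamp.

    Returns the path written, or None when the feature is not enabled.
    """
    if not enabled:  # opt-in: do nothing unless explicitly switched on
        return None

    crash_dir = os.path.join(base_dir, f"{daemon_type}.{daemon_id}")
    os.makedirs(crash_dir, exist_ok=True)

    # UTC timestamp without colons, so it is safe as a file name.
    timestamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    path = os.path.join(crash_dir, timestamp)

    report = {
        "daemon": f"{daemon_type}.{daemon_id}",
        "timestamp": timestamp,
        "backtrace": list(backtrace),
        "recent_log": list(recent_log),
    }

    # Write to a temporary name and rename, so a reader never sees a
    # half-written report.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(report, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)
    return path
```

With enabled left at its default the call is a no-op, which is the behaviour
I'd want out of the box.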

  - make the daemon check for previous crashes on startup, and report them
to the mgr

+1, many *desktop* apps already do a similar thing (check whether they
crashed, and report to the user if so).
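
The startup check could be as simple as the following sketch. It assumes one
file per crash in the daemon's crash directory and uses a hypothetical
".reported" marker file to remember what has already been sent to the mgr;
both helper names are made up for illustration:

```python
import os


def find_unreported_crashes(crash_dir):
    """Return sorted crash file names that have no '.reported' marker."""
    if not os.path.isdir(crash_dir):
        return []
    names = set(os.listdir(crash_dir))
    return sorted(
        name for name in names
        if not name.endswith(".reported")      # skip the markers themselves
        and not name.endswith(".tmp")          # skip half-written reports
        and name + ".reported" not in names    # skip already-reported crashes
    )


def mark_reported(crash_dir, name):
    """Create an empty marker so this crash is not reported again."""
    with open(os.path.join(crash_dir, name + ".reported"), "w"):
        pass
```

On startup the daemon would walk find_unreported_crashes(), hand each report
to the mgr, and call mark_reported() only after the mgr has acknowledged it.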

  - make the mgr keep some record of previous crashes (if not the full log,
just the timestamp so we know when it happened)
     - index/fingerprint by stack trace?

Better yet - if it crashes 3 (or more) times in a row at a similar point, 
block it from restarting to prevent flapping PGs in the cluster.
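
To make "at a similar point" concrete, here is a sketch that fingerprints a
stack trace and checks for repeated identical crashes. The backtrace line
format, the choice of SHA-1 and the helper names are assumptions; the
threshold of 3 is just the number from above:

```python
import hashlib
import re


def fingerprint(backtrace):
    """Hash a backtrace after stripping addresses, which differ between runs."""
    normalized = []
    for line in backtrace:
        line = re.sub(r"0x[0-9a-fA-F]+", "", line)  # drop hex addresses/offsets
        normalized.append(" ".join(line.split()))   # collapse whitespace
    return hashlib.sha1("\n".join(normalized).encode("utf-8")).hexdigest()


def should_block_restart(recent_fingerprints, threshold=3):
    """True if the last `threshold` crashes all share one fingerprint."""
    if len(recent_fingerprints) < threshold:
        return False
    tail = recent_fingerprints[-threshold:]
    return len(set(tail)) == 1
```

The same fingerprint would also serve as the index key on the mgr side, so
identical crashes from different daemons end up grouped together.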

  - surface a health warning for recent crashes?

I have mixed feelings about it, as this calls for clearing it manually. But 
then, if an OSD dies due to disk failure (ceph asserts on read/write), there 
will be one warning already, and it goes away once the disk is replaced.

  - make an opt-in mgr function that works similarly to python's sentry: post
the crash report to some central archive where developers will hear about
it.

-1, even if opt-in, because such stuff tends to be "opt-in" for a while, and 
then suddenly someone switches it on without anyone noticing. It doesn't 
need to be a developer, really.

--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovh.com/us/