capturing crash dumps

Sage Weil <sweil@xxxxxxxxxx> · Mon, 19 Feb 2018 20:22:41 +0000 (UTC)

One annoying problem is that when a ceph daemon crashes it will usually 
not be noticed. Yes, we spit out a bunch of info to the daemon's log file, 
but systemd will restart it automatically and once things come back up 
there isn't any persistent notification that anything went wrong.  Unless 
the admin is scraping log files or capturing core files, we don't know 
anything happened.

How can we fix that?

What if we update the segv crash handler to, in addition to dumping the 
recent log and stack trace to the log file, also

 - write the same information to a standalone file, e.g.
    /var/lib/ceph/crashes/$type.$id/$timestamp
 - make the daemon check for previous crashes on startup, and report them 
to the mgr
 - make the mgr keep some record of previous crashes (if not the full log, 
just the timestamp so we know when it happened)
    - index/fingerprint by stack trace?
 - surface a health warning for recent crashes?
 - make an opt-in mgr function that works similarly to python's sentry: post 
the crash report to some central archive where developers will hear about 
it.
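
To make the first few items concrete, here is a rough sketch in Python, 
purely for illustration. The real crash handler is C++ and has to be 
signal-safe, so this only shows the on-disk layout and the fingerprint 
idea. The JSON format, the field names, the ISO-style timestamp, and the 
function names (stack_fingerprint, write_crash_report, find_crash_reports) 
are all assumptions, not anything that exists in ceph today.

```python
import hashlib
import json
import os
import re
import time


def stack_fingerprint(frames):
    """Hash the stack frames with addresses/offsets stripped, so the same
    crash gets the same fingerprint across builds and load addresses."""
    cleaned = []
    for frame in frames:
        frame = re.sub(r"\+?0x[0-9a-fA-F]+", "", frame)  # drop 0x... and +0x...
        cleaned.append(re.sub(r"\s+", " ", frame).strip())
    return hashlib.sha256("\n".join(cleaned).encode("utf-8")).hexdigest()


def write_crash_report(base_dir, daemon_type, daemon_id, frames, recent_log,
                       now=None):
    """Write one report to <base_dir>/<type>.<id>/<timestamp>, return its path."""
    ts = time.time() if now is None else now
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(ts))
    crash_dir = os.path.join(base_dir, "%s.%s" % (daemon_type, daemon_id))
    os.makedirs(crash_dir, exist_ok=True)
    report = {
        "timestamp": stamp,
        "entity": "%s.%s" % (daemon_type, daemon_id),
        "fingerprint": stack_fingerprint(frames),
        "stack": list(frames),
        "recent_log": list(recent_log),
    }
    path = os.path.join(crash_dir, stamp)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(report, f)
    os.replace(tmp, path)  # rename so readers never see a partial report
    return path


def find_crash_reports(base_dir, daemon_type, daemon_id):
    """On startup: load any reports left behind by earlier crashes."""
    crash_dir = os.path.join(base_dir, "%s.%s" % (daemon_type, daemon_id))
    if not os.path.isdir(crash_dir):
        return []
    reports = []
    for name in sorted(os.listdir(crash_dir)):
        if name.endswith(".tmp"):
            continue  # skip a write that was interrupted
        with open(os.path.join(crash_dir, name)) as f:
            reports.append(json.load(f))
    return reports
```

A daemon would call something like find_crash_reports() at startup and 
forward whatever it finds to the mgr, which could then group reports by 
the fingerprint field.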

Things to watch out for:
 - make sure the crash reports don't fill up the disk (only keep the last 
few around?)
 - make any phone home opt-in (obviously)
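
For the disk-usage concern, one simple option is to keep only the newest 
N reports per daemon. A sketch, again in Python for illustration only: 
prune_crash_reports and the default of 10 are made up, and it assumes the 
file names are timestamps that sort lexicographically in time order 
(e.g. ISO 8601 style).

```python
import os


def prune_crash_reports(crash_dir, keep=10):
    """Delete all but the newest `keep` reports in crash_dir.

    Assumes file names sort oldest-to-newest. Returns the removed names.
    """
    names = sorted(
        n for n in os.listdir(crash_dir)
        if os.path.isfile(os.path.join(crash_dir, n))
    )
    stale = names[:-keep] if keep > 0 else names
    for name in stale:
        os.remove(os.path.join(crash_dir, name))
    return stale
```

This could run right after a new report is written, or at daemon startup 
before the old reports are scanned.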

Thoughts?
sage