One annoying problem is that when a ceph daemon crashes it usually goes
unnoticed. Yes, we spit out a bunch of info to the daemon's log file, but
systemd will restart it automatically, and once things come back up there
isn't any persistent notification that anything went wrong. Unless the
admin is scraping log files or capturing core files, we don't know anything
happened.

How can we fix that? What if we update the segv crash handler to, in
addition to dumping the recent log and stack trace to the log file, also

- write the same information to a standalone file, e.g.
  /var/lib/ceph/crashes/$type.$id/$timestamp
- make the daemon check for previous crashes on startup, and report them
  to the mgr
- make the mgr keep some record of previous crashes (if not the full log,
  just the timestamp so we know when it happened)
- index/fingerprint by stack trace?
- surface a health warning for recent crashes?
- make an opt-in mgr function that works similarly to python's sentry: post
  the crash report to some central archive where developers will hear about
  it.

Things to watch out for:

- make sure the crash reports don't fill up the disk (only keep the last
  few around?)
- make any phone home opt-in (obviously)

Thoughts?
sage
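
To make the on-disk side of this concrete, here is a rough Python sketch of the per-daemon crash directory, the "only keep the last few" pruning, and a stack-trace fingerprint. It is only an illustration under assumptions: the function names (fingerprint, save_crash, load_crashes), the JSON report format, the keep=5 default, and the use of SHA-1 are all made up for this example. The real segv handler lives in the daemon's C++ code and would have its own constraints on what it can safely do from a signal handler.

```python
import hashlib
import json
import os
import re
import time

# Matches raw addresses and offsets such as "+0x1a2" or "0x7f00deadbeef".
HEX_RE = re.compile(r"\+?0x[0-9a-fA-F]+")


def fingerprint(frames):
    """Hash a stack trace with addresses stripped (hypothetical scheme).

    Raw addresses differ between runs and builds, so only the remaining
    text (mostly function names) contributes to the hash.
    """
    cleaned = [HEX_RE.sub("", frame).strip() for frame in frames]
    return hashlib.sha1("\n".join(cleaned).encode("utf-8")).hexdigest()


def _report_names(crash_dir):
    """Return report file names in crash_dir, oldest first."""
    names = [n for n in os.listdir(crash_dir) if n.endswith(".json")]
    return sorted(names, key=lambda n: float(n[: -len(".json")]))


def save_crash(base_dir, daemon, frames, recent_log, keep=5, now=None):
    """Write one report to <base_dir>/<daemon>/<timestamp>.json and prune.

    `daemon` is the "$type.$id" string, e.g. "osd.0". Only the newest
    `keep` reports are left on disk so crashes cannot fill it up.
    """
    ts = time.time() if now is None else now
    crash_dir = os.path.join(base_dir, daemon)
    os.makedirs(crash_dir, exist_ok=True)

    report = {
        "daemon": daemon,
        "timestamp": ts,
        "fingerprint": fingerprint(frames),
        "stack": frames,
        "recent_log": recent_log,
    }
    path = os.path.join(crash_dir, "%.6f.json" % ts)
    with open(path, "w") as f:
        json.dump(report, f)

    # Delete the oldest reports beyond the retention limit.
    names = _report_names(crash_dir)
    excess = max(len(names) - keep, 0)
    for old in names[:excess]:
        os.remove(os.path.join(crash_dir, old))
    return path


def load_crashes(base_dir, daemon):
    """Return saved reports, oldest first; what a daemon might send to
    the mgr when it starts up."""
    crash_dir = os.path.join(base_dir, daemon)
    if not os.path.isdir(crash_dir):
        return []
    reports = []
    for name in _report_names(crash_dir):
        with open(os.path.join(crash_dir, name)) as f:
            reports.append(json.load(f))
    return reports
```

With something like this, the mgr could group reports by the fingerprint field to see that several daemons hit the same crash, and raise a health warning whenever load_crashes returns a report with a recent timestamp.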