Hi all,

Recently I've heard from a few different people about the need for nicer alerting in Ceph, both for GUIs and for emitting alerts externally (e.g. over SNMP). I'm keen to make sure we get the right common bits in, so that modules don't each have to do their own thing too much.

Points that have come up recently:
- How to integrate Ceph health checks with alerts generated in Prometheus?
- Filtering/muting particular health checks
- Customizing severity of individual alert types (e.g. downgrading something users don't care about from ERROR to WARN)
- Different severity depending on the number of elements affected (e.g. one OSD down is WARN, more is ERROR)
- Identifying impacted objects (osds, pools) in the health check's metadata
- Making recent history of health checks machine readable (in addition to appearing in the log)

In my view, some of these are obviously useful, especially adding metadata to health checks so that they can be usefully hyperlinked in the UI, and keeping that short history so that the UI can show active and recent alerts in the same way without forcing the user to the log.

Some of the finer-grained configuration stuff is more debatable: in practice, I expect that the "page me in the night" piece is usually going to be done by an external monitoring system, and the filtering on "don't page me for X" could happen there.

Anyway, just bringing this up on the list to see what other opinions are out there and whether there are more considerations we should add to that list.

Cheers,
John
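
As a rough illustration of the points listed above, here is a minimal Python sketch of how muting, per-check severity overrides, count-dependent severity, affected-object metadata and a short machine-readable history might fit together. Every name in it (HealthCheck, AlertPolicy, HealthHistory and their fields) is hypothetical and invented for discussion; none of this is an existing Ceph or ceph-mgr interface, and the health check codes are used only as examples.

```python
import json
from collections import deque
from dataclasses import dataclass, field, replace

# Hypothetical sketch for discussion only: these classes are NOT part of Ceph.


@dataclass
class HealthCheck:
    code: str        # example code such as "OSD_DOWN"
    severity: str    # "WARN" or "ERROR", as in the list above
    summary: str
    # Impacted objects by type, e.g. {"osd": [3, 7], "pool": ["rbd"]}
    affected: dict = field(default_factory=dict)


@dataclass
class AlertPolicy:
    muted: set = field(default_factory=set)                # codes to drop
    severity_override: dict = field(default_factory=dict)  # code -> severity
    # code -> (threshold, severity): use `severity` once more than
    # `threshold` objects are affected.
    escalate_above: dict = field(default_factory=dict)

    def apply(self, check):
        """Return an adjusted copy of `check`, or None if it is muted."""
        if check.code in self.muted:
            return None
        severity = self.severity_override.get(check.code, check.severity)
        rule = self.escalate_above.get(check.code)
        if rule is not None:
            threshold, escalated = rule
            count = sum(len(ids) for ids in check.affected.values())
            if count > threshold:
                severity = escalated
        return replace(check, severity=severity)


class HealthHistory:
    """Bounded, machine-readable record of recent health checks."""

    def __init__(self, maxlen=100):
        self._events = deque(maxlen=maxlen)  # oldest entries fall off

    def record(self, check, timestamp):
        self._events.append({
            "time": timestamp,
            "code": check.code,
            "severity": check.severity,
            "summary": check.summary,
            "affected": check.affected,
        })

    def recent(self):
        return list(self._events)

    def to_json(self):
        return json.dumps(self.recent())
```

With a policy such as AlertPolicy(escalate_above={"OSD_DOWN": (1, "ERROR")}), a check listing one OSD keeps its WARN severity while one listing two is raised to ERROR. Keeping the policy separate from the check itself is one possible design; the same filtering could equally be left to an external monitoring system.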