Improving alerting/health checks

John Spray <jspray@xxxxxxxxxx> · Mon, 25 Jun 2018 11:55:55 +0100

Hi all,

Recently I've heard from a few different people about the need for
nicer alerting in Ceph, both for GUIs and for emitting alerts
externally (e.g. over SNMP).  I'm keen to make sure we get the right
common pieces in, so that individual modules don't each have to
implement their own handling.

Points that have come up recently:
 - How to integrate Ceph health checks with alerts generated in Prometheus?
 - Filtering/muting particular health checks
 - Customizing severity of individual alert types (e.g. downgrade
something the user doesn't care about from ERROR to WARN)
 - Different severity depending on number of elements affected (e.g.
one OSD down is WARN, more is ERROR)
 - Identifying impacted objects (osds, pools) in the health check's metadata
 - Making recent history of health checks machine readable (in
addition to appearing in the log)
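
To make a few of these points concrete, here is a rough Python sketch of
a health check that carries impacted-object metadata, plus a policy
layer for muting, severity overrides and count-based escalation.  All of
the names here (HealthCheck, AlertPolicy, escalate_above, the check
codes used in the example) are hypothetical and for illustration only;
this is not an existing Ceph interface.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional, Set


@dataclass
class HealthCheck:
    """Hypothetical health check record with impacted-object metadata."""
    code: str                      # e.g. "OSD_DOWN"
    severity: str                  # "WARN" or "ERROR"
    summary: str
    # object type -> identifiers, e.g. {"osd": ["osd.3"]}
    impacted: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class AlertPolicy:
    """Hypothetical user-configurable policy applied to raw checks."""
    muted: Set[str] = field(default_factory=set)
    severity_override: Dict[str, str] = field(default_factory=dict)
    # code -> count of impacted objects above which the check is ERROR
    escalate_above: Dict[str, int] = field(default_factory=dict)

    def apply(self, check: HealthCheck) -> Optional[HealthCheck]:
        # Muted checks are dropped entirely.
        if check.code in self.muted:
            return None
        # A per-code override replaces the built-in severity.
        severity = self.severity_override.get(check.code, check.severity)
        # Escalation is applied last, so it wins over an override.
        limit = self.escalate_above.get(check.code)
        count = sum(len(ids) for ids in check.impacted.values())
        if limit is not None and count > limit:
            severity = "ERROR"
        return replace(check, severity=severity)
```

For example, with escalate_above={"OSD_DOWN": 1}, a check listing one
OSD stays at WARN and a check listing two becomes ERROR.  Whether
escalation should take precedence over a user override is itself a
design question; the ordering above is just one choice.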

In my view, some of these are obviously useful, especially adding
metadata to health checks so that they can be usefully hyperlinked in
the UI, and keeping a short history so that the UI can show active
and recent alerts in the same way without sending the user to the log.
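
As a rough illustration of the "short history" idea, the sketch below
keeps a bounded list of raise/clear events alongside the currently
active checks and dumps both as JSON.  The class name, the event
fields and the 100-event default are assumptions made for the example,
not anything that exists in Ceph today.

```python
import json
import time
from collections import deque


class HealthHistory:
    """Hypothetical bounded history of health check transitions."""

    def __init__(self, max_events=100):
        # Oldest events fall off automatically once the deque is full.
        self._events = deque(maxlen=max_events)
        self._active = {}  # code -> severity

    def update(self, current, now=None):
        """Record changes given the current {code: severity} mapping."""
        now = time.time() if now is None else now
        # New checks, or checks whose severity changed, are "raised".
        for code, severity in sorted(current.items()):
            if self._active.get(code) != severity:
                self._events.append({"time": now, "code": code,
                                     "event": "raised",
                                     "severity": severity})
        # Checks that were active but are now absent are "cleared".
        for code in sorted(self._active.keys() - current.keys()):
            self._events.append({"time": now, "code": code,
                                 "event": "cleared",
                                 "severity": self._active[code]})
        self._active = dict(current)

    def to_json(self):
        # One document with both views, so a UI can render active and
        # recent alerts from the same source.
        return json.dumps({"active": self._active,
                           "recent": list(self._events)},
                          sort_keys=True)
```

A UI or an external exporter could poll to_json() instead of parsing
the cluster log.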

Some of the finer-grained configuration stuff is more debatable -- in
practice, I expect that the "page me in the night" piece is usually
going to be done by an external monitoring system, and the filtering
on "don't page me for X" could happen there.

Anyway -- just bringing this up on the list to see what other opinions
are out there and if there are more considerations we should add to
that list.

Cheers,
John