Hi community,
the topic of alerting conditions for a Ceph cluster comes up in various
contexts. Some folks use Prometheus or Grafana, some people (I believe) would
like SNMP traps from Ceph, the mgr dashboard could provide basic alerting
capabilities, and there is of course ceph -s.
Also see "Improving alerting/health checks" on ceph-devel.
While working on some Prometheus stuff, I thought it would be nice to have some
basic alerting rules in the ceph repo. These could serve as an out-of-the-box
default as well as an example or best practice for which conditions should be
watched.
So I'm wondering: what does the community think? What do operators use as alert
conditions or find alert-worthy?
I'm aware that this is very open-ended, highly dependent on the cluster and its
workload, and can range from the obvious (HEALTH_ERR anyone?) to intricate
conditions designed for one particular cluster. I'm wondering if we can distill
some non-trivial alert conditions that Ceph itself does not (yet) provide.
If you have any conditions fitting that description, feel free to add them to
https://pad.ceph.com/p/alert-conditions. Otherwise, I'm looking forward to
feedback.
jan
--
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)