alert conditions

Jan Fajerski <jfajerski@xxxxxxxx> · Mon, 23 Jul 2018 18:10:04 +0200

Hi community,
the topic of alerting conditions for a Ceph cluster comes up in various 
contexts. Some folks use Prometheus or Grafana, some people (I believe) would 
like SNMP traps from Ceph, the mgr dashboard could provide basic alerting 
capabilities, and there is of course ceph -s.
Also see "Improving alerting/health checks" on ceph-devel.

While working on some Prometheus stuff, I thought it would be nice to have some 
basic alerting rules in the Ceph repo. These could serve as an out-of-the-box 
default as well as an example or best practice for which conditions should be 
watched.

So I'm wondering: what does the community think? What do operators use as alert 
conditions or find alert-worthy?
I'm aware that this is very open-ended, highly dependent on the cluster and its 
workload and can range from obvious (health_err anyone?) to intricate conditions 
that are designed for a certain cluster. I'm wondering if we can distill some 
non-trivial alert conditions that ceph itself does not (yet) provide.
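
To make the "obvious" end of that range concrete, here is a minimal sketch in 
Python of a health-state check. It assumes the JSON printed by 
ceph status --format json carries the state under health.status (with 
overall_status as a fallback for older releases); the function name 
health_alerts and the message wording are purely illustrative and not part of 
Ceph.

```python
import json


def health_alerts(status_json):
    """Return a list of alert messages for a `ceph status --format json` document.

    Assumption: the health state is found under health.status, or under
    health.overall_status on older releases.
    """
    status = json.loads(status_json)
    health = status.get("health", {})
    state = health.get("status") or health.get("overall_status", "UNKNOWN")

    alerts = []
    if state == "HEALTH_ERR":
        alerts.append("critical: cluster health is HEALTH_ERR")
    elif state == "HEALTH_WARN":
        alerts.append("warning: cluster health is HEALTH_WARN")
    elif state != "HEALTH_OK":
        # Missing or unrecognised state: worth flagging rather than ignoring.
        alerts.append("unknown: cluster health state is %s" % state)
    return alerts


if __name__ == "__main__":
    # Hand-written sample document, not output captured from a real cluster.
    sample = '{"health": {"status": "HEALTH_WARN"}}'
    for message in health_alerts(sample):
        print(message)
```

Anything beyond this (trends, per-OSD conditions, capacity forecasts) is where 
the less obvious rules would come in.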

If you have any conditions fitting that description, feel free to add them to 
https://pad.ceph.com/p/alert-conditions. Otherwise looking forward to feedback.

jan

--
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)