Re: alert conditions

Jan Fajerski <jfajerski@xxxxxxxx> · Tue, 21 Aug 2018 14:06:15 +0200

Fwiw I added a few things to https://pad.ceph.com/p/alert-conditions and will 
circulate this mail a bit wider.
Or maybe there is not all that much interest in alerting...

On Mon, Jul 23, 2018 at 06:10:04PM +0200, Jan Fajerski wrote:
Hi community,
the topic of alerting conditions for a ceph cluster comes up in
various contexts. Some folks use prometheus or grafana, (I believe)
some people would like snmp traps from ceph, the mgr dashboard could
provide basic alerting capabilities, and there is of course ceph -s.
Also see "Improving alerting/health checks" on ceph-devel.

Working on some prometheus stuff, I think it would be nice to have some
basic alerting rules in the ceph repo. These could serve as an
out-of-the-box default as well as an example or best practice for which
conditions should be watched.

So I'm wondering: what does the community think? What do operators use
as alert conditions or find alert-worthy?
I'm aware that this is very open-ended, highly dependent on the
cluster and its workload, and can range from the obvious (health_err
anyone?) to intricate conditions that are designed for a certain
cluster. I'm wondering if we can distill some non-trivial alert
conditions that ceph itself does not (yet) provide.

If you have any conditions fitting that description, feel free to add 
them to https://pad.ceph.com/p/alert-conditions. Otherwise looking 
forward to feedback.

jan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)